[patch 11/13] signal/timer/event fds v10 - eventfd wire up i386 arch ...

2007-04-02 Thread Davide Libenzi
This patch wire the eventfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide


Index: linux-2.6.21-rc5.fds/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc5.fds.orig/arch/i386/kernel/syscall_table.S  2007-04-02 
15:06:39.0 -0700
+++ linux-2.6.21-rc5.fds/arch/i386/kernel/syscall_table.S   2007-04-02 
15:06:44.0 -0700
@@ -321,3 +321,4 @@
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
.long sys_timerfd
+   .long sys_eventfd
Index: linux-2.6.21-rc5.fds/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc5.fds.orig/include/asm-i386/unistd.h 2007-04-02 
15:06:39.0 -0700
+++ linux-2.6.21-rc5.fds/include/asm-i386/unistd.h  2007-04-02 
15:06:44.0 -0700
@@ -327,10 +327,11 @@
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
 #define __NR_timerfd   321
+#define __NR_eventfd   322
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 322
+#define NR_syscalls 323
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/13] signal/timer/event fds v10 - signalfd core ...

2007-04-02 Thread Davide Libenzi

ChangeLog:

v10 - Renamed from aino to anon_inode

--
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how
badly it can be broken :), and I added even more breakage ;)
Signals are fetched from the same signal queue used by the process,
so signalfd will compete with standard kernel delivery in dequeue_signal().
If you want to reliably fetch signals on the signalfd file, you need to
block them with sigprocmask(SIG_BLOCK).
This seems to be working fine on my Dual Opteron machine. I made a quick 
test program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file 
descriptor receiver. The signalfd file descriptor if created with the 
following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The ufd parameter allows to change an existing signalfd sigmask, w/out 
going to close/create cycle (Linus idea). Use ufd == -1 if you want a 
brand new signalfd file.
The mask allows to specify the signal mask of signals that we are 
interested in. The masksize parameter is the size of mask.
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a struct signalfd_siginfo structure
in the userspace supplied buffer. The return value is the number of bytes
copied in the supplied buffer, or -1 in case of error. The read(2) call
can also return 0, in case the sighand structure to which the signalfd
was attached, has been orphaned. The O_NONBLOCK flag is also supported, and
read(2) will return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than
sizeof(struct signalfd_siginfo), -EINVAL is returned. A read from the
signalfd can also return -ERESTARTSYS in case a signal hits the process.
The format of the struct signalfd_siginfo is, and the valid fields depends
of the (-code  __SI_MASK) value, in the same way a struct siginfo would:

struct signalfd_siginfo {
__u32 signo;/* si_signo */
__s32 err;  /* si_errno */
__s32 code; /* si_code */
__u32 pid;  /* si_pid */
__u32 uid;  /* si_uid */
__s32 fd;   /* si_fd */
__u32 tid;  /* si_fd */
__u32 band; /* si_band */
__u32 overrun;  /* si_overrun */
__u32 trapno;   /* si_trapno */
__s32 status;   /* si_status */
__s32 svint;/* si_int */
__u64 svptr;/* si_ptr */
__u64 utime;/* si_utime */
__u64 stime;/* si_stime */
__u64 addr; /* si_addr */
};



Signed-off-by: Davide Libenzi davidel@xmailserver.org



- Davide



Index: linux-2.6.21-rc5.fds/fs/signalfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5.fds/fs/signalfd.c  2007-04-02 15:06:29.0 -0700
@@ -0,0 +1,353 @@
+/*
+ *  fs/signalfd.c
+ *
+ *  Copyright (C) 2003  Linus Torvalds
+ *
+ *  Mon Mar 5, 2007: Davide Libenzi davidel@xmailserver.org
+ *  Changed -read() to return a siginfo strcture instead of signal number.
+ *  Fixed locking in -poll().
+ *  Added sighand-detach notification.
+ *  Added fd re-use in sys_signalfd() syscall.
+ *  Now using anonymous inode source.
+ *  Thanks to Oleg Nesterov for useful code review and suggestions.
+ *  More comments and suggestions from Arnd Bergmann.
+ */
+
+#include linux/file.h
+#include linux/poll.h
+#include linux/slab.h
+#include linux/init.h
+#include linux/fs.h
+#include linux/mount.h
+#include linux/module.h
+#include linux/kernel.h
+#include linux/signal.h
+#include linux/list.h
+#include linux/anon_inodes.h
+#include linux/signalfd.h
+
+#include asm/uaccess.h
+
+struct signalfd_ctx {
+   struct list_head lnk;
+   wait_queue_head_t wqh;
+   sigset_t sigmask;
+   struct task_struct *tsk;
+};
+
+struct signalfd_lockctx {
+   struct task_struct *tsk;
+   unsigned long flags;
+};
+
+/*
+ * Tries to acquire the sighand lock. We do not increment the sighand
+ * use count, and we do not even pin the task struct, so we need to
+ * do it inside an RCU read lock, and we must be prepared for the
+ * ctx-tsk going to NULL (in signalfd_deliver()), and for the sighand
+ * being detached. We return 0 if the sighand has been detached, or
+ * 1 if we were able to pin the sighand lock.
+ */
+static int signalfd_lock(struct signalfd_ctx *ctx, struct signalfd_lockctx *lk)
+{
+   struct sighand_struct *sighand = NULL;
+
+   rcu_read_lock();
+   lk-tsk = rcu_dereference(ctx-tsk);
+   if (likely(lk-tsk != NULL))
+   sighand = lock_task_sighand(lk-tsk, lk-flags);
+   rcu_read_unlock();
+
+   if (sighand  !ctx-tsk

Re: getting processor numbers

2007-04-03 Thread Davide Libenzi
On Tue, 3 Apr 2007, Ulrich Drepper wrote:

 More and more code depends on knowing the number of processors in the
 system to efficiently scale the code.  E.g., in OpenMP it is used by
 default to determine how many threads to create.  Creating more threads
 than there are processors/cores doesn't make sense.
 
 glibc for a long time provides functionality to retrieve the number
 through sysconf() and this is what fortunately most programs use.  The
 problem is that we are currently using /proc/cpuinfo since this is all
 there was available at that time.  Creating /proc/cpuinfo takes the
 kernel quite a long time, unfortunately (I think Jakub said it is mainly
 the interrupt information).
 
 The alternative today is to use /sys/devices/system/cpu and count the
 number of cpu* directories in it.  This is somewhat faster.  But there
 would be another possibility: simply stat /sys/devices/system/cpu and
 use st_nlink - 2.
 
 This last step unfortunately it made impossible by recent changes:
 
   http://article.gmane.org/gmane.linux.kernel/413178
 
 I would like to propose changing that patch, move the sched_*
 pseudo-files in some other directly and permanently ban putting any new
 file into /sys/devices/system/cpu.
 
 To get some numbers, you can try
 
   http://people.redhat.com/drepper/nproc-timing.c
 
 The numbers I see on x86-64:
 
 cpuinfo 10145810 cycles for 100 accesses
 readdir /sys 3113870 cycles for 100 accesses
 stat /sys 741070 cycles for 100 accesses

It sucks when seen from a micro-bench POV, but does it really matter 
overall? The vast majority of software usually calls 
sysconf(_SC_NPROCESSORS_*) with very little frequency (mostly once at 
initialization time) anyway. That's what 50us / call?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-rc5-mm4

2007-04-03 Thread Davide Libenzi
On Tue, 3 Apr 2007, Randy Dunlap wrote:

 On Mon, 2 Apr 2007 22:47:45 -0700 Andrew Morton wrote:
 
  
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/
  
  - The oops in git-net.patch has been fixed, so that tree has been restored. 
It is huge.
  
  - Added the device-mapper development tree to the -mm lineup (Alasdair
Kergon).  It is a quilt tree, living at
ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/.
  
  - Added davidel's signalfd stuff.
   ~~
 
 with those new CONFIGs=n:  (we are discussing it)
 
 arch/x86_64/kernel/built-in.o:(.rodata+0x18a8): undefined reference to 
 `sys_signal
 fd'
 arch/x86_64/kernel/built-in.o:(.rodata+0x18b0): undefined reference to 
 `sys_timerf
 d'
 arch/x86_64/kernel/built-in.o:(.rodata+0x18b8): undefined reference to 
 `sys_eventf
 d'
 arch/x86_64/ia32/ia32entry.S:(.rodata+0xa08): undefined reference to 
 `sys_signalfd
 '
 arch/x86_64/ia32/ia32entry.S:(.rodata+0xa10): undefined reference to 
 `sys_timerfd'
 arch/x86_64/ia32/ia32entry.S:(.rodata+0xa18): undefined reference to 
 `sys_eventfd'
 make: *** [.tmp_vmlinux1] Error 1

I'll be sending Andrew patches against -mm4. The ones that fix the above, 
and the one for the include-files diet.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 5/5] new fds fixes for 2.6.21-rc5.mm4 - eventfd include files diet ...

2007-04-03 Thread Davide Libenzi
Removes a few unneeded include files from the eventfd code.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/eventfd.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/eventfd.c  2007-04-03 13:17:25.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/eventfd.c   2007-04-03 13:17:39.0 -0700
@@ -7,20 +7,15 @@
 
 #include linux/file.h
 #include linux/poll.h
-#include linux/slab.h
 #include linux/init.h
 #include linux/fs.h
-#include linux/mount.h
-#include linux/module.h
+#include linux/sched.h
 #include linux/kernel.h
 #include linux/list.h
 #include linux/spinlock.h
-#include linux/jiffies.h
 #include linux/anon_inodes.h
 #include linux/eventfd.h
 
-#include asm/uaccess.h
-
 struct eventfd_ctx {
spinlock_t lock;
wait_queue_head_t wqh;

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/5] new fds fixes for 2.6.21-rc5.mm4 - fix a spelling error ...

2007-04-03 Thread Davide Libenzi
Fixes a spelling error inside init/Kconfig.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide


Index: linux-2.6.21-rc5.mm4/init/Kconfig
===
--- linux-2.6.21-rc5.mm4.orig/init/Kconfig  2007-04-03 13:17:25.0 
-0700
+++ linux-2.6.21-rc5.mm4/init/Kconfig   2007-04-03 13:17:35.0 -0700
@@ -479,7 +479,7 @@
  run glibc-based applications correctly.
 
 config ANON_INODES
-   bool Enable anynimous inode source if EMBEDDED
+   bool Enable anonymous inode source if EMBEDDED
default y
help
  Anonymous inode source for pseudo-files like epoll, signalfd,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/5] new fds fixes for 2.6.21-rc5.mm4 - timerfd include files diet ...

2007-04-03 Thread Davide Libenzi
Removes a few unneeded include files from the timerfd code.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/timerfd.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/timerfd.c  2007-04-03 13:17:25.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/timerfd.c   2007-04-03 13:17:38.0 -0700
@@ -10,23 +10,17 @@
 
 #include linux/file.h
 #include linux/poll.h
-#include linux/slab.h
 #include linux/init.h
 #include linux/fs.h
-#include linux/mount.h
-#include linux/module.h
+#include linux/sched.h
 #include linux/kernel.h
-#include linux/signal.h
 #include linux/list.h
 #include linux/spinlock.h
 #include linux/time.h
 #include linux/hrtimer.h
-#include linux/jiffies.h
 #include linux/anon_inodes.h
 #include linux/timerfd.h
 
-#include asm/uaccess.h
-
 struct timerfd_ctx {
struct hrtimer tmr;
ktime_t tintv;

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/5] new fds fixes for 2.6.21-rc5.mm4 - fix a build error on x86_64 when the new CONFIG_* are not set ...

2007-04-03 Thread Davide Libenzi
Fixes a build error on x86_64 that happens when the new CONFIG_* options
for the new fds are not set



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/kernel/sys_ni.c
===
--- linux-2.6.21-rc5.mm4.orig/kernel/sys_ni.c   2007-04-03 13:17:25.0 
-0700
+++ linux-2.6.21-rc5.mm4/kernel/sys_ni.c2007-04-03 13:17:30.0 
-0700
@@ -146,3 +146,9 @@
 cond_syscall(sys_bdflush);
 cond_syscall(sys_ioprio_set);
 cond_syscall(sys_ioprio_get);
+
+/* New file descriptors */
+cond_syscall(sys_signalfd);
+cond_syscall(sys_timerfd);
+cond_syscall(sys_eventfd);
+

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/5] new fds fixes for 2.6.21-rc5.mm4 - signalfd include files diet ...

2007-04-03 Thread Davide Libenzi
Removes a few unneeded include files from the signalfd code.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/signalfd.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/signalfd.c 2007-04-03 13:17:25.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/signalfd.c  2007-04-03 13:17:36.0 -0700
@@ -15,19 +15,15 @@
 
 #include linux/file.h
 #include linux/poll.h
-#include linux/slab.h
 #include linux/init.h
 #include linux/fs.h
-#include linux/mount.h
-#include linux/module.h
+#include linux/sched.h
 #include linux/kernel.h
 #include linux/signal.h
 #include linux/list.h
 #include linux/anon_inodes.h
 #include linux/signalfd.h
 
-#include asm/uaccess.h
-
 struct signalfd_ctx {
struct list_head lnk;
wait_queue_head_t wqh;

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/3] epoll cleanups - epoll include diet ...

2007-04-03 Thread Davide Libenzi
Remove some unneeded include files from epoll code.


Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 17:59:54.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:30.0 -0700
@@ -1,6 +1,6 @@
 /*
- *  fs/eventpoll.c ( Efficent event polling implementation )
- *  Copyright (C) 2001,...,2006 Davide Libenzi
+ *  fs/eventpoll.c (Efficent event notification implementation)
+ *  Copyright (C) 2001,...,2007 Davide Libenzi
  *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
@@ -17,30 +17,21 @@
 #include linux/sched.h
 #include linux/fs.h
 #include linux/file.h
-#include linux/signal.h
 #include linux/errno.h
-#include linux/mm.h
 #include linux/slab.h
 #include linux/poll.h
 #include linux/string.h
 #include linux/list.h
 #include linux/hash.h
 #include linux/spinlock.h
-#include linux/syscalls.h
 #include linux/rwsem.h
 #include linux/rbtree.h
 #include linux/wait.h
 #include linux/eventpoll.h
-#include linux/mount.h
-#include linux/bitops.h
 #include linux/mutex.h
 #include linux/anon_inodes.h
 #include asm/uaccess.h
-#include asm/system.h
-#include asm/io.h
-#include asm/mman.h
 #include asm/atomic.h
-#include asm/semaphore.h
 
 
 /*

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/3] epoll cleanups - epoll no module ...

2007-04-03 Thread Davide Libenzi
Epoll is either compiled it, or not (if EMBEDDED). Remove the module code
and use fs_initcall().



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 18:33:30.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:40.0 -0700
@@ -11,7 +11,6 @@
  *
  */
 
-#include linux/module.h
 #include linux/init.h
 #include linux/kernel.h
 #include linux/sched.h
@@ -1462,16 +1461,5 @@
 
return 0;
 }
+fs_initcall(eventpoll_init);
 
-
-static void __exit eventpoll_exit(void)
-{
-   /* Undo all operations done inside eventpoll_init() */
-   kmem_cache_destroy(pwq_cache);
-   kmem_cache_destroy(epi_cache);
-}
-
-module_init(eventpoll_init);
-module_exit(eventpoll_exit);
-
-MODULE_LICENSE(GPL);

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/3] epoll cleanups - epoll remove static pre-declarations and akpm-ize the code ...

2007-04-03 Thread Davide Libenzi
Re-arrange epoll code to avoid static functions pre-declarations, and apply
akpm-filter on it.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c
===
--- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 18:33:40.0 
-0700
+++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:42.0 -0700
@@ -32,7 +32,6 @@
 #include asm/uaccess.h
 #include asm/atomic.h
 
-
 /*
  * LOCKING:
  * There are three level of locking required by epoll :
@@ -65,7 +64,6 @@
  * a greater scalability.
  */
 
-
 #define DEBUG_EPOLL 0
 
 #if DEBUG_EPOLL  0
@@ -95,7 +93,6 @@
 
 #define EP_MAX_EVENTS (INT_MAX / sizeof(struct epoll_event))
 
-
 struct epoll_filefd {
struct file *file;
int fd;
@@ -213,36 +210,6 @@
struct epitem *epi;
 };
 
-
-
-static void ep_poll_safewake_init(struct poll_safewake *psw);
-static void ep_poll_safewake(struct poll_safewake *psw, wait_queue_head_t *wq);
-static int ep_alloc(struct eventpoll **pep);
-static void ep_free(struct eventpoll *ep);
-static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd);
-static void ep_use_epitem(struct epitem *epi);
-static void ep_release_epitem(struct epitem *epi);
-static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
-poll_table *pt);
-static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi);
-static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
-struct file *tfile, int fd);
-static int ep_modify(struct eventpoll *ep, struct epitem *epi,
-struct epoll_event *event);
-static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi);
-static int ep_unlink(struct eventpoll *ep, struct epitem *epi);
-static int ep_remove(struct eventpoll *ep, struct epitem *epi);
-static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void 
*key);
-static int ep_eventpoll_close(struct inode *inode, struct file *file);
-static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait);
-static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
- struct epoll_event __user *events, int maxevents);
-static int ep_events_transfer(struct eventpoll *ep,
- struct epoll_event __user *events,
- int maxevents);
-static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-  int maxevents, long timeout);
-
 /*
  * This semaphore is used to serialize ep_free() and eventpoll_release_file().
  */
@@ -257,19 +224,6 @@
 /* Slab cache used to allocate struct eppoll_entry */
 static struct kmem_cache *pwq_cache __read_mostly;
 
-/* File callbacks that implement the eventpoll file behaviour */
-static const struct file_operations eventpoll_fops = {
-   .release= ep_eventpoll_close,
-   .poll   = ep_eventpoll_poll
-};
-
-
-
-/* Fast test to see if the file is an evenpoll file */
-static inline int is_file_epoll(struct file *f)
-{
-   return f-f_op == eventpoll_fops;
-}
 
 /* Setup the structure that is used as key for the rb-tree */
 static inline void ep_set_ffd(struct epoll_filefd *ffd,
@@ -338,7 +292,6 @@
spin_lock_init(psw-lock);
 }
 
-
 /*
  * Perform a safe wake up of the poll wait list. The problem is that
  * with the new callback'd wake up system, it is possible that the
@@ -393,303 +346,265 @@
spin_unlock_irqrestore(psw-lock, flags);
 }
 
-
 /*
- * This is called from eventpoll_release() to unlink files from the eventpoll
- * interface. We need to have this facility to cleanup correctly files that are
- * closed without being removed from the eventpoll interface.
+ * This function unregister poll callbacks from the associated file descriptor.
+ * Since this must be called without holding ep-lock the atomic exchange 
trick
+ * will protect us from multiple unregister.
  */
-void eventpoll_release_file(struct file *file)
+static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
-   struct list_head *lsthead = file-f_ep_links;
-   struct eventpoll *ep;
-   struct epitem *epi;
+   int nwait;
+   struct list_head *lsthead = epi-pwqlist;
+   struct eppoll_entry *pwq;
 
-   /*
-* We don't want to get file-f_ep_lock because it is not
-* necessary. It is not necessary because we're in the struct file
-* cleanup path, and this means that noone is using this file anymore.
-* The only hit might come from ep_free() but by holding the semaphore
-* will correctly serialize the operation. We do need to acquire
-* ep-sem after epmutex because ep_remove() requires it when called
-* from anywhere but ep_free().
-*/
-   mutex_lock(epmutex);
+   /* This is called without locks, so we

Re: 2.6.24-rc6: possible recursive locking detected

2008-01-07 Thread Davide Libenzi
On Sun, 6 Jan 2008, Christian Kujau wrote:

 On Sat, 5 Jan 2008, Davide Libenzi wrote:
  A solution may be to move the call to ep_poll_safewake() (that'd become a
  simple wake_up()) inside a tasklet or whatever is today trendy for delayed
  work. But his kinda scares me to be honest, since epoll has already a
  bunch of places where it could be asynchronously hit (plus performance
  regression will need to be verified).
 
 Although I'm not able to reproduce this one right now, I'm happy to test any
 patches you guys come up with.

There's no need to reproduce it. It's there, it's among us ;)



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Improve scalability of epoll_ctl

2008-01-08 Thread Davide Libenzi
On Tue, 8 Jan 2008, Eric Dumazet wrote:

 Changli Gao a écrit :
  Replace the epitem rbtree with a dynamic array to get the constant 
  insertion/deletion/modification time of the file descriptors. Reuse the 
  size argument of epoll_create, however the size must be smaller than the 
  max file descriptor number: ether the resource limitation or the compiling 
  time limitation.
 

 Hum, you should read this :
 
 http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc3/2.6.7-rc3-mm2/broken-out/epoll-uses-rbtrees.patch
 
 
 Your patch is a revert of this change, it probably wont be accepted.
 
 epoll_ctl() is scalable, since it's O(log2(N)) : With 1 millions files
 descriptors, that means less than 20 nodes to lookup in the tree.

Indeed, and we aren't going back.


 In what situation do you think epoll_ctl() performance is bad ?

I'm sure you can cook up a microbench that only does epoll_ctl(), and with 
a good hash implementation you get better numbers. This is what I did 
before I changed to rbtree, and the numbers did not justify the code (the 
size in epoll_create() is just an hint, and that means you need to 
handle resizing - plus other things like locking up memory when the number 
of fds shrinks, and other stuff I don't even remember).



- Davide



Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Davide Libenzi
On Thu, 10 Jan 2008, Rusty Russell wrote:

 On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
   The latter.  A ring is optimal for processing a huge number of requests,
   but if you're really going to be firing off syslet threads all over the
   place you're not going to be optimal anyway.  And being able to point the
   return value to the stack or into some datastructure is way nicer to code
   (zero setup == easy to understand and easy to convert).
 
  One of Linus' rhetorical requirements for the syslets work is that we be
  able to wait for the result without spending overhead building up state
  in some completion context.  The userland rings in the current syslet
  patches achieve that and don't seem outrageously complicated.
 
 I'd have to read his original statement, but eventfd doesn't build up state, 
 so I think it qualifies.

I think you and Zach are talking about different issues ;)
Eventfd born for two different reasons. First, to be able to have 
userspace to signal to a poll/select/epoll based listener an event. This 
can elso be done with pipes, but eventfd has a few advantages over pipes 
(3-4 times faster and *a lot* less memory footprint). Second, as a generic 
way for kernel subsystems to signal completions to a poll/select/epoll 
userspace listener. And this is what is used in the new KAIO eventfd 
feature (patch was like 5 lines IIRC).
This allow for KAIO events to be signaled to poll/select/epoll in a pretty 
easy way, using a simple extension of the AIO API.
What we talked originally with Ingo, when the first syslet code came up, 
was the ability to do the reverse thing. That is, host an epoll_wait() 
inside a syslet, and gather the completion using whatever the syslet code 
was/is going to use for it.
Given that 1) the eventfd completion patch was trivial and immediately 
available 2) the future of the whole syslet concept was/is still unclear, 
I believe it makes/made sense. If the syslets will become mainline, it'll 
mean that userspace will then be able to select the event-completion 
hosting method that better suits their needs. That are, either AIO 
completion hosted inside an epoll_wait() via an eventfd, or an 
epoll_wait() hosted inside a syslet.




- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: A revised timerfd API

2007-09-18 Thread Davide Libenzi
On Tue, 18 Sep 2007, Michael Kerrisk wrote:

 The four designs are:
 
 a) A multiplexing timerfd() system call.
 b) Creating three syscalls analogous to the POSIX timers API (i.e.,
timerfd_create/timerfd_settime/timerfd_gettime).
 c) Creating a simplified timerfd() system call that is integrated
with the POSIX timers API.
 d) Extending the POSIX timers API to support the timerfd concept.

If you really want to shoot yourself in your foot, I'd pick bullet B.
Bullet A makes me sea-sick, and bullets C and D, well, let's leave POSIX 
APIs being *POSIX* APIs.
Once you remove all the ifs and elses that resulted from your previous 
bullet A multiplexing implementation, timerfd_gettime and timerfd_settime 
should result in being pretty slick.
I still think we could have survived w/out all this done inside the 
kernel though.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi
On Wed, 19 Sep 2007, Nagendra Tomar wrote:

[ ... ]

Seems like epoll ate your message too :)
Can you repost your bug report and the patch?


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi
On Wed, 19 Sep 2007, David Miller wrote:

 From: Nagendra Tomar [EMAIL PROTECTED]
 Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
 
  With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
  not return, even when the incoming acks free the buffers.
  Note that this patch assumes that the SOCK_NOSPACE check in
  tcp_check_space is a trivial optimization which can be safely removed.
 
 I already replied to your patch posting explaining that whatever is
 not setting SOCK_NOSPACE should be fixed instead.
 
 Please address that, thanks.

You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
completely device-unaware interface like epoll, I hope?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi
On Wed, 19 Sep 2007, Nagendra Tomar wrote:

 Definitely not ! 
 
 The point is that the tcp write space available 
 wakeup does not get called if SOCK_NOSPACE bit is not set. This was
 fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
 indicated that someone really cared abt the wakeup). Now after the
 introduction of callback'ed wakeups, we might have some work to
 do inside the callback even if there is nobody interested in the wakeup
 at that point of time. 
 
 In this particular case the ep_poll_callback is not getting called and
 hence the socket fd is not getting added to the ready list.

I know, I saw the patch. I was just commenting the point where DaveM was 
heading to ;)
This things needs to be looked at a little bit more deeply.


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi
On Wed, 19 Sep 2007, Nagendra Tomar wrote:

 The tcp_check_space() function calls tcp_new_space() only if the
 SOCK_NOSPACE bit is set in the socket flags. This is causing Edge Triggered
 EPOLLOUT events to be missed for TCP sockets, as the ep_poll_callback() 
 is not called from the wakeup routine.
 
 The SOCK_NOSPACE bit indicates the user's intent to perform writes
 on that socket (set in tcp_sendmsg and tcp_poll). I believe the idea 
 behind the SOCK_NOSPACE check is to optimize away the tcp_new_space call
 in cases when user is not interested in writing to the socket. These two
 take care of all possible scenarios in which a user can convey his intent
 to write on that socket.
 
 Case 1: tcp_sendmsg detects lack of sndbuf space
 Case 2: tcp_poll returns not writable
 
 This is fine if we do not deal with epoll's Edge Triggered events (EPOLLET).
 With ET events we can have a scenario where the SOCK_NOSPACE bit is not set,
 as the user has neither done a sendmsg nor a poll/epoll call that returned
 with the POLLOUT condition not set. 

Looking back at it, I think the current TCP code is right, once you look 
at the event to be a output buffer full-with_space transition.
If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
(free space on the output buffer), if you do not consume it (say a 
tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
anymore since they happen on the full-with_space transition.
Yes, I know, the read size (EPOLLIN) works differently and you get an 
event for every packet you receive. And yes, I do not like asymmetric 
things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi
On Thu, 20 Sep 2007, Eric Dumazet wrote:

 Does it means that with your patch each ACK on a ET managed socket will
 trigger an epoll event   ?
 
 Maybe your very sensitive high throuput appication needs to set a flag or
 something at socket level to ask for such a behavior.
 
 The default should stay as is. That is an event should be sent only if someone
 cared about the wakeup.

Unfortunately f_op-poll() does not let the caller to specify the events 
it's interested in, that would allow to split send/recevie wait queues and 
better detect read/write cases.
The detection of a waitqueue_active(-sk_wr_sleep) would work fine in 
detecting is someone is actually waiting for a write, w/out the false 
positives triggered by the read-waiters.
That would be a very sane thing to do, but would require a bigdumb change 
to all the -poll around (that could be automated by a script - devices 
not caring about the events hint can just continue to use the single queue 
like they currently do), and a more critical and gradual change of all the 
devices that wants to take advantage of it.
That way, no more magic bits are needed, and a simple waitqueue_active() 
would tell you if someone is waiting for write-space events.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] signalfd simplification

2007-09-20 Thread Davide Libenzi
The following patch simplifies signalfd code, by avoiding it to remain 
attached to the sighand during its lifetime.
In this way, the signalfd remain attached to the sighand only during 
poll(2) (and select and epoll) and read(2). This also allows to remove all 
the custom tsk == current checks in kernel/signal.c, since 
dequeue_signal() will only be called by current.
I think this is also what Ben was suggesting time ago.
The external effect of this, is that a thread can extract only its own 
private signals and the group ones. I think this is an acceptable 
behaviour, in that those are the signals the thread would be able to fetch 
w/out signalfd.
Oleg, as far as the smart wakeup thing. Can't be done w/out huge 
revolutions to the kernel code.
It can easily be done for the -read(), but not for the -poll(). You need 
to be the provider of the poll_table in order to install custom wakeups, 
and inside -poll(), the providers are either fs/select.c or 
fs/eventpoll.c (or fs/aio.c). We could add a new poll_wait_func() to pass 
the function to be called, and that would work for fs/select.c (in there 
you can filter based on the signal and eventually issue the real wakeup). 
But it won't work for epoll, since it already expect to install its own 
callback, and ATM callbacks can't be chained.
Since the thundering herd effect on signalfds should be pretty limited, I 
think it's not worth the change.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide



---
 fs/exec.c |3 
 fs/signalfd.c |  190 +++---
 include/linux/init_task.h |2 
 include/linux/sched.h |2 
 include/linux/signalfd.h  |   40 -
 kernel/exit.c |9 --
 kernel/fork.c |2 
 kernel/signal.c   |8 -
 8 files changed, 39 insertions(+), 217 deletions(-)

Index: linux-2.6.mod/fs/signalfd.c
===
--- linux-2.6.mod.orig/fs/signalfd.c2007-09-19 15:54:21.0 -0700
+++ linux-2.6.mod/fs/signalfd.c 2007-09-19 15:54:42.0 -0700
@@ -11,8 +11,10 @@
  *  Now using anonymous inode source.
  *  Thanks to Oleg Nesterov for useful code review and suggestions.
  *  More comments and suggestions from Arnd Bergmann.
- * Sat May 19, 2007: Davi E. M. Arnaut [EMAIL PROTECTED]
+ *  Sat May 19, 2007: Davi E. M. Arnaut [EMAIL PROTECTED]
  *  Retrieve multiple signals with one read() call
+ *  Sun Jul 15, 2007: Davide Libenzi [EMAIL PROTECTED]
+ *  Attach to the sighand only during read() and poll().
  */
 
 #include linux/file.h
@@ -27,102 +29,12 @@
 #include linux/signalfd.h
 
 struct signalfd_ctx {
-   struct list_head lnk;
-   wait_queue_head_t wqh;
sigset_t sigmask;
-   struct task_struct *tsk;
 };
 
-struct signalfd_lockctx {
-   struct task_struct *tsk;
-   unsigned long flags;
-};
-
-/*
- * Tries to acquire the sighand lock. We do not increment the sighand
- * use count, and we do not even pin the task struct, so we need to
- * do it inside an RCU read lock, and we must be prepared for the
- * ctx-tsk going to NULL (in signalfd_deliver()), and for the sighand
- * being detached. We return 0 if the sighand has been detached, or
- * 1 if we were able to pin the sighand lock.
- */
-static int signalfd_lock(struct signalfd_ctx *ctx, struct signalfd_lockctx *lk)
-{
-   struct sighand_struct *sighand = NULL;
-
-   rcu_read_lock();
-   lk-tsk = rcu_dereference(ctx-tsk);
-   if (likely(lk-tsk != NULL))
-   sighand = lock_task_sighand(lk-tsk, lk-flags);
-   rcu_read_unlock();
-
-   if (!sighand)
-   return 0;
-
-   if (!ctx-tsk) {
-   unlock_task_sighand(lk-tsk, lk-flags);
-   return 0;
-   }
-
-   if (lk-tsk-tgid == current-tgid)
-   lk-tsk = current;
-
-   return 1;
-}
-
-static void signalfd_unlock(struct signalfd_lockctx *lk)
-{
-   unlock_task_sighand(lk-tsk, lk-flags);
-}
-
-/*
- * This must be called with the sighand lock held.
- */
-void signalfd_deliver(struct task_struct *tsk, int sig)
-{
-   struct sighand_struct *sighand = tsk-sighand;
-   struct signalfd_ctx *ctx, *tmp;
-
-   BUG_ON(!sig);
-   list_for_each_entry_safe(ctx, tmp, sighand-signalfd_list, lnk) {
-   /*
-* We use a negative signal value as a way to broadcast that the
-* sighand has been orphaned, so that we can notify all the
-* listeners about this. Remember the ctx-sigmask is inverted,
-* so if the user is interested in a signal, that corresponding
-* bit will be zero.
-*/
-   if (sig  0) {
-   if (ctx-tsk == tsk) {
-   ctx-tsk = NULL;
-   list_del_init(ctx-lnk);
-   wake_up(ctx-wqh

Re: Tesing of / bugs in new timerfd API

2007-12-17 Thread Davide Libenzi
On Mon, 17 Dec 2007, Michael Kerrisk wrote:

  Can you try the two patches below? I tried them on my 32 bit box (one of
  the rare beasts still lingering around here) and it seems to be working
  fine (those go on top of the previous ones).
 
 Against 2.6.24-rc5, I applied first your earlier patches (v3) and
 then the newest patch.  My tests confirm that:
 
  This fixed the 32 bit tick-count truncation, and makes the time returned
  to be the remaining time till the next expiration.
 
 Are you going to resubmit a new patch set that includes these latest changes?

Yes, I'll send them out to Andrew today.


- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/2] timerfd - make hrtimer_forward() to return a u64

2007-12-17 Thread Davide Libenzi
This patch makes hrtimer_forward() to return a u64 instead of unsigned long.
Since timerfd returns the number of timer ticks in a u64 variable, and
hrtimer_forward() is used to calculate the timer ticks, the patch allow full
64 bit usage even on 32 bit platforms. The core of the hrtimer_forward() ticks
calculation, ktime_divns(), was already having the result in u64 and it was
chopping it to unsigned long.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer-base-get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG  64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG  64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc = sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG = 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer-expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr-it.real.interval.tv64 == 0)
return;
 
-   timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(),
-   timr-it.real.interval);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer-base-get_time(),
+   timr-it.real.interval);
 
timr-it_overrun_last = timr-it_overrun;
timr-it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr-it_overrun +=
+   timr-it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr-it.real.interval);
ret = HRTIMER_RESTART;
@@ -662,7 +663,7 @@
 */
if (iv.tv64  (timr-it_requeue_pending  REQUEUE_PENDING ||
(timr-it_sigev_notify  ~SIGEV_THREAD_ID) == SIGEV_NONE))
-   timr-it_overrun += hrtimer_forward(timer, now, iv);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer, now, 
iv);
 
remaining = ktime_sub(timer-expires, now);
/* Return 0 only, when the timer is expired and not pending */
Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-13 16:49:46.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:04:36.0

[patch 2/2] timerfd - make the returned time to be the remaining time till the next expiration

2007-12-17 Thread Davide Libenzi
Make the returned time to be the remaining time till the next expiration.
If the timer is already expired, and there's no next expiration, zero will
be returned.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/timerfd.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-14 16:04:36.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:05:32.0 -0800
@@ -49,6 +49,15 @@
return HRTIMER_NORESTART;
 }
 
+static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) {
+   ktime_t now, remaining;
+
+   now = ctx-tmr.base-get_time();
+   remaining = ktime_sub(ctx-tmr.expires, now);
+
+   return remaining.tv64  0 ? ktime_set(0, 0): remaining;
+}
+
 static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
@@ -240,7 +249,7 @@
if (ctx-expired  ctx-tintv.tv64)
hrtimer_forward_now(ctx-tmr, ctx-tintv);
 
-   kotmr.it_value = ktime_to_timespec(ctx-tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx-tintv);
 
/*
@@ -274,7 +283,7 @@
hrtimer_forward_now(ctx-tmr, ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
}
-   kotmr.it_value = ktime_to_timespec(ctx-tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx-tintv);
spin_unlock_irq(ctx-wqh.lock);
fput(file);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fix group stop with exit race

2007-12-05 Thread Davide Libenzi
On Wed, 5 Dec 2007, Oleg Nesterov wrote:

 do_signal_stop() counts all sub-thread and sets -group_stop_count 
 accordingly.
 Every thread should decrement -group_stop_count and stop, the last one should
 notify the parent.
 
 However a sub-thread can exit before it notices the signal_pending(), or it 
 may
 be somewhere in do_exit() already. In that case the group stop never finishes
 properly.
 
 Note: this is a minimal fix, we can add some optimizations later. Say we can
 return quickly if thread_group_empty(). Also, we can move some signal related
 code from exit_notify() to exit_signals().
 
 Signed-off-by: Oleg Nesterov [EMAIL PROTECTED]

Looks OK for me, even though we're doing more work on the exit path. OTOH 
I don't see a non-racy way of doing it w/out grabbing the lock. Did you 
try to bench how much this change costs?
Anyway, looks sane to me...

Acked-by: Davide Libenzi [EMAIL PROTECTED]



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] move the related code from exit_notify() to exit_signals()

2007-12-06 Thread Davide Libenzi
On Thu, 6 Dec 2007, Oleg Nesterov wrote:

 The previous bugfix was not optimal, we shouldn't care about group stop when
 we are the only thread or the group stop is in progress. In that case nothing
 special is needed, just set PF_EXITING and return.
 
 Also, take the related TIF_SIGPENDING re-targeting code from exit_notify().
 
 So, from the performance POV the only difference is that we don't trust
 !signal_pending() until we take -siglock. But this in fact fixes another
 ___pure___ theoretical minor race. __group_complete_signal() finds the task
 without PF_EXITING and chooses it as the target for signal_wake_up(). But
 nothing prevents this task from exiting in between without noticing the
 pending signal and thus unpredictably delaying the actual delivery.

For a second there, you got me confused, since it wasn't clear to me this 
patch was going over the previous one. When posting patches that are not 
in any tree, it may be wise to include the set they nest onto ;)


 + if (!signal_pending(tsk))
 + goto out;
 +
 + /* It could be that __group_complete_signal() choose us to
 +  * notify about group-wide signal. Another thread should be
 +  * woken now to take the signal since we will not.
 +  */
 + for (t = tsk; (t = next_thread(t)) != tsk; )
 + if (!signal_pending(t)  !(t-flags  PF_EXITING))
 + recalc_sigpending_and_wake(t);
 +
 + if (unlikely(tsk-signal-group_stop_count) 
 + !--tsk-signal-group_stop_count) {
 + tsk-signal-flags = SIGNAL_STOP_STOPPED;
 + group_stop = 1;
 + }
 +out:

Looks fine to me, even though the one above could simply be an:

if (signal_pending(tsk)) {
...
}

(since the out label is used by that if only).



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Davide Libenzi
On Tue, 20 Nov 2007, Ingo Molnar wrote:

 * H. Peter Anvin [EMAIL PROTECTED] wrote:
 
  It seems that you're doing the same thing in both cases, except you're 
  now extending it to include other random functionality, which means 
  other things than syslets are suddenly affected.
 
  syslets are arguably a little bit different, since what you're 
  effectively doing there is running a miniature interpreted language in 
  kernel space.  A higher startup overhead should be acceptable, since 
  you're amortizing it over a larger number of calls.  Extending that 
  mechanism suddenly means you HAVE to use that interpreted language 
  message mechanism to access certain system calls, which really does 
  not seem like a good thing neither for performance nor for encouraging 
  sane design of interfaces.
 
 whether that interpreted syslet language survives is still an open 
 question - it was extremely ugly when i wrote the first version of it 
 and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

 Hey Davide,
 
 Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...

Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was 
merged. Some screw up reverted it?


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

 On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote:
  On Thu, 22 Nov 2007, Michael Kerrisk wrote:
 
   Hey Davide,
  
   Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
 
  Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was
  merged. Some screw up reverted it?
 
 t looks that way.

I'm looking at the log now. It never went in actually. Andrew-san, what 
happened?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Andrew Morton wrote:

 On Thu, 22 Nov 2007 11:46:13 -0800 (PST) Davide Libenzi [EMAIL PROTECTED] 
 wrote:
 
  On Thu, 22 Nov 2007, Michael Kerrisk wrote:
  
   On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote:
On Thu, 22 Nov 2007, Michael Kerrisk wrote:
   
 Hey Davide,

 Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
   
Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it 
was
merged. Some screw up reverted it?
   
   t looks that way.
  
  I'm looking at the log now. It never went in actually. Andrew-san, what 
  happened?
  
 
 Last I recall, we removed the API for 2.6.23 because we intended to do a
 different interface for 2.6.24.
 
 But I don't recall seeing any timerfd patches in maybe a month.

Was sent on Sep 23, Subject: new timerfd API
Do you want me to repost?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v2 - un-break CONFIG_TIMERFD

2007-11-23 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:13:16.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-23 13:36:42.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] Timerfd v2 - introduce a new hrtimer_forward_now() function

2007-11-23 Thread Davide Libenzi
I think that advancing the timer against the timer's current now can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:13:21.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-23 13:36:36.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer-base-get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Andrew Morton wrote:

  I suppose this means that timerfd will only go in for 2.6.25.  I don't
  have a problem with that, but we better make sure that the existing
  timerfd in 2.6.24 is still disabled.  (Andrew had a one liner for
  that, but I haven't checked if it's in place.)
  
 
 I have no timerfd patches here.

Yes, it's disabled, and yes, I'll repost today ...


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Ulrich Drepper wrote:

 On Nov 23, 2007 9:29 AM, Davide Libenzi [EMAIL PROTECTED] wrote:
  Yes, it's disabled, and yes, I'll repost today ...
 
 I haven't seen the patch and don't feel like searching.  So I say it
 here: please mak sure you add a flags parameter to the system call
 itself (instead of adding it on as for eventfd and signalfd).  We need
 to be able to use O_CLOEXEC some way or another.

I'm more then OK about adding a flags parameter. If it was for me, I'd add 
it even to eventfd and signalfd. I asked Linus if he was OK about adding 
the flags parameter to all. He didn't reply, and I read that as no.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v2 - new timerfd API

2007-11-23 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:13:19.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-23 13:36:39.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(ctx-tmr,
-   hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ticks += (u64) hrtimer_forward_now(ctx-tmr,
+  ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired

[patch 3/4] Timerfd v2 - wire the new timerfd API to the x86 family

2007-11-23 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-23 13:36:40.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-23 13:36:40.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-23 
13:36:40.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-23 13:36:40.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v2 - new timerfd API

2007-11-24 Thread Davide Libenzi
On Sat, 24 Nov 2007, Michael Kerrisk wrote:

  +asmlinkage long sys_timerfd_create(int clockid, int flags)
   {
  -   int error;
  +   int error, ufd;
  struct timerfd_ctx *ctx;
  struct file *file;
  struct inode *inode;
  -   struct itimerspec ktmr;
  -
  -   if (copy_from_user(ktmr, utmr, sizeof(ktmr)))
  -   return -EFAULT;
   
  if (clockid != CLOCK_MONOTONIC 
  clockid != CLOCK_REALTIME)
  return -EINVAL;
 
 Could I suggest here, the following placeholder addition:
 
   if (flags != 0)
   return -EINVAL;

Make sense, will repost.


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] Timerfd v3 - introduce a new hrtimer_forward_now() function

2007-11-25 Thread Davide Libenzi
I think that advancing the timer against the timer's current now can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:55:16.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-24 12:48:05.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer-base-get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v3 - new timerfd API

2007-11-25 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 166 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:55:16.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-24 12:49:21.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(ctx-tmr,
-   hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ticks += (u64) hrtimer_forward_now(ctx-tmr,
+  ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired

[patch 3/4] Timerfd v3 - wire the new timerfd API to the x86 family

2007-11-25 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-24 12:49:28.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-24 12:49:28.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-24 
12:49:28.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-24 12:49:28.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v3 - un-break CONFIG_TIMERFD

2007-11-25 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:55:15.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-24 12:49:30.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread Davide Libenzi
On Mon, 26 Nov 2007, H. Peter Anvin wrote:

 Ingo Molnar wrote:
  
  So it's not like sys_indirect() would break some magic pristine state of a
  flat parameter space - on the contrary, most of the nontrivial syscalls take
  pointers to structures or pointers to streams of information. The parameter
  count histogram i believe further underlines this point:
  
#args   #syscalls
-
0   22
1   51
2   83
3   85
4   40
5   23
68
  
  the natural 'center' of function call parameter counts is around 1-4
  parameters, and that is natural. (most operators that the human brain
  prefers to operate with are like that - having higher complexity than that
  often defeats the purpose of getting an API used by ... humans.)
  
 
 I was preparing a response to Linus' email, but I really feel this needs to be
 addressed specifically.
 
 When it comes to dealing with the operator-visible state, what matters is what
 happens on the API level, NOT on the system call level. Furthermore, the
 proposed sys_indirect interface just means that there are parameters hidden
 from immediately view, even though they fundamentally change the operation
 performed, and that it is much harder to correlate, say, the output of
 strace(1) with what actually happened in the program.  So from a
 *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system 
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have any strong 
opinions in either cases.
The second is the one I'd care most. I think that, whatever is the 
solution used to address the first, internal kernel APIs should be 
designed so that parameters flow down from the system call to the 
parameter's user code. IMO, besides very few cases where it could make 
some sense [*], setting some thread-global bits in the upper layer, to be 
magically picked up by code in the lower layers, does not lead to readable 
interfaces.



[*] Things that already read from a shared context, that is already 
exposed to the user through some sort of set/get APIs.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

 On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  +static struct file *timerfd_fget(int fd)
  +{
  +   struct file *file;
  +
  +   file = fget(fd);
  +   if (!file)
  +   return ERR_PTR(-EBADF);
  +   if (file-f_op != timerfd_fops) {
  +   fput(file);
  +   return ERR_PTR(-EINVAL);
  +   }
  +
  +   return file;
  +}
 
 I suppose we could use fget_light() in here sometime.  It is significantly
 quicker in microbenchmarks.  Or it was - nobody has checked that in a few
 years afaik.

Should I now? Half of the internet traffic of the last three month was 
generated by me posting those timerfd patches :)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

 On Tue, 27 Nov 2007 12:47:46 -0800 (PST)
 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  On Tue, 27 Nov 2007, Andrew Morton wrote:
  
   On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] 
   wrote:
   
+static struct file *timerfd_fget(int fd)
+{
+   struct file *file;
+
+   file = fget(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);
+   if (file-f_op != timerfd_fops) {
+   fput(file);
+   return ERR_PTR(-EINVAL);
+   }
+
+   return file;
+}
   
   I suppose we could use fget_light() in here sometime.  It is significantly
   quicker in microbenchmarks.  Or it was - nobody has checked that in a few
   years afaik.
  
  Should I now?
 
 No great rush.  It'd be fun to see if it actually makes any measurable
 difference on modern hardware (hint ;)).

I was going to say BS, given that at the time of the tests, the files 
struct was protected by an rwlock whereas now there's a rcu one.
But that seems not the case:

http://www.xmailserver.org/fget-light-test.c


$ fget-light-test -r 9   
warming up ...
testing non-shared ...
time = 314.354000 ms
testing shared ...
time = 390.781000 ms



And here is the oprofile output:

[SHARED CASE]
samples  %app name symbol name
7436 28.9339  vmlinux  __clear_user
2369  9.2179  vmlinux  system_call
1710  6.6537  vmlinux  fget_light
1244  4.8405  vmlinux  inotify_dentry_parent_queue_event
1128  4.3891  vmlinux  sys_read
1041  4.0506  libpthread-2.6.1.so  __read_nocancel
1027  3.9961  Xorg (no symbols)
978   3.8054  vmlinux  read_zero
755   2.9377  vmlinux  vfs_read
545   2.1206  vmlinux  inotify_inode_queue_event
507   1.9728  vmlinux  sysret_check
414   1.6109  vmlinux  rw_verify_area
405   1.5759  vmlinux  unix_poll
371   1.4436  nvidia   (no symbols)
333   1.2957  vmlinux  acpi_pm_read
311   1.2101  nvidia_drv.so(no symbols)
290   1.1284  vmlinux  do_select
288   1.1206  vmlinux  dnotify_parent
253   0.9844  libc-2.6.1.so(no symbols)
232   0.9027  bash (no symbols)
216   0.8405  libpthread-2.6.1.so  __pthread_enable_asynccancel


[UNSHARED CASE]
samples  %app name symbol name
6542 27.6922  vmlinux  __clear_user
4074 17.2452  vmlinux  vfs_read
1266  5.3590  vmlinux  inotify_inode_queue_event
1091  4.6182  Xorg (no symbols)
1059  4.4827  vmlinux  system_call
937   3.9663  libpthread-2.6.1.so  __read_nocancel
544   2.3027  vmlinux  clear_user
484   2.0488  vmlinux  dnotify_parent
445   1.8837  vmlinux  read_zero
438   1.8540  vmlinux  sysret_check
414   1.7525  vmlinux  unix_poll
407   1.7228  nvidia   (no symbols)
389   1.6466  vmlinux  acpi_pm_read
315   1.3334  nvidia_drv.so(no symbols)
312   1.3207  vmlinux  inotify_dentry_parent_queue_event
312   1.3207  vmlinux  sys_read
305   1.2911  vmlinux  rw_verify_area
304   1.2868  vmlinux  do_select
222   0.9397  libc-2.6.1.so(no symbols)
214   0.9059  bash (no symbols)
196   0.8297  fget-light-test  read_test
185   0.7831  vmlinux  fget_light



You can clearly notice the fget_light() drop out of the relevance window.
BTW, are all those notify noises supposed to be there?




- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm2 - Build Failure on powerpc timerfd() undeclared

2007-11-28 Thread Davide Libenzi
On Wed, 28 Nov 2007, Andrew Morton wrote:

 On Wed, 28 Nov 2007 14:32:07 +0100 Arnd Bergmann [EMAIL PROTECTED] wrote:
 
  On Wednesday 28 November 2007, Kamalesh Babulal wrote:
   Kernel build fails, with build error
   
     CC      arch/powerpc/platforms/cell/spu_callbacks.o
   In file included from arch/powerpc/platforms/cell/spu_callbacks.c:49:
   include/asm/systbl.h:312: error: ‘sys_timerfd’ undeclared here (not in a 
   function)
   make[2]: *** [arch/powerpc/platforms/cell/spu_callbacks.o] Error 1
   make[1]: *** [arch/powerpc/platforms/cell] Error 2
   make: *** [arch/powerpc/platforms] Error 2
   
  
  I guess all architectures except x86 are currently broken because they
  reference the old sys_timerfd function.
 
 None of them were broken in my testing and I'm unsure why powerpc broke
 here.
 
  This patch should add the missing
  bits to powerpc.
  
 
 Because the patches in -mm left the stubs in place in sys_ni.c and powerpc
 _should_ have (incorrectly) picked those up.

My fault. I forgot to update sys_ni.c with the new functions (and with the
sys_timerfd-sys_timerfd_create name change).
Do you want a patch Andrew?



- Davide


[patch] update sys_ni.c with the new timerfd syscalls

2007-11-28 Thread Davide Libenzi
Update sys_ni.c with the new timerfd syscalls.


Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 kernel/sys_ni.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/kernel/sys_ni.c
===
--- linux-2.6.mod.orig/kernel/sys_ni.c  2007-11-28 11:44:48.0 -0800
+++ linux-2.6.mod/kernel/sys_ni.c   2007-11-28 11:46:01.0 -0800
@@ -153,7 +153,9 @@
 
 /* New file descriptors */
 cond_syscall(sys_signalfd);
-cond_syscall(sys_timerfd);
+cond_syscall(sys_timerfd_create);
+cond_syscall(sys_timerfd_settime);
+cond_syscall(sys_timerfd_gettime);
 cond_syscall(compat_sys_signalfd);
 cond_syscall(compat_sys_timerfd);
 cond_syscall(sys_eventfd);

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

 Davide, Andrew,
 
 I applied Davide's v3 patchset (sent into LKML on 25 Nov) against
 2.4.24-rc3, and did various tests (all on x86).  Several tests
 were done using the program at the foot of this mail.  Various others
 were done by cobbling together bits of code that I haven't included
 here.

Thanks for such a thorough test Michael.



 Tested: after setting a CLOCK_REALTIME timer with the
 TFD_TIMER_ABSTIME flag to expire at some time in the past with
 a non-zero interval (e.g., setting 100 seconds in the past, with
 a 5 second interval), read() from the file descriptor returns
 the correct number of expirations (e.g., 20).
 
 This seems a reasonable thing to do, I suppose.  However, while
 playing around to test this, I found what looks like a bug (see
 below).
 
 BUG 1:
 However, this test exposed what looks like a bug: if I set a
 CLOCK_REALTIME clock to expire in the past, with a very small
 interval, then the maximum number of expirations that can be
 returned via read seems to be limited to 32 bits, even though
 we have a 64-bit value for returning this information.
 I haven't checked the kernel source to determine where this
 bug is.

Yes, true. Right now hrtimer_forward() (that we use to get the missed 
ticks) returns an unsigned long, that on 32 bit is (duh!) 32 bit :)
But hrtimer_forward() has all in place to just return an u64. So, Thomas, 
would it be OK to have hrtimer_forward() (and the new hrtimer_forward_now())
to return a u64?




 BUG 2:
 The last sentence does not match the implementation.
 (Nor is it consistent with the behavior of POSIX timers.
 And I *think* things did work correctly in the original
 timerfd() implementation, but I have not gone back to check.)
 
 Suppose that we set an absolute timer to expire 100 seconds
 in the future.  Then according to this sentence of the man
 page then each subsequent call to timerfd_gettime() should
 retrun an itimerspec structure whose it_value steadily
 decreases from 100 to 0 (when the timer expires).  (This
 is the behavior in the analogous situation with POSIX timers
 and with setitimer()/getitimer().)
 
 However, the implementation of timerfd_gettime() always
 returns the time when the timer would next expire, and
 this value depends on whether TFD_TIMER_ABSTIME was specified
 when setting the timer.

This is been like that from the beginning of the new API. So no, the 
previous was behaving exactly the same WRT this feature.
Is this something really needed?



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

   BUG 2:
   The last sentence does not match the implementation.
   (Nor is it consistent with the behavior of POSIX timers.
   And I *think* things did work correctly in the original
   timerfd() implementation, but I have not gone back to check.)
  
   Suppose that we set an absolute timer to expire 100 seconds
   in the future.  Then according to this sentence of the man
   page then each subsequent call to timerfd_gettime() should
   retrun an itimerspec structure whose it_value steadily
   decreases from 100 to 0 (when the timer expires).  (This
   is the behavior in the analogous situation with POSIX timers
   and with setitimer()/getitimer().)
  
   However, the implementation of timerfd_gettime() always
   returns the time when the timer would next expire, and
   this value depends on whether TFD_TIMER_ABSTIME was specified
   when setting the timer.
 
  This is been like that from the beginning of the new API. So no, the
  previous was behaving exactly the same WRT this feature.
  Is this something really needed?
 
 Three reasons that I think of off the top of my head (and there might
 well be more reasosn) why this should change:
 
 a) consistency with the other two timer APIs (POSIX timers
 (timer_create(), etc.), and setitimer()/getitimer()).
 
 b) Returning the amount of time until the next expiration  is more
 useful to userland: I'd say the most common case is for userland to
 want to know how long until the next expiration occurs, or to adjust
 that time by adding/subtracting some value to the existing setting.
 That is difficult to with the current implementation: the userland app
 must use timer_gettime(), call clock_gettime(), and calculate the
 difference between the two, in order to know how much time remains
 until the next timer expiration.

Bah, I don't want to argue with that because otherwise it starts going 
towards the typical use cases listing, that can be found the same exact 
reasons to have one or the other way. And we end up wasting lots of time.
We'd just have to move another thing that userspace could do, inside the 
kernel (subtract the current value returned by hrtimer_forward() in 
-expires with now).




 c) Currently, the information returned differs depending on whether
 TFD_TIMER_ABSTIME is specified -- this is not the case for the
 analogous situation for POSIX timers.  For POSIX timers, the returned
 setting is always the amount of time until the timer next expires.
 This inconsistency is messy for applications -- the application may
 not (be able to) know whether or not the timer it is examining was set
 using TFD_TIMER_ABSTIME (the timerfd might have been created by a
 library, for example).

Hmm... the time returned is always the next absolute time when the next 
timer tick will go off. It does not depend on TFD_TIMER_ABSTIME. I return 
the -expires field of the timer, and hrtimer_forward() sets it to the 
next absolute time.




- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-16 Thread Davide Libenzi
On Fri, 14 Dec 2007, Michael Kerrisk wrote:

 You snipped my example that demonstrated the problem.  Both of the
 following runs create a timer that expires 10 seconds from now, but
 observe the difference in the value returned by timerfd_gettime():
 
 $ ./timerfd_test 10 # does not use TFD_TIMER_ABSTIME
 Initial setting for settime:   value=10.000, interval=0.000
 ./timerfd_test g
 (elapsed time=  1)
 Current value: value=346.448, interval=0.000
 
 $ ./timerfd_test -a 10  # uses TFD_TIMER_ABSTIME
 Initial setting for settime:   value=1197630855.254, interval=0.000
 ./timerfd_test g
 (elapsed time=  1)
 Current value: value=1197630855.254, interval=0.000
 
 Either there's an inconsistency here depending on the use of
 TFD_TIMER_ABSTIME, or there is a bug in my understanding or my test program
 (but so far I haven't spotted that bug ;-).).

Can you try the two patches below? I tried them on my 32 bit box (one of 
the rare beasts still lingering around here) and it seems to be working 
fine (those go on top of the previous ones).
This fixed the 32 bit tick-count truncation, and makes the time returned 
to be the remaining time till the next expiration.




- Davide



---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer-base-get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG  64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG  64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc = sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG = 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer-expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr-it.real.interval.tv64 == 0)
return;
 
-   timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(),
-   timr-it.real.interval);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer-base-get_time(),
+   timr-it.real.interval);
 
timr-it_overrun_last = timr-it_overrun;
timr-it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr-it_overrun +=
+   timr-it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr-it.real.interval);
ret = 

Re: 2.6.24-rc6: possible recursive locking detected

2008-01-05 Thread Davide Libenzi
On Sat, 5 Jan 2008, Peter Zijlstra wrote:

 
 On Sat, 2008-01-05 at 17:53 +0100, Peter Zijlstra wrote:
  On Sat, 2008-01-05 at 18:12 +1100, Herbert Xu wrote:
   On Fri, Jan 04, 2008 at 09:30:49AM +0100, Ingo Molnar wrote:
   
  [ 1310.670986] =
  [ 1310.671690] [ INFO: possible recursive locking detected ]
  [ 1310.672097] 2.6.24-rc6 #1
  [ 1310.672421] -
  [ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock:
  [ 1310.673238]  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
  [ 1310.673869]
  [ 1310.673870] but task is already holding lock:
  [ 1310.674567]  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
  [ 1310.675267]
  [ 1310.675268] other info that might help us debug this:
  [ 1310.675952] 5 locks held by FahCore_a0.exe/3692:
  [ 1310.676334]  #0:  (rcu_read_lock){..--}, at: [c038b620] 
  net_rx_action+0x60/0x1b0
  [ 1310.677251]  #1:  (rcu_read_lock){..--}, at: [c0388d60] 
  netif_receive_skb+0x100/0x470
  [ 1310.677924]  #2:  (rcu_read_lock){..--}, at: [c03a7fb2] 
  ip_local_deliver_finish+0x32/0x210
  [ 1310.678460]  #3:  (clock-AF_INET){-.-?}, at: [c038164e] 
  sock_def_readable+0x1e/0x80
  [ 1310.679250]  #4:  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
   
   The net part might just be a red herring, since the problem is that
   __wake_up is somehow reentering itself.
  
  /*
   * Perform a safe wake up of the poll wait list. The problem is that
   * with the new callback'd wake up system, it is possible that the
   * poll callback is reentered from inside the call to wake_up() done
   * on the poll wait queue head. The rule is that we cannot reenter the
   * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
   * and we cannot reenter the same wait queue head at all. This will
   * enable to have a hierarchy of epoll file descriptor of no more than
   * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
   * because this one gets called by the poll callback, that in turn is called
   * from inside a wake_up(), that might be called from irq context.
   */
  
  Seems to suggest that the epoll code can indeed recurse into wakeup.
  
  Davide, Johannes, any ideas?
 
 Since EP_MAX_POLLWAKE_NESTS  MAX_LOCKDEP_SUBCLASSES we could perhaps do
 something like:
 
   wake_up_nested(..., wake_nests);
 
 although I'm not quite sure that is correct, my understanding of this
 code is still fragile at best.

I remember I talked with Arjan about this time ago. Basically, since 1) 
you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
are used, you can see a wake_up() from inside another wake_up(), but they 
will never refer to the same lock instance.
Think about:

dfd = socket(...);
efd1 = epoll_create();
efd2 = epoll_create();
epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);

When a packet arrives to the device underneath dfd, the net code will 
issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
callback wakeup entry on that queue, and the wake_up() performed by the 
dfd net code will end up in ep_poll_callback(). At this point epoll 
(efd1) notices that it may have some event ready, so it needs to wake up 
the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
that ends up in another wake_up(), after having checked about the 
recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
avoid stack blasting. Never hit the same queue, to avoid loops like:

epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);

The code if (tncur-wq == wq || ... prevents re-entering the same 
queue/lock.
I don't know how the lockdep code works, so I can't say about 
wake_up_nested(). Although I have a feeling is not enough in this case.
A solution may be to move the call to ep_poll_safewake() (that'd become a 
simple wake_up()) inside a tasklet or whatever is today trendy for delayed 
work. But his kinda scares me to be honest, since epoll has already a 
bunch of places where it could be asynchronously hit (plus performance 
regression will need to be verified).



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll design problems with common fork/exec patterns

2007-10-27 Thread Davide Libenzi
On Sat, 27 Oct 2007, Marc Lehmann wrote:

  Please provide some code to illustrate one exact problem you have.
 
// assume there is an open epoll set that listens for events on fd 5
if (fork () = 0)
  {
close (5);
// fd 5 is now removed from the epoll set of the parent.
_exit (0);
  }

Hmmm ... what? I assume you know that:

1) A file descriptor is a userspace view/handle of a kernel object

2) The kernel object has a use-count for as many file descriptors that 
   have been handed out to userspace

3) A close() decreases the internal counter by one

4) The kernel object gets effectively closed when the internal counter 
   goes to zero

5) A fork() acts as a dup() on the file descriptors by hence bumping up 
   its internal counter

6) Epoll removes the file from the set, when the *kernel* object gets 
   closed (internal use-count goes to zero)

With that in mind, how can the code snippet above trigger a removal from 
the epoll set?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll design problems with common fork/exec patterns

2007-10-27 Thread Davide Libenzi
On Sat, 27 Oct 2007, Willy Tarreau wrote:

 On Sat, Oct 27, 2007 at 09:59:07AM -0700, Davide Libenzi wrote:
  On Sat, 27 Oct 2007, Marc Lehmann wrote:
  
Please provide some code to illustrate one exact problem you have.
   
  // assume there is an open epoll set that listens for events on fd 5
  if (fork () = 0)
{
  close (5);
  // fd 5 is now removed from the epoll set of the parent.
  _exit (0);
}
  
  Hmmm ... what? I assume you know that:
  
  1) A file descriptor is a userspace view/handle of a kernel object
  
  2) The kernel object has a use-count for as many file descriptors that 
 have been handed out to userspace
  
  3) A close() decreases the internal counter by one
  
  4) The kernel object gets effectively closed when the internal counter 
 goes to zero
  
  5) A fork() acts as a dup() on the file descriptors by hence bumping up 
 its internal counter
  
  6) Epoll removes the file from the set, when the *kernel* object gets 
 closed (internal use-count goes to zero)
  
  With that in mind, how can the code snippet above trigger a removal from 
  the epoll set?
 
 Davide,
 
 from what I understand, Marc is not asking for the code above to remove
 the fd from the epoll set, but he's in fact complaining that he *observed*
 that the fd was removed from the epoll set in the *parent* process when
 the child closes it, which is of course not expected at all. As strange
 as it looks like, this might need investigation. It is possible that there
 is some strange bug somewhere in some kernel versions.

That would be *really* strange, since epoll hooks in __fput() in order to 
perform proper cleanup. This means that, in the case above, the file will 
be really closed in the parent too. That, I think, would trigger way more 
serious problems in userspace.



 Marc, I think that if you indicate the last kernel version on which you
 observed this and provide a very short and easy reproducer, it would
 help everyone investigating this. Basically something which reports OK
 or KO.

Of course. That'd be great.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-28 Thread Davide Libenzi
On Sat, 27 Oct 2007, David Schwartz wrote:

   I don't see how that can be. Suppose I add fd 8 to an epoll set. 
 Suppose fd
 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
 since there no longer is an fd 8? Events on files registered for epoll
 notification are reported by descriptor, so the set membership has to be
 associated (as reflected into userspace) with the descriptor, not the file.

Eric already answered to your question (epoll deals with internal kernel 
objects - aka file*).
I just want to answer this one for another reason. WTF is wrong with all 
of you Cc-list-trimmers?
Could you *please* stop trimming Cc-lists?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-29 Thread Davide Libenzi
On Sun, 28 Oct 2007, David Schwartz wrote:

 
 Eric Dumazet wrote:
 
  Events are not necessarly reported by descriptors. epoll uses an opaque
  field provided by the user.
 
  It's up to the user to properly chose a tag that will makes sense
  if the user
  app is playing dup()/close() games for example.
 
 Great. So the only issue then is that the documentation is confusing. It
 frequently uses the term fd where it means file. For example, it says:
 
   Q1 What  happens  if  you  add  the  same fd to an
 epoll_set
  twice?
 
   A1 You will probably get EEXIST.  However,  it  is
 possible
  that  two  threads  may  add the same fd twice. This is
 a
  harmless condition.
 
 This gives no reason to think there's anything wrong with adding the same
 file twice so long as you do so through different descriptors. (One can
 imagine an application that does this to segregate read and write operations
 to avoid a race where the descriptor is closed from under a writer due to
 handling a fatal read error.) Obviously, that won't work.

I agree, that is confusing. However, you can safely add two different file 
descriptors pointing to the same file*, with different event masks, and 
that will work as expected.




 And this part:
 
   Q6 Will  the  close of an fd cause it to be removed from
 all
  epoll sets automatically?
 
   A6 Yes.
 
 This is incorrect. Closing an fd will not cause it to be removed from all
 epoll sets automatically. Only closing a file will. This is what caused the
 OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

OTOH you cannot list *every* possible scenario in a man page, otherwise 
you end up writing a book instead of a man page. I will try to find some 
time with Michael to refine the man page.



 PS: It is customary to trim individuals off of CC lists when replying to a
 list when the subject matter of the post is squarely inside the subject of
 the list. If the person CC'd was interested in the list's subject, he or she
 would presumably subscribe to the list. Not everyone wants two copies of
 every post. Not everyone wants a personal copy of every sub-thread that
 results from a post they make. In the past few years, I've received
 approximately an equal number of complaints about trimming CC's on posts to
 LKML and not trimming CC's on such posts.

Does anyone that in 2007 still did not manage to find a way to avoid dups 
in hitting his mailbox, deserve any consideration at all?
OTOH many ppl, like myself, uses To and Cc header to direct email to 
proper folders, where they are treated with a different level of 
attention. And your stripp-all-headers mania screws that up badly.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Revised signalfd man-page

2007-10-17 Thread Davide Libenzi
On Wed, 17 Oct 2007, Michael Kerrisk wrote:

  With the new code Linus already merged, signalfd does not attach to the 
  sighand anymore, so the orphaned sighand behaviour is no more there.
  An exec() will carry the fd over, and you will be able to use the fd in 
  the same way you did before the exec(). If sigpending()/sigwaitinfo() will 
  show signals available, so it should signalfd.
 
 So I wrote:
 
execve(2) semantics
Just  like  any  other  file  descriptor, a signalfd file
descriptor remains open across an  execve(2),  unless  it
has  been  marked  for close-on-exec (see fcntl(2)).  Any
signals  that  were  available  for  reading  before  the
execve(2)  remain  available to the newly loaded program.
(This is analogous to traditional signal semantics, where
a  blocked  signal that is pending remains pending across
an execve(2).)  (This is analogous to traditional  signal
semantics, where a blocked signal that is pending remains
pending across an execve(2).)
 
 Okay?

Ok



  It'll return the signals that would be normally returned to the thread 
  with the standard signal delivery methods. That is, calling thread private 
  signals, and calling thread-group shared signals.
 
 So I wrote:
 
Thread semantics
The semantics of signalfd file descriptors  in  a  multi-
threaded  program  mirror the standard semantics for sig-
nals.  In other words, when a thread reads  from  a  sig-
nalfd  file descriptor, it will read the signals that are
directed to the thread itself and the  signals  that  are
directed  to the process (i.e., the entire thread group).
(A thread will not be  able  to  read  signals  that  are
directed to other threads in the process.)
 
 Okay?

Ok, although this looks more specific:

(A thread will not be able to read other threads private signals)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-10 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by properly 
initializing the data.


Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/eventpoll.c |2 +-
 include/linux/rbtree.h |   12 
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
@@ -260,7 +260,7 @@
 /* Special initialization for the RB tree node to detect linkage */
 static inline void ep_rb_initnode(struct rb_node *n)
 {
-   rb_set_parent(n, n);
+   rb_init_node(n, n);
 }
 
 /* Removes a node from the RB tree and marks it for a fast is-linked check */
Index: linux-2.6.mod/include/linux/rbtree.h
===
--- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 12:36:13.0 
-0800
+++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
-0800
@@ -112,6 +112,18 @@
struct rb_node *rb_node;
 };
 
+/**
+ * rb_init_node - Initializes the node internal data
+ *
+ * @node: Pointer to the RB-Tree node
+ * @parent: Pointer to the parent node, or NULL
+ *
+ */
+static inline void rb_init_node(struct rb_node *node, struct rb_node *parent)
+{
+   node-rb_parent_color = (unsigned long) parent;
+   node-rb_left = node-rb_right = NULL;
+}
 
 #define rb_parent(r)   ((struct rb_node *)((r)-rb_parent_color  ~3))
 #define rb_color(r)   ((r)-rb_parent_color  1)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by simplifying the 
code to get rid of a few lines that became superfluous after the previous 
epoll changes.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-11 15:31:02.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-11 15:32:46.0 -0800
@@ -257,25 +257,6 @@
(p1-file  p2-file ? -1 : p1-fd - p2-fd));
 }
 
-/* Special initialization for the RB tree node to detect linkage */
-static inline void ep_rb_initnode(struct rb_node *n)
-{
-   rb_set_parent(n, n);
-}
-
-/* Removes a node from the RB tree and marks it for a fast is-linked check */
-static inline void ep_rb_erase(struct rb_node *n, struct rb_root *r)
-{
-   rb_erase(n, r);
-   rb_set_parent(n, n);
-}
-
-/* Fast check to verify that the item is linked to the main RB tree */
-static inline int ep_rb_linked(struct rb_node *n)
-{
-   return rb_parent(n) != n;
-}
-
 /* Tells us if the item is currently linked */
 static inline int ep_is_linked(struct list_head *p)
 {
@@ -283,13 +264,13 @@
 }
 
 /* Get the struct epitem from a wait queue pointer */
-static inline struct epitem * ep_item_from_wait(wait_queue_t *p)
+static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
return container_of(p, struct eppoll_entry, wait)-base;
 }
 
 /* Get the struct epitem from an epoll queue wrapper */
-static inline struct epitem * ep_item_from_epqueue(poll_table *p)
+static inline struct epitem *ep_item_from_epqueue(poll_table *p)
 {
return container_of(p, struct ep_pqueue, pt)-epi;
 }
@@ -411,8 +392,7 @@
list_del_init(epi-fllink);
spin_unlock(file-f_ep_lock);
 
-   if (ep_rb_linked(epi-rbn))
-   ep_rb_erase(epi-rbn, ep-rbr);
+   rb_erase(epi-rbn, ep-rbr);
 
spin_lock_irqsave(ep-lock, flags);
if (ep_is_linked(epi-rdllink))
@@ -728,7 +708,6 @@
goto error_return;
 
/* Item initialization follow here ... */
-   ep_rb_initnode(epi-rbn);
INIT_LIST_HEAD(epi-rdllink);
INIT_LIST_HEAD(epi-fllink);
INIT_LIST_HEAD(epi-pwqlist);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
On Mon, 11 Feb 2008, Andrew Morton wrote:

 On Sun, 10 Feb 2008 13:32:01 -0800 (PST)
 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
  rb_set_parent() accesses node's pointer in its code. This creates a 
  warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
  memory access. The warning is harmless since the following rb-tree node 
  insert is going to overwrite the node data. In any case I think it's 
  better to not have that happening at all, and fix it by properly 
  initializing the data.
  
  
  Signed-off-by: Davide Libenzi [EMAIL PROTECTED]
  
  
  - Davide
  
  
  ---
   fs/eventpoll.c |2 +-
   include/linux/rbtree.h |   12 
   2 files changed, 13 insertions(+), 1 deletion(-)
  
  Index: linux-2.6.mod/fs/eventpoll.c
  ===
  --- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 
  -0800
  +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
  @@ -260,7 +260,7 @@
   /* Special initialization for the RB tree node to detect linkage */
   static inline void ep_rb_initnode(struct rb_node *n)
   {
  -   rb_set_parent(n, n);
  +   rb_init_node(n, n);
   }
   
   /* Removes a node from the RB tree and marks it for a fast is-linked check 
  */
  Index: linux-2.6.mod/include/linux/rbtree.h
  ===
  --- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 
  12:36:13.0 -0800
  +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
  -0800
  @@ -112,6 +112,18 @@
  struct rb_node *rb_node;
   };
   
  +/**
  + * rb_init_node - Initializes the node internal data
  + *
  + * @node: Pointer to the RB-Tree node
  + * @parent: Pointer to the parent node, or NULL
  + *
  + */
  +static inline void rb_init_node(struct rb_node *node, struct rb_node 
  *parent)
  +{
  +   node-rb_parent_color = (unsigned long) parent;
  +   node-rb_left = node-rb_right = NULL;
  +}
 
 Is epoll the only rbtree-using code which exhibits this problem?  If so,
 what is epoll doing differently from all the others?

Dunno, but I don't think so. Epoll initializes the node to a state so that 
later on can check if the file-fd item is inserted or not. And it uses the 
parent information for that. But rb_set_parent() (that is used in the 
current initialization code) uses the data in the node, and this triggers 
the uninitialized memory access.
Taking a better look at it, the code (after the latest changes) has no 
more point-of-failures after the rb-tree insert, so I can probably avoid 
doing the node's parent-initialization and check altogether.
Yeah, let's that ;)

 eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Andrew, drop that one. I'm gonna send the new one after some test...



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi
On Thu, 20 Sep 2007, Nagendra Tomar wrote:

 
 --- Davide Libenzi [EMAIL PROTECTED] wrote:
 
  Looking back at it, I think the current TCP code is right, once you look 
  at the event to be a output buffer full-with_space transition.
  If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
  (free space on the output buffer), if you do not consume it (say a 
  tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
  anymore since they happen on the full-with_space transition.
  Yes, I know, the read size (EPOLLIN) works differently and you get an 
  event for every packet you receive. And yes, I do not like asymmetric 
  things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.
  
 
 I agree that ET means the event should happen at the transition
 from nospace-space condition, but isn't the other case (event is 
 delivered every time the event actually happens) more usable.
 Also the epoll man page says so
 
 ... Edge Triggered event distribution delivers events only when 
 events happens on the  monitored file.
 
 This serves the purpose of ET (reducing the number of poll events) and
 at the same time makes userspace coding easier. My userspace code
 has the liberty of deciding when it can write to the socket. f.e. the
 sendfile buffer management example that I quoted in my earlier post
 will be difficult with the current ET|POLLOUT behaviour. I cannot 
 write in full-buffer units. I'll ve to write partial buffers just to 
 fill the TCP writeq which is needed to trigger the event.

That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
ability to write, and it is not meant as to signal every time a packet 
(skb) is sent on the wire (and the buffer released).
In your particular application, you could simply split the sendfile into 
appropriately sized chunks, and handle the buffer realease in there.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-21 Thread Davide Libenzi
On Thu, 20 Sep 2007, Nagendra Tomar wrote:

  That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
  ability to write, and it is not meant as to signal every time a packet 
  (skb) is sent on the wire (and the buffer released).
 
 Aren't they both the same ? Everytime an incoming ACK frees up a buffer
 from the retransmit queue, the writability condition is freshly asserted,
 much the same way as the readability condition is asserted everytime a 
 new data is queued in the socket receive queue (irrespective of 
 whether there was data already waiting to be read in the receive queue).
 
 This difference in meaning of POLLOUT only arises in the ET case, which was
 not what traditional Unix poll referred to. 

Again, events here are readability and writeability and the fact that 
the lower network layer freed a buffer is not, per se, an event (unless 
before, writeability was tested and reported as unavailable).
In you case the solution looks pretty simple. Just create appropriately 
sized buffers, split the single sendfile into multiple buffer-sized ones, 
and recycle the buffer once each of them completes.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: A revised timerfd API

2007-09-22 Thread Davide Libenzi
On Sat, 22 Sep 2007, Michael Kerrisk wrote:

 So I'm inclined to implement option (b), unless someone has strong
 objections.  Davide, could I persuade you to help?

I guess I better do, otherwise you'll continue to stress me ;)

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

Patch below. Builds, not tested yet (you need to remove the broken 
status from CONFIG_TIMERFD in case you want to test - and plug the new 
syscall to arch/xxx).
May that work for you?
Thomas-san, hrtimer_try_to_cancel() does not touch -expires and I assume
it'll never do, granted?



- Davide



---
 fs/compat.c  |   32 --
 fs/timerfd.c |  144 +--
 include/linux/compat.h   |7 +-
 include/linux/syscalls.h |7 +-
 4 files changed, 126 insertions(+), 64 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-22 12:22:19.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-22 13:21:21.0 -0700
@@ -23,6 +23,7 @@
 
 struct timerfd_ctx {
struct hrtimer tmr;
+   int clockid;
ktime_t tintv;
wait_queue_head_t wqh;
int expired;
@@ -46,7 +47,7 @@
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -58,7 +59,7 @@
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -150,76 +151,109 @@
.read   = timerfd_read,
 };
 
-asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
-   const struct itimerspec __user *utmr)
+asmlinkage long sys_timerfd_create(int clockid)
 {
-   int error;
+   int error, ufd;
struct timerfd_ctx *ctx;
struct file *file;
struct inode *inode;
-   struct itimerspec ktmr;
-
-   if (copy_from_user(ktmr, utmr, sizeof(ktmr)))
-   return -EFAULT;
 
if (clockid != CLOCK_MONOTONIC 
clockid != CLOCK_REALTIME)
return -EINVAL;
+
+   ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+   if (!ctx)
+   return -ENOMEM;
+
+   init_waitqueue_head(ctx-wqh);
+   ctx-clockid = clockid;
+
+   error = anon_inode_getfd(ufd, inode, file, [timerfd],
+timerfd_fops, ctx);
+   if (error)
+   goto err_kfree_ctx;
+
+   return ufd;
+
+err_kfree_ctx:
+   kfree(ctx);
+   return error;
+}
+
+asmlinkage long sys_timerfd_settime(int ufd, int flags,
+   const struct itimerspec __user *utmr,
+   struct itimerspec __user *otmr)
+{
+   struct file *file;
+   struct timerfd_ctx *ctx;
+   struct itimerspec ktmr, kotmr;
+
+   if (copy_from_user(ktmr, utmr, sizeof(ktmr)))
+   return -EFAULT;
+
if (!timespec_valid(ktmr.it_value) ||
!timespec_valid(ktmr.it_interval))
return -EINVAL;
 
-   if (ufd == -1) {
-   ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
-   if (!ctx)
-   return -ENOMEM;
-
-   init_waitqueue_head(ctx-wqh);
-
-   timerfd_setup(ctx, clockid, flags, ktmr);
-
-   /*
-* When we call this, the initialization must be complete, since
-* anon_inode_getfd() will install the fd.
-*/
-   error = anon_inode_getfd(ufd, inode, file, [timerfd],
-timerfd_fops, ctx);
-   if (error)
-   goto err_tmrcancel;
-   } else {
-   file = fget(ufd);
-   if (!file)
-   return -EBADF;
-   ctx = file-private_data;
-   if (file-f_op != timerfd_fops) {
-   fput(file);
-   return -EINVAL;
-   }
-   /*
-* We need to stop the existing timer before reprogramming
-* it to the new values.
-*/
-   for (;;) {
-   spin_lock_irq(ctx-wqh.lock);
-   if (hrtimer_try_to_cancel(ctx-tmr) = 0)
-   break;
-   spin_unlock_irq(ctx-wqh.lock);
-   cpu_relax();
- 

Re: RFC: A revised timerfd API

2007-09-22 Thread Davide Libenzi
On Sat, 22 Sep 2007, Thomas Gleixner wrote:

 On Sat, 2007-09-22 at 14:07 -0700, Davide Libenzi wrote:
  On Sat, 22 Sep 2007, Michael Kerrisk wrote:
  
   So I'm inclined to implement option (b), unless someone has strong
   objections.  Davide, could I persuade you to help?
  
  I guess I better do, otherwise you'll continue to stress me ;)
  
  int timerfd_create(int clockid);
  int timerfd_settime(int ufd, int flags,
  const struct itimerspec *utmr,
  struct itimerspec *otmr);
  int timerfd_gettime(int ufd, struct itimerspec *otmr);
  
  Patch below. Builds, not tested yet (you need to remove the broken 
  status from CONFIG_TIMERFD in case you want to test - and plug the new 
  syscall to arch/xxx).
  May that work for you?
  Thomas-san, hrtimer_try_to_cancel() does not touch -expires and I assume
  it'll never do, granted?
 
 Davide-san, I have no intention to change that, but remember there is
 this file Documentation/stable_api_nonsense.txt :)

Heh, I guess that'll work then ;)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: A revised timerfd API

2007-09-23 Thread Davide Libenzi
On Sun, 23 Sep 2007, Michael Kerrisk wrote:

 I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown
 in the definitions below.  When I ran the the program below, my system
 immediately froze.  Can you try it on your system please.

There's an hrtimer_init() missing in timerfd_create(). I'll refactor the 
patch.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: A revised timerfd API

2007-09-23 Thread Davide Libenzi
On Sun, 23 Sep 2007, Davide Libenzi wrote:

 On Sun, 23 Sep 2007, Michael Kerrisk wrote:
 
  I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown
  in the definitions below.  When I ran the the program below, my system
  immediately froze.  Can you try it on your system please.
 
 There's an hrtimer_init() missing in timerfd_create(). I'll refactor the 
 patch.

There's the case of a timerfd_gettime return status when the timerfd has 
not been set yet (ie, soon after a timerfd_create), to handle.
Current way is to return an (itimerspec) { 0, 0 }. Ok?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/3] new timerfd API - new timerfd API

2007-09-23 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 168 insertions(+), 77 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-23 15:18:09.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-23 15:25:55.0 -0700
@@ -23,15 +23,17 @@
 
 struct timerfd_ctx {
struct hrtimer tmr;
+   int clockid;
ktime_t tintv;
wait_queue_head_t wqh;
int expired;
+   u64 ticks;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,23 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
+   ticks += (u64)
hrtimer_forward(ctx-tmr,
hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired = 0;
+   ctx

[patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-23 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/i386/kernel/syscall_table.S |5 -
 arch/x86_64/ia32/ia32entry.S |4 +++-
 include/asm-i386/unistd.h|6 --
 include/asm-x86_64/unistd.h  |8 ++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/arch/i386/kernel/syscall_table.S  2007-09-23 
15:28:52.0 -0700
@@ -321,6 +321,9 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
+
Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S  2007-09-23 15:28:52.0 
-0700
@@ -730,7 +730,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:
Index: linux-2.6.mod/include/asm-i386/unistd.h
===
--- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-23 15:28:52.0 
-0700
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86_64/unistd.h
===
--- linux-2.6.mod.orig/include/asm-x86_64/unistd.h  2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/include/asm-x86_64/unistd.h   2007-09-23 15:28:52.0 
-0700
@@ -626,12 +626,16 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/3] new timerfd API - un-break CONFIG_TIMERFD

2007-09-23 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-09-23 15:18:06.0 -0700
+++ linux-2.6.mod/init/Kconfig  2007-09-23 15:28:54.0 -0700
@@ -488,7 +488,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi
On Mon, 24 Sep 2007, Michael Kerrisk wrote:

 Davide,
 
 Is it perhaps not better to group the three syscalls contiguously with
 respect to syscall numbers?  The old timerfd slot can be re-used for some
 other syscall later.

There's no problem if they're not contiguous. Holes, unless filled 
immediately, need to be remembered to be filled.


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3] new timerfd API - new timerfd API

2007-09-24 Thread Davide Libenzi
On Mon, 24 Sep 2007, Thomas Gleixner wrote:

   struct timerfd_ctx {
  struct hrtimer tmr;
  +   int clockid;
  ktime_t tintv;
  wait_queue_head_t wqh;
  int expired;
  +   u64 ticks;
   };
 
 Can you please restructure the struct in a way which does not result in
 padding by the compiler ?
 
 struct timerfd_ctx {
   struct hrtimer tmr;
   ktime_t tintv;
   wait_queue_head_t wqh;
   u64 ticks;
   int expired;
   int clockid;
 };

Sure.



  +   ticks += (u64)
  hrtimer_forward(ctx-tmr,
  hrtimer_cb_get_time(ctx-tmr),
 
 You need to use ctx-tmr.base-get_time() here, otherwise you might read
 a stale time value (in case that CONFIG_HIGH_RES_TIMERS is off).

Is the particular position of hrtimer_cb_get_time() in the code that would 
break here? Because function was added by your patch ;)
Did something change later?



  +err_kfree_ctx:
  +   kfree(ctx);
  +   return error;
 
 You really can avoid the goto here.

Ack.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi
On Mon, 24 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
 Davide Libenzi wrote:
  On Mon, 24 Sep 2007, Michael Kerrisk wrote:
  Is it perhaps not better to group the three syscalls contiguously with
  respect to syscall numbers?  The old timerfd slot can be re-used for some
  other syscall later.
  
  There's no problem if they're not contiguous. 
 
 I realise there is no problem, in a technical sense.  But it strikes me as
 more aesthetic to make related syscalls numerically contiguous.  Thus, we
 see such as the following in the kernel source
 
 #define __NR_epoll_create   254
 #define __NR_epoll_ctl  255
 #define __NR_epoll_wait 256
 
 and
 
 #define __NR_timer_create   259
 #define __NR_timer_settime  (__NR_timer_create+1)
 #define __NR_timer_gettime  (__NR_timer_create+2)
 #define __NR_timer_getoverrun   (__NR_timer_create+3)
 #define __NR_timer_delete   (__NR_timer_create+4)
 
 and
 
 #define __NR_inotify_init   291
 #define __NR_inotify_add_watch  292
 #define __NR_inotify_rm_watch   293
 
  Holes, unless filled 
  immediately, need to be remembered to be filled.
 
 Well, in the past it seems they do get filled soon enough though.  There's
 fair odds that you'll be the one to fill it with the next syscall you write
 ;-).

You have to talk to arch mantainers. I do not care. I simply provided the 
x86 hooks because I tested on x86.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] new timerfd API v2 - introduce a new hrtimer_forward_now() function

2007-09-24 Thread Davide Libenzi
I think that advancing the timer against the timer's current now can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-09-24 12:27:20.0 
-0700
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-09-24 12:29:39.0 
-0700
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer-base-get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec __user *rmtp,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] new timerfd API v2 - new timerfd API

2007-09-24 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-24 12:26:19.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-24 12:31:22.0 -0700
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(ctx-tmr,
-   hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ticks += (u64) hrtimer_forward_now(ctx-tmr,
+  ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired = 0

[patch 3/4] new timerfd API v2 - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/i386/kernel/syscall_table.S |5 -
 arch/x86_64/ia32/ia32entry.S |4 +++-
 include/asm-i386/unistd.h|6 --
 include/asm-x86_64/unistd.h  |8 ++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/arch/i386/kernel/syscall_table.S  2007-09-24 
12:31:39.0 -0700
@@ -321,6 +321,9 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
+
Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S  2007-09-24 12:31:39.0 
-0700
@@ -730,7 +730,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:
Index: linux-2.6.mod/include/asm-i386/unistd.h
===
--- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-24 12:31:39.0 
-0700
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86_64/unistd.h
===
--- linux-2.6.mod.orig/include/asm-x86_64/unistd.h  2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/include/asm-x86_64/unistd.h   2007-09-24 12:31:39.0 
-0700
@@ -626,12 +626,16 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] new timerfd API v2 - un-break CONFIG_TIMERFD

2007-09-24 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-09-24 12:13:25.0 -0700
+++ linux-2.6.mod/init/Kconfig  2007-09-24 12:31:40.0 -0700
@@ -488,7 +488,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/2] Enable link power management for ata drivers

2007-09-24 Thread Davide Libenzi
On Tue, 25 Sep 2007, roel wrote:

  +   if (!(ap-flags  ATA_FLAG_IPM) || !ata_dev_enabled(dev)) {
 
   if (!((ap-flags  ATA_FLAG_IPM)  ata_dev_enabled(dev))) {

int foo(int i, int j) {

return !(i  8) || !j;
}

int moo(int i, int j) {

return !((i  8)  j);
}


gcc -O2 -S:


.globl foo
.type   foo, @function
foo:
shrl$3, %edi
xorl$1, %edi
testl   %esi, %esi
sete%al
orl %eax, %edi
andl$1, %edi
movl%edi, %eax
ret
.globl moo
.type   moo, @function
moo:
shrl$3, %edi
xorl$1, %edi
testl   %esi, %esi
sete%al
orl %eax, %edi
andl$1, %edi
movl%edi, %eax
ret



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] new timerfd API v2 - new timerfd API

2007-09-25 Thread Davide Libenzi
On Tue, 25 Sep 2007, Jonathan Corbet wrote:

 One quick question:
 
  Like the previous timerfd API implementation, read(2) and poll(2) are 
  supported
  (with the same interface).
 
 Looking at that interface, it appears that a process doing a read() on a
 timerfd with no timer set will block for a very long time.  It's an
 obvious don't do that situation, but perhaps we could help an
 occasional developer get a clue by returning something like -EINVAL when
 the timer has not been set?

That is the same as you try to read once more after an expired timer. You 
won't wake up until the next timer event will show up. That is, after at 
most TP time for periodic timers, or after the time  the next 
timerfd_settime() will setup.
I'd like to keep the timerfd not set yet and the timerfd already 
expired and not re-armed acting the same way. That is, wait till next 
event happen (unless O_NONBLOCK of course).



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Man page for revised timerfd API

2007-09-26 Thread Davide Libenzi

Michael, SCB ...


On Wed, 26 Sep 2007, Michael Kerrisk wrote:

 .TH TIMERFD_CREATE 2 2007-09-26 Linux Linux Programmer's Manual
 .SH NAME
 timerfd_create, timerfd_settime, timer_gettime \-
 timers that notify via file descriptors
 .SH SYNOPSIS
 .\ FIXME . This header file may well change
 .\ FIXME . Probably _GNU_SOURCE will be required
 .\ FIXME . May require: Link with \fI\-lrt\f
 .nf
 .B #include sys/timerfd.h
 .sp
 .BI int timerfd_create(int  clockid );
 .sp
 .BI int timerfd_settime(int  fd , int  flags ,
 .BI const struct itimerspec * new_value ,
 .BI struct itimerspec * curr_value );
 .sp
 .BI int timerfd_gettime(int  fd , struct itimerspec * curr_value );
 .fi
 .SH DESCRIPTION
 These system calls create and operate on a timer
 that delivers timer expiration notifications via a file descriptor.
 They provide an alternative to the use of
 .BR setitimer (2)
 or
 .BR timer_create (3),
 with the advantage that the file descriptor may be monitored by
 .BR poll (2)
 and
 .BR select (2).

epoll, no?




 The use of these three system calls is analogous to the use of
 .BR timer_create (2),
 .BR timer_settime (2),
 and
 .BR timer_gettime (2).
 .\
 .SS timerfd_create()
 .BR timerfd_create ()
 creates a new timer object,
 and returns a file descriptor that refers to that timer.
 The
 .I clockid
 argument specifies the clock that is used to mark the progress
 of the timer, and must be either
 .B CLOCK_REALTIME
 or
 .BR CLOCK_MONOTONIC .
 .B CLOCK_REALTIME
 is a settable system-wide clock.
 .B CLOCK_MONOTONIC
 is a non-settable clock that is not affected
 by discontinuous changes in the system clock
 (e.g., manual changes to system time).
 The current value of each of these clocks can be retrieved using
 .BR clock_gettime (3).
 .\
 .SS timerfd_settime()
 .BR timerfd_settime ()
 arms (starts) or disarms (stops)
 the timer referred to by the file descriptor
 .IR fd .
 
 The
 .I new_value
 argument specifies the initial expiration and interval for the timer.
 The
 .I itimer
 structure used for this argument contains two fields,
 each of which is in turn a structure of type
 .IR timespec :
 .in +0.25i
 .nf
 
 struct timespec {
 time_t tv_sec;/* Seconds */
 long   tv_nsec;   /* Nanoseconds */
 };
 
 struct itimerspec {
 struct timespec it_interval;  /* Interval for periodic timer */
 struct timespec it_value; /* Initial expiration */
 };
 .fi
 .in
 .PP
 .I new_value.it_value
 specifies the initial expiration of the timer,
 in seconds and nanoseconds.
 Setting either field of
 .I new_value.it_value
 to a non-zero value arms the timer.
 Setting both fields of
 .I new_value.it_value
 to zero disarms the timer.
 
 Setting one or both fields of
 .I new_value.it_interval
 to non-zero values specifies the period, in seconds and nanoseconds,
 for repeated timer expirations after the initial expiration.
 If both fields of
 .I new_value.it_interval
 are zero, the timer expires just once, at the time specified by
 .IR new_value.it_value .
 
 The
 .I flags
 argument is either 0, to start a relative timer
 .RI ( new_value.it_interval
 specifies a time relative to the current value of the clock specified by
 .IR clockid ),
 or
 .BR TFD_TIMER_ABSTIME ,
 to start an absolute timer
 .RI ( new_value.it_interval
 specifies an absolute time for the clock specified by
 .IR clockid ;
 that is, the timer will expire when the value of that
 clock reaches the value specified in
 .IR new_value.it_interval ).
 
 The
 .I curr_value
 argument returns a structure containing the setting of the timer that
 was current at the time of the call; see the description of
 .BR timerfd_gettime ()
 following.
 .\
 .SS timerfd_gettime()
 .BR timerfd_gettime ()
 returns, in
 .IR curr_value ,
 an
 .IR itimerspec
 that contains the current setting of the timer
 referred to by the file descriptor
 .IR fd .
 
 The
 .I it_value
 field returns the amount of time
 until the timer will next expire.
 If both fields of this structure are zero,
 then the timer is currently disarmed.
 This field always contains a relative value, regardless of whether the
 .BR TFD_TIMER_ABSTIME
 flag was specified when setting the timer.
 
 The
 .I it_interval
 field returns the interval of the timer.
 If both fields of this structure are zero,
 then the timer is set to expire just once, at the time specified by
 .IR curr_value.it_value .
 .SS Operating on a timer file descriptor
 The file descriptor returned by
 .BR timerfd_create (2)
 supports the following operations:
 .TP
 .BR read (2)
 If the timer has already expired one or more times since it was created,
 or since the last
 .BR read (2),
 then the buffer given to
 .BR read (2)
 returns an unsigned 8-byte integer
 .RI ( uint64_t )
 containing the number of expirations that have occurred.
 .IP
 If no timer expirations have occurred at the time of the
 .BR read (2),
 then the call either blocks until the next timer expiration,
 or fails with the 

Re: Man page for revised timerfd API

2007-09-27 Thread Davide Libenzi
On Thu, 27 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
 A follow up to the man page text.  Does passing a timerfd file
 descriptor via a Unix domain socket to another process do the
 expected thing?  That is, the receiving process will be able to 
 read from the file descriptor in order to obtain notification
 of timer expirations that are occurring for the process
 that sent the file descriptor, right?

Yup.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Man page for revised timerfd API

2007-09-27 Thread Davide Libenzi
On Thu, 27 Sep 2007, Michael Kerrisk wrote:

 Davide,
 
 A further question: what is the expected behavior in the
 following scenario:
 
 1. Create a timerfd and arm it.
 2. Wait until M timer expirations have occurred
 3. Modify the settings of the timer
 4. Wait for N further timer expirations have occurred
 5. read() from the timerfd
 
 Does the buffer returned by the read() contain the value
 N or (M+N)?  In other words, should modifying the timer
 settings reset the expiration count to zero?

Every timerfd_settime() zeroes the tick counter. So in your scenario it'll 
return N.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: eventfd.2 man page

2007-09-27 Thread Davide Libenzi
On Thu, 27 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
 I've slightly tweaked the eventfd.2 man page in preparation for adding it
 to the man-pages set.  Could you please review the text below, and confirm
 that it correctly describes intended behavior.

Looks good to me. At the time that I was tsting it, I cooked an example 
about how to use an eventfd with KAIO:

http://www.xmailserver.org/eventfd-aio-test.c

In case you'd like to add a sketch to the man page.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Revised signalfd man-page

2007-09-27 Thread Davide Libenzi
On Thu, 27 Sep 2007, Michael Kerrisk wrote:

 .\ Copyright (C) 2007 Michael Kerrisk [EMAIL PROTECTED]
 .\ starting from a version by Davide Libenzi [EMAIL PROTECTED]
 .\
 .\ This program is free software; you can redistribute it and/or modify
 .\ it under the terms of the GNU General Public License as published by
 .\ the Free Software Foundation; either version 2 of the License, or
 .\ (at your option) any later version.
 .\
 .\ This program is distributed in the hope that it will be useful,
 .\ but WITHOUT ANY WARRANTY; without even the implied warranty of
 .\ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 .\ GNU General Public License for more details.
 .\
 .\ You should have received a copy of the GNU General Public License
 .\ along with this program; if not, write to the Free Software
 .\ Foundation, Inc., 59 Temple Place, Suite 330, Boston,
 .\ MA  02111-1307  USA
 .\
 .TH SIGNALFD 2 2007-09-27 Linux Linux Programmer's Manual
 .SH NAME
 signalfd \- create a file descriptor for accepting signals
 .SH SYNOPSIS
 .\ FIXME . This header file may well change
 .\ FIXME . Probably _GNU_SOURCE will be required
 .\ FIXME . May require: Link with \fI\-lrt\f
 .B #include sys/signalfd.h
 .sp
 .BI int signalfd(int  fd , const sigset_t * mask );
 .\ Almost certainly the glibc wrapper will hide this argument:
 .\ , size_t  sizemask
 .SH DESCRIPTION
 .BR signalfd (2)
 creates a file descriptor that can be used to accept signals
 targeted at the caller.
 This provides an alternative to the use of a signal handler or
 .BR sigwaitinfo (2),
 and has the advantage that the file descriptor may be monitored by
 .BR select (2),
 .BR poll (2),
 and
 .BR epoll (7).
 The
 .I mask
 argument specifies the set of signals that the caller
 wishes to accept via the file descriptor.
 This argument is a signal set whose contents can be initialized
 using the macros described in
 .BR sigsetops (3).
 Normally, the set of signals to be received via the
 file descriptor should be blocked using
 .BR sigprocmask (2),
 to prevent the signals being handled according to their default
 dispositions.
 It is not possible to receive
 .B SIGKILL
 or
 .B SIGSTOP
 signals via a signalfd file descriptor;
 these signals are silently ignored if specified in
 .IR mask .
 
 If the
 .I fd
 argument is \-1,
 then the call creates a new file descriptor and associates the
 signal set specified in
 .I mask
 with that descriptor.
 If
 .I fd
 is not \-1,
 then it must specify a valid existing signalfd file descriptor, and
 .I mask
 is used to replace the signal set associated with that descriptor.
 
 .BR signalfd (2)
 returns a file descriptor that supports the following operations:
 .TP
 .BR read (2)
 If one or more of the signals specified in
 .I mask
 is pending for the process, then the buffer supplied to
 .BR read (2)
 is used to return one or more
 .I signalfd_siginfo
 structures (see below) that describe the signals.
 The
 .BR read (2)
 returns information for as many signals as are pending and will
 fit in the supplied buffer.
 The buffer must be at least
 .I sizeof(struct signalfd_siginfo)
 bytes.
 The return value of the
 .BR read (2)
 is the total number of bytes read.
 .IP
 As a consequence of the
 .BR read (2),
 the signals are consumed,
 so that they are no longer pending for the process
 (i.e., will not be caught by signal handlers,
 and cannot be accepted using
 .BR sigwaitinfo (2)).
 .IP
 If none of the signals in
 .I mask
 is pending for the process, then the
 .BR read (2)
 either blocks until one of the signals in
 .I mask
 is generated for the process,
 or fails with the error
 .B EAGAIN
 if the file descriptor has been made non-blocking
 (via the use of the
 .BR fcntl (2)
 .B F_SETFL
 operation to set the
 .B O_NONBLOCK
 flag).
 
 .\ FIXME Davide, what does the following mean?  How (in userspace
 .\ terms) does a sighand structure become orphaned?
 The
 .BR read (2)
 call can also return 0,
 in case the sighand structure to which the signalfd was attached,
 has been orphaned.

You can remove the five lines above, in virtue of the fact that Linus 
merged my simplification patch.




 .TP
 .BR poll (2),  select (2) (and similar)
 The file descriptor is readable
 (the
 .BR select (2)
 .I readfds
 argument; the
 .BR poll (2)
 .B POLLIN
 flag)
 if one or more of the signals in
 .I mask
 is pending for the process.
 .IP
 The signalfd file descriptor also supports the other file-descriptor
 multiplexing APIs:
 .BR pselect (2),
 .BR ppoll (2),
 and
 .BR epoll (7).
 .TP
 .BR close (2)
 When the file descriptor is no longer required it should be closed.
 When all file descriptors associated with the same signalfd object
 have been closed, the resources for object are freed by the kernel.
 .SS The signalfd_siginfo structure
 The format of the
 .I signalfd_siginfo
 structure(s) returned by
 .BR read (2)s
 from a signalfd file descriptor is as follows:
 .in +0.5i
 .nf
 
 .\ FIXME Davide, a question: why rename these fields

[patch] rename signalfd_siginfo fields

2007-09-27 Thread Davide Libenzi

For Michael Kerrisk request, the following patch renames signalfd_siginfo 
fields in order to keep them consistent with the siginfo_t ones.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/signalfd.c|   44 ++--
 include/linux/signalfd.h |   32 
 2 files changed, 38 insertions(+), 38 deletions(-)

Index: linux-2.6.mod/fs/signalfd.c
===
--- linux-2.6.mod.orig/fs/signalfd.c2007-09-27 10:31:48.0 -0700
+++ linux-2.6.mod/fs/signalfd.c 2007-09-27 10:33:45.0 -0700
@@ -74,45 +74,45 @@
 * If you change siginfo_t structure, please be sure
 * this code is fixed accordingly.
 */
-   err |= __put_user(kinfo-si_signo, uinfo-signo);
-   err |= __put_user(kinfo-si_errno, uinfo-err);
-   err |= __put_user((short)kinfo-si_code, uinfo-code);
+   err |= __put_user(kinfo-si_signo, uinfo-ssi_signo);
+   err |= __put_user(kinfo-si_errno, uinfo-ssi_errno);
+   err |= __put_user((short) kinfo-si_code, uinfo-ssi_code);
switch (kinfo-si_code  __SI_MASK) {
case __SI_KILL:
-   err |= __put_user(kinfo-si_pid, uinfo-pid);
-   err |= __put_user(kinfo-si_uid, uinfo-uid);
+   err |= __put_user(kinfo-si_pid, uinfo-ssi_pid);
+   err |= __put_user(kinfo-si_uid, uinfo-ssi_uid);
break;
case __SI_TIMER:
-err |= __put_user(kinfo-si_tid, uinfo-tid);
-err |= __put_user(kinfo-si_overrun, uinfo-overrun);
-err |= __put_user((long)kinfo-si_ptr, uinfo-svptr);
+err |= __put_user(kinfo-si_tid, uinfo-ssi_tid);
+err |= __put_user(kinfo-si_overrun, uinfo-ssi_overrun);
+err |= __put_user((long) kinfo-si_ptr, uinfo-ssi_ptr);
break;
case __SI_POLL:
-   err |= __put_user(kinfo-si_band, uinfo-band);
-   err |= __put_user(kinfo-si_fd, uinfo-fd);
+   err |= __put_user(kinfo-si_band, uinfo-ssi_band);
+   err |= __put_user(kinfo-si_fd, uinfo-ssi_fd);
break;
case __SI_FAULT:
-   err |= __put_user((long)kinfo-si_addr, uinfo-addr);
+   err |= __put_user((long) kinfo-si_addr, uinfo-ssi_addr);
 #ifdef __ARCH_SI_TRAPNO
-   err |= __put_user(kinfo-si_trapno, uinfo-trapno);
+   err |= __put_user(kinfo-si_trapno, uinfo-ssi_trapno);
 #endif
break;
case __SI_CHLD:
-   err |= __put_user(kinfo-si_pid, uinfo-pid);
-   err |= __put_user(kinfo-si_uid, uinfo-uid);
-   err |= __put_user(kinfo-si_status, uinfo-status);
-   err |= __put_user(kinfo-si_utime, uinfo-utime);
-   err |= __put_user(kinfo-si_stime, uinfo-stime);
+   err |= __put_user(kinfo-si_pid, uinfo-ssi_pid);
+   err |= __put_user(kinfo-si_uid, uinfo-ssi_uid);
+   err |= __put_user(kinfo-si_status, uinfo-ssi_status);
+   err |= __put_user(kinfo-si_utime, uinfo-ssi_utime);
+   err |= __put_user(kinfo-si_stime, uinfo-ssi_stime);
break;
case __SI_RT: /* This is not generated by the kernel as of now. */
case __SI_MESGQ: /* But this is */
-   err |= __put_user(kinfo-si_pid, uinfo-pid);
-   err |= __put_user(kinfo-si_uid, uinfo-uid);
-   err |= __put_user((long)kinfo-si_ptr, uinfo-svptr);
+   err |= __put_user(kinfo-si_pid, uinfo-ssi_pid);
+   err |= __put_user(kinfo-si_uid, uinfo-ssi_uid);
+   err |= __put_user((long) kinfo-si_ptr, uinfo-ssi_ptr);
break;
default: /* this is just in case for now ... */
-   err |= __put_user(kinfo-si_pid, uinfo-pid);
-   err |= __put_user(kinfo-si_uid, uinfo-uid);
+   err |= __put_user(kinfo-si_pid, uinfo-ssi_pid);
+   err |= __put_user(kinfo-si_uid, uinfo-ssi_uid);
break;
}
 
Index: linux-2.6.mod/include/linux/signalfd.h
===
--- linux-2.6.mod.orig/include/linux/signalfd.h 2007-09-27 10:28:59.0 
-0700
+++ linux-2.6.mod/include/linux/signalfd.h  2007-09-27 10:31:31.0 
-0700
@@ -10,22 +10,22 @@
 
 
 struct signalfd_siginfo {
-   __u32 signo;
-   __s32 err;
-   __s32 code;
-   __u32 pid;
-   __u32 uid;
-   __s32 fd;
-   __u32 tid;
-   __u32 band;
-   __u32 overrun;
-   __u32 trapno;
-   __s32 status;
-   __s32 svint;
-   __u64 svptr;
-   __u64 utime;
-   __u64 stime;
-   __u64 addr;
+   __u32 ssi_signo;
+   __s32 ssi_errno;
+   __s32 ssi_code;
+   __u32 ssi_pid;
+   __u32 ssi_uid;
+   __s32 ssi_fd;
+   __u32 ssi_tid;
+   __u32

[patch] anon-inodes use open coded atomic_inc for the shared inode

2007-09-27 Thread Davide Libenzi
Since we know the shared inode count is always 0, we can avoid igrab() 
and use an open coded atomic_inc().


Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/anon_inodes.c |   25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

Index: linux-2.6.mod/fs/anon_inodes.c
===
--- linux-2.6.mod.orig/fs/anon_inodes.c 2007-09-27 11:26:57.0 -0700
+++ linux-2.6.mod/fs/anon_inodes.c  2007-09-27 11:51:42.0 -0700
@@ -76,7 +76,6 @@
 {
struct qstr this;
struct dentry *dentry;
-   struct inode *inode;
struct file *file;
int error, fd;
 
@@ -86,15 +85,9 @@
if (!file)
return -ENFILE;
 
-   inode = igrab(anon_inode_inode);
-   if (IS_ERR(inode)) {
-   error = PTR_ERR(inode);
-   goto err_put_filp;
-   }
-
error = get_unused_fd();
if (error  0)
-   goto err_iput;
+   goto err_put_filp;
fd = error;
 
/*
@@ -108,14 +101,22 @@
dentry = d_alloc(anon_inode_mnt-mnt_sb-s_root, this);
if (!dentry)
goto err_put_unused_fd;
+
+   /*
+* We know the anon_inode inode count is always greater than zero,
+* so we can avoid doing an igrab() and we can use an open-coded
+* atomic_inc().
+*/
+   atomic_inc(anon_inode_inode-i_count);
+
dentry-d_op = anon_inodefs_dentry_operations;
/* Do not publish this dentry inside the global dentry hash table */
dentry-d_flags = ~DCACHE_UNHASHED;
-   d_instantiate(dentry, inode);
+   d_instantiate(dentry, anon_inode_inode);
 
file-f_path.mnt = mntget(anon_inode_mnt);
file-f_path.dentry = dentry;
-   file-f_mapping = inode-i_mapping;
+   file-f_mapping = anon_inode_inode-i_mapping;
 
file-f_pos = 0;
file-f_flags = O_RDWR;
@@ -127,14 +128,12 @@
fd_install(fd, file);
 
*pfd = fd;
-   *pinode = inode;
+   *pinode = anon_inode_inode;
*pfile = file;
return 0;
 
 err_put_unused_fd:
put_unused_fd(fd);
-err_iput:
-   iput(inode);
 err_put_filp:
put_filp(file);
return error;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: F_DUPFD_CLOEXEC implementation

2007-09-28 Thread Davide Libenzi
On Fri, 28 Sep 2007, Ulrich Drepper wrote:

 One more small change to extend the availability of creation of
 file descriptors with FD_CLOEXEC set.  Adding a new command to
 fcntl() requires no new system call and the overall impact on
 code size if minimal.
 
 If this patch gets accepted we will also add this change to the
 next revision of the POSIX spec.
 
 To test the patch, use the following little program.  Adjust the
 value of F_DUPFD_CLOEXEC appropriately.

I think new system calls would have been a cleaner way to accomplish this. 
The small pill at a time may have better chance to go in, but will 
likely result in an uglier userspace interface.
In any case, this is better than *nothing*, if it makes it easier to use 
fds inside system libraries.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch] i386: remove comment about barriers

2007-09-29 Thread Davide Libenzi
On Sat, 29 Sep 2007, Nick Piggin wrote:

 [ This is true for x86's sfence/lfence, but raises a question about Linux's
 memory barriers. Does anybody expect that a sequence of smp_wmb and smp_rmb
 would ever provide a full smp_mb barrier? I've always assumed no, but I
 don't know if it is actually documented? ]

No, that can't be. rmb+wmb can't be considered a full mb. There was a 
recent discussion about this in the thread originated by peterz scalable 
rw_mutex patches.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-30 Thread Davide Libenzi
On Sun, 30 Sep 2007, Oleg Nesterov wrote:

 Ah, but I asked the different question. We must see CPU 1's stores by
 definition, but what about CPU 0's stores (which could be seen by CPU 1)?
 
 Let's take a real life example,
 
 A = B = X = 0;
 P = Q = A;
 
 CPU_0   CPU_1   CPU_2
 
 P = B; *P = 1; if (X) {
 wmb();  rmb();
 X = 1;  BUG_ON(*P != 1  *Q != 1);
 }
 
 So, is it possible that CPU_1 sees P == B, but CPU_2 sees P == A ?

That can't be. CPU_2 sees X=1, that happened after (or same time at most - 
from a cache inv. POV) to *P=1, that must have happened after P=B (in 
order for *P to assign B). So P=B happened, from a pure time POV, before 
the rmb(), and the rmb() should guarantee that CPU_2 sees P=B too.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: F_DUPFD_CLOEXEC implementation

2007-09-30 Thread Davide Libenzi
On Sun, 30 Sep 2007, Denys Vlasenko wrote:

 Hi Ulrich,
 
 On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
  One more small change to extend the availability of creation of
  file descriptors with FD_CLOEXEC set.  Adding a new command to
  fcntl() requires no new system call and the overall impact on
  code size if minimal.
 
 Tangential question: do you have any idea how userspace can
 safely do nonblocking read or write on a potentially-shared fd?
 
 IIUC, currently it cannot be done without races:
 
 old_flags = fcntl(fd, F_GETFL);
 ...other process may change flags!...
 fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
 read(fd, ...)
 ...other process may see flags changed under its feet!...
 fcntl(fd, F_SETFL, old_flags);
 
 Can this be fixed?

I'm not sure I understood correctly your use case. But, if you have two 
processes/threads randomly switching O_NONBLOCK on/off, your problems 
arise not only the F_SETFL time.
If one of the tasks is not expecting an fd to be O_NONBLOCK, that will 
likely end up not handling correctly read/write-miss situations.
In that case it'd be better to keep the fd as O_NONBLOCK, and manually 
create blocking behaviour (when needed) with poll+read/write.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: F_DUPFD_CLOEXEC implementation

2007-09-30 Thread Davide Libenzi
On Mon, 1 Oct 2007, Denys Vlasenko wrote:

 My use case is: I want to do a nonblocking read on descriptor 0 (stdin).
 It may be a pipe or a socket.
 
 There may be other processes which share this descriptor with me,
 I simply cannot know that. And they, too, may want to do reads on it.
 
 I want to do nonblocking read in such a way that neither those other
 processes will ever see fd switching to O_NONBLOCK and back, and
 I also want to be safe from other processes doing the same.
 
 I don't see how this can be done using standard unix primitives.

Indeed. You could simulate non-blocking using poll with zero timeout, but 
if another task may read/write on it, your following read/write may end up 
blocking even after a poll returned the required events.
One way to solve this would be some sort of readx/writex where you pass an 
extra flags parameter (this could be done with sys_indirect, assuming 
we'll ever get that mainline) where you specify the non-blocking 
requirement for-this-call, and not as global per-file flag. Then, of 
course, you'll have to modify all the file-f_flags  O_NONBLOCK tests 
(and there are many of them) to check for that flag too (that can be a 
per task_struct flag).



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi
On Sun, 30 Sep 2007, Paul E. McKenney wrote:

 On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
  On Sun, 30 Sep 2007, Oleg Nesterov wrote:
  
   Ah, but I asked the different question. We must see CPU 1's stores by
   definition, but what about CPU 0's stores (which could be seen by CPU 1)?
   
   Let's take a real life example,
   
   A = B = X = 0;
   P = Q = A;
   
   CPU_0   CPU_1   CPU_2
   
   P = B; *P = 1; if (X) {
   wmb();  rmb();
   X = 1;  BUG_ON(*P != 1  *Q != 1);
   }
   
   So, is it possible that CPU_1 sees P == B, but CPU_2 sees P == A ?
  
  That can't be. CPU_2 sees X=1, that happened after (or same time at most - 
  from a cache inv. POV) to *P=1, that must have happened after P=B (in 
  order for *P to assign B). So P=B happened, from a pure time POV, before 
  the rmb(), and the rmb() should guarantee that CPU_2 sees P=B too.
 
 Actually, CPU designers have to go quite a ways out of their way to
 prevent this BUG_ON from happening.  One way that it would happen
 naturally would be if the cache line containing P were owned by CPU 2,
 and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
 here is what could happen given careless or sadistic CPU designers:

Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
local CPU load ordering only, things are different, clearly. But ...



 
 o CPU 0 stores B to P, but misses the cache, so puts the
   result in the store buffer.  This means that only CPUs 0 and 1
   can see it.
 
 o CPU 1 fetches P, and sees B, so stores a 1 to B.  Again,
   this value for P is visible only to CPUs 0 and 1.
 
 o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
   in order.  But it does nothing about CPU 0's stores, nor about CPU
   1's loads, for that matter (and the only reason that POWER ends
   up working the way you would like is because wmb() turns into
   sync rather than the eieio instruction that would have been
   used for smp_wmb() -- which is maybe what Oleg was thinking of,
   but happened to abbreviate.  If my analysis is buggy, Anton and
   Paulus will no doubt correct me...)

If a store buffer is shared between CPU_0 and CPU_1, it is very likely 
that a sync done on CPU_1 is going to sync even CPU_0 stores that are 
held in the buffer at the time of CPU_1's sync.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: F_DUPFD_CLOEXEC implementation

2007-10-01 Thread Davide Libenzi
On Mon, 1 Oct 2007, Denys Vlasenko wrote:

 On Monday 01 October 2007 19:16, Al Viro wrote:
  * it's on a bunch of cyclic lists.  Have its neighbor
  go away while you are doing all that crap = boom
  * there's that thing call current position...  It gets buggered.
  * overwriting it while another task might be in the middle of
  syscall involving it = boom
 
 Hm, I suspected that it's herecy. Any idea how to do it cleanly?
 
  * non-cooperative tasks reading *in* *parallel* from the same
  opened file are going to have a lot more serious problems than agreeing
  on O_NONBLOCK anyway, so I really don't understand what the hell is that 
  for.
 
 They don't even need to read in parallel, just having shared fd is enough.
 Think about pipes, sockets and terminals. A real-world scenario:
 
 * a process started from shell (interactive or shell script)
 * it sets O_NONBLOCK and does a read from fd 0...
 * it gets killed (kill -9, whatever)
 * shell suddenly has it's fd 0 in O_NONBLOCK mode
 * shell and all subsequent commands started from it unexpectedly have
   O_NONBLOCKed stdin.

I told you how in the previous email. You cannot use the:

1) set O_NONBLOCK
2) read/write
3) unset O_NONBLOCK

in a racy-free fashion, w/out wrapping it with a lock (thing that we 
don't want to do).



PS: send/recv are socket functions, and you really don't want to overload 
them for other fds.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi
On Mon, 1 Oct 2007, Paul E. McKenney wrote:

 That would indeed be one approach that CPU designers could take to
 avoid being careless or sadistic.  ;-)

That'd be the easier (unique maybe) approach too for them, from an silicon 
complexity POV. Distinguishing between different CPUs stores once inside a 
shared store buffer, would require tagging them in some way. That'd defeat 
most of the pros of having a shared store buffer ;)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: F_DUPFD_CLOEXEC implementation

2007-10-02 Thread Davide Libenzi
On Tue, 2 Oct 2007, Denys Vlasenko wrote:

 I have following proposals:
 
 * make recv(..., MSG_DONTWAIT) work on any fd
 
 Sounds neat, but not trivial to implement in current kernel.

This is mildly ugly, if you ask me. Those are socket functions, and the 
flags parameter contain some pretty specific network meanings.



 * new fcntl command F_DUPFL: fcntl(fd, F_DUPFL, n):
   Analogous to F_DUPFD, but gives you *unshared* copy of the fd.
   Further seeks, fcntl(fd, F_SETFL, O_NONBLOCK), etc won't affect
   any other process.

You'll need an ad-hoc copy function though, since your memcpy-based one is 
gonna explode even before memcpy returns ;) You'll have problems with 
ref-counting too. And that layer is not designed to cleanly support that 
operation.
Unfortunately the clean solution would involve changing a whole bunch of 
code, and I don't feel exactly sure it'd be worth it.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC + PATCH] signalfd simplification

2007-09-02 Thread Davide Libenzi
On Sun, 2 Sep 2007, Oleg Nesterov wrote:

 We can optimize this later, using a clever wait_queue_func_t if needed.

Good idea. Will do ...


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-04 Thread Davide Libenzi
On Tue, 4 Sep 2007, Andrew Morton wrote:

  On Tue, 04 Sep 2007 10:03:56 +0200 Michael Kerrisk [EMAIL PROTECTED] 
  wrote:
  Davide,
  
   Davide -- ping!  Can you please offer your comments about this change, 
   and
   also thoughts on Jon's and my comments about a more radical API change
   later  in this thread.
   
   IMO the complexity of the resulting API (and resulting patch), and the 
   ABI 
   change, is not justified by the added value.
  
  Neither of the proposed APIs (either my multiplexed version of timerfd()
  or Jon's/my idea of using three system calls (like POSIX timers), or
  the notion of timerfd() integrated with POSIX timers) is more
  complicated than the existing POSIX timers API.
  
  The ABI change doesn't really matter, since timerfd() was broken in 2.6.22
  anyway.
  
  Both previous APIs provided the features I have described provide:
  
  * the ability to fetch the old timer value when applying
a new setting
  
  * the ability to non-destructively fetch the amount of time remaining
on a timer.
  
  This is clearly useful for timers -- but you have not explained why
  you think this is not necessary for timerfd timers.
 
 wakes up
 
 I'd have thought that the existing stuff would be near-useless without the
 capabilities which you describe?

Useless like it'd be a motorcycle w/out a cup-holder :)
Seriously, the ability to get the previous values from something could 
have a meaning if this something is a shared global resource (like signals 
for example). In the timerfd case this makes little sense, since you can 
create as many timerfd as you like and you do not need to share a single 
one by changing/restoring the original context.
On top of that, the cup-holder addition would cost in terms of API clarity 
(or in terms of two additional system calls in the other case), and in 
terms of kernel code footprint. Costs that IMO are not balanced by the 
added values.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-04 Thread Davide Libenzi
On Tue, 4 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
   wakes up
   
   I'd have thought that the existing stuff would be near-useless without
   the capabilities which you describe?
  
  Useless like it'd be a motorcycle w/out a cup-holder :)
  Seriously, the ability to get the previous values from something could 
  have a meaning if this something is a shared global resource (like 
  signals
  for example). In the timerfd case this makes little sense, since you can 
  create as many timerfd as you like and you do not need to share a single 
  one by changing/restoring the original context.
 
 However, one can have multipe POSIX timers, just as you can 
 have multiple timerfd timers; nevertheless POSIX timers provide
 the get and get-while-setting functionality.

The fact that POSIX defined a certain API in a given way, does not 
automatically mean that every other API has to look exactly like that.
POSIX has the tendency to bloat things up at times ;)



  and in terms of kernel code footprint.
 
 Not sure what your concern is here.  The total amount of 
 new code for all of these options is pretty small.

From your patch:

fs/compat.c  |   34 --
fs/timerfd.c |  147 +++
include/linux/compat.h   |3 
include/linux/syscalls.h |3 
4 files changed, 153 insertions(+), 34 deletions(-)

And the API definition becomes pretty messy. The other way is to add new 
system calls. 120+ lines of code more of new system calls wouldn't even be 
a problem in itself, if the added value was there.
IMO, as I already said, the added value does not justify them.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-04 Thread Davide Libenzi
On Tue, 4 Sep 2007, Michael Kerrisk wrote:

  Useless like it'd be a motorcycle w/out a cup-holder :)
  Seriously, the ability to get the previous values from something could 
  have a meaning if this something is a shared global resource (like 
  signals
  for example). In the timerfd case this makes little sense, since you can 
  create as many timerfd as you like and you do not need to share a single 
  one by changing/restoring the original context.
 
 Davide,
 
 As I think about this more, I see more problems with
 your argument.  timerfd needs the ability to get and 
 get-while-setting just as much as the earlier APIs.
 Consider a library that creates a timerfd file descriptor that
 is handed off to an application: that library may want
 to modify the timer settings without having to create a
 new file descriptor (the app mey not be able to be told about
 the new fd).  Your argument just doesn't hold, AFAICS.

Such hypotethical library, in case it really wanted to offer such 
functionality, could simply return an handle instead of the raw fd, and 
take care of all that stuff in userspace.
Again, mimicking POSIX APIs doesn't always take you in the right place.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi
On Wed, 5 Sep 2007, Michael Kerrisk wrote:

 Davide,

A Michael!


   As I think about this more, I see more problems with
   your argument.  timerfd needs the ability to get and 
   get-while-setting just as much as the earlier APIs.
   Consider a library that creates a timerfd file descriptor that
   is handed off to an application: that library may want
   to modify the timer settings without having to create a
   new file descriptor (the app mey not be able to be told about
   the new fd).  Your argument just doesn't hold, AFAICS.
  
  Such hypotethical library, in case it really wanted to offer such 
  functionality, could simply return an handle instead of the raw fd, and 
  take care of all that stuff in userspace.
 
 Did I miss something?  Is it not the case that as soon as the
 library returns a handle, rather than an fd, then the whole
 advantage of timerfd() (being able to select/poll/epoll on 
 the timer as well as other fds) is lost?  

Why? The handle would simply be a little struct where the timerfd fd is 
stored, and a XXX_getfd() would return it.
So my point is, I doubt such functionalities are really needed, and I also 
argue that the kernel is the best place for such wrapper code to go.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi
On Wed, 5 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
 As I think about this more, I see more problems with
 your argument.  timerfd needs the ability to get and 
 get-while-setting just as much as the earlier APIs.
 Consider a library that creates a timerfd file descriptor that
 is handed off to an application: that library may want
 to modify the timer settings without having to create a
 new file descriptor (the app mey not be able to be told about
 the new fd).  Your argument just doesn't hold, AFAICS.

Such hypotethical library, in case it really wanted to offer such 
functionality, could simply return an handle instead of the raw fd,
and take care of all that stuff in userspace.
   
   Did I miss something?  Is it not the case that as soon as the
   library returns a handle, rather than an fd, then the whole
   advantage of timerfd() (being able to select/poll/epoll on 
   the timer as well as other fds) is lost?  
  
  Why? The handle would simply be a little struct where the timerfd fd is 
  stored, and a XXX_getfd() would return it.
  So my point is, I doubt such functionalities are really needed, and I 
  also argue that the kernel is the best place for such wrapper code
  to go.
 
 So what happens if one thread (via the library) wants modify
 a timer's settings at the same timer as another thread is 
 select()ing on it?  The first thread can't do this by creating
 a new timerfd timer, since it wants to affect the select()
 in the other thread?

It can be done w/out any problems. The select thread will be notified 
whenever the new timer setting expires.


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi
On Thu, 6 Sep 2007, Michael Kerrisk wrote:

 Hi Davide,
 
   As I think about this more, I see more problems with
   your argument.  timerfd needs the ability to get and 
   get-while-setting just as much as the earlier APIs.
   Consider a library that creates a timerfd file descriptor that
   is handed off to an application: that library may want
   to modify the timer settings without having to create a
   new file descriptor (the app mey not be able to be told about
   the new fd).  Your argument just doesn't hold, AFAICS.
  
  Such hypotethical library, in case it really wanted to offer such 
  functionality, could simply return an handle instead of the raw
  fd, and take care of all that stuff in userspace.
 
 Did I miss something?  Is it not the case that as soon as the
 library returns a handle, rather than an fd, then the whole
 advantage of timerfd() (being able to select/poll/epoll on 
 the timer as well as other fds) is lost?  

Why? The handle would simply be a little struct where the timerfd fd
is 
stored, and a XXX_getfd() would return it.
So my point is, I doubt such functionalities are really needed, and I 
also argue that the kernel is the best place for such wrapper code
to go.
   
   So what happens if one thread (via the library) wants modify
   a timer's settings at the same timer as another thread is 
   select()ing on it?  The first thread can't do this by creating
   a new timerfd timer, since it wants to affect the select()
   in the other thread?
  
  It can be done w/out any problems. The select thread will be notified 
  whenever the new timer setting expires.
 
 We are going in circles here.  I think you are missing my point.
 Consider the following
 
 [[
 Thread A: calls library function which creates a timerfd file
 descriptor.
 
 Thread B: calls select() on the timerfd file descriptor.
 
 Thread A: calls library function which wants to:
a) modify timer settings, and retrieve copy of current timer
   settings, and later
b) restore old timer settings.
 ]]
 
 This seems a quite reasonable use-case to me, and the existing
 interface simply can't support it.

Quite reasonable? :)
I honestly doubt it, but anyway. Modulo error checking:

struct tfd {
int fd, clockid;
struct itimerspec ts;
};

struct tfd *tfd_create(int clockid, int flags, const struct itimerspec *ts) {
struct tfd *th;
th = malloc(sizeof(*th));
th-clockid = clockid;
th-ts = *ts;
th-fd = timerfd(-1, clockid, flags, ts);
return th;
}

void tfd_close(struct tfd *th) {
close(th-fd);
free(th);
}

int tfd_getfd(const struct tfd *th) {
return th-fd;
}

int tfd_gettime(const struct tfd *th, int *clockid, struct itimerspec *ts) {
*clockid = th-clockid;
*ts = th-ts;
return 0;
}

int tfd_settime(struct tfd *th, int clockid, int flags,
const struct itimerspec *ts) {
th-fd = timerfd(th-fd, clockid, flags, ts);
th-clockid = clockid;
th-ts = *ts;
return 0;
}

Wrap the get/set with a mutex in case you plan to shoot yourself in a foot 
by doing get/set from multiple threads ;)
So, once again:

- I sincerly doubt the above is common usage/design patters for timerfds

  * timerfds are not a common global resource, ala signals, that requires 
get+set+restore pattern - you can have many of them set to different 
times

- Those IMO *very* special use cases can be handled in userspace with few 
  lines of code, *if* really needed



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


<    1   2   3   4   5   6   7   8   9   10   >