Re: [PATCH 2.6.24-mm1] error compiling net driver NE2000/NE1000

2008-02-18 Thread Pierre Peiffer
Hi,

I don't know whether I should raise this again but, as I didn't find any
discussion, it's probably better to mention it: the compile error reported
below (or here: http://lkml.org/lkml/2008/2/4/173 ) does not seem to be
fixed in 2.6.25-rc2-mm1...  So, I don't know whether a fix is in progress
somewhere or whether the bug has fallen into a black hole.

(In the original mail, I proposed a patch as a quick fix, but I don't know
whether it can be considered a definitive correction or not.)

Thanks,

P.

Andrew Morton wrote:
> On Mon, 4 Feb 2008 16:29:21 +0100
> Pierre Peiffer <[EMAIL PROTECTED]> wrote:
> 
>> Hi,
>>
>>  When I compile the kernel 2.6.24-mm1 with:
>> CONFIG_NET_ISA=y
>> CONFIG_NE2000=y
>>
>> I have the following compile error:
>> ...
>>   GEN .version
>>   CHK include/linux/compile.h
>>   UPD include/linux/compile.h
>>   CC  init/version.o
>>   LD  init/built-in.o
>>   LD  .tmp_vmlinux1
>> drivers/built-in.o: In function `ne_block_output':
>> linux-2.6.24-mm1/drivers/net/ne.c:797: undefined reference to `NS8390_init'
>> drivers/built-in.o: In function `ne_drv_resume':
>> linux-2.6.24-mm1/drivers/net/ne.c:858: undefined reference to `NS8390_init'
>> drivers/built-in.o: In function `ne_probe1':
>> linux-2.6.24-mm1/drivers/net/ne.c:539: undefined reference to `NS8390_init'
>> make[1]: *** [.tmp_vmlinux1] Error 1
>> make: *** [sub-make] Error 2
> 
> Thanks for reporting this.
> 
>> As I saw that the file 8390p.c is compiled for this driver, but not the file
>> 8390.c, which contains the function NS8390_init(), I fixed this error with
>> the following patch.
> 
> Alan's
> 8390-split-8390-support-into-a-pausing-and-a-non-pausing-driver-core.patch
> would be a prime suspect.  I assume this bug isn't present in mainline or
> in 2.6.24?
> 
>> As NS8390p_init() does the same thing as NS8390_init(), I suppose that
>> this is the right fix?
>>
>> Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
>> ---
>>  drivers/net/ne.c |6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> Index: b/drivers/net/ne.c
>> ===
>> --- a/drivers/net/ne.c
>> +++ b/drivers/net/ne.c
>> @@ -536,7 +536,7 @@ static int __init ne_probe1(struct net_d
>>  #ifdef CONFIG_NET_POLL_CONTROLLER
>>  dev->poll_controller = eip_poll;
>>  #endif
>> -NS8390_init(dev, 0);
>> +NS8390p_init(dev, 0);
>>  
>>  ret = register_netdev(dev);
>>  if (ret)
>> @@ -794,7 +794,7 @@ retry:
>>  if (time_after(jiffies, dma_start + 2*HZ/100)) {
>> /* 20ms */
>>  printk(KERN_WARNING "%s: timeout waiting for Tx 
>> RDC.\n", dev->name);
>>  ne_reset_8390(dev);
>> -    NS8390_init(dev,1);
>> +NS8390p_init(dev,1);
>>  break;
>>  }
>>  
>> @@ -855,7 +855,7 @@ static int ne_drv_resume(struct platform
>>  
>>  if (netif_running(dev)) {
>>  ne_reset_8390(dev);
>> -NS8390_init(dev, 1);
>> +NS8390p_init(dev, 1);
>>  netif_device_attach(dev);
>>  }
>>  return 0;
> 
> 
> 
> 

-- 
Pierre Peiffer


Re: [PATCH 2.6.24-mm1 0/8] (resend) IPC: code rewrite

2008-02-15 Thread Pierre Peiffer


Andi Kleen wrote:
> [EMAIL PROTECTED] writes:
> 
>>  This is a resend of the first part of the patchset sent two weeks
>> ago. This is the part about IPC which (again) proposes to consolidate
>> some parts of the existing code.
>>
>>  It does not change the behavior of the existing code, but
>> improves it in terms of readability and maintainability as it consolidates it
>> a little. As there were no objections, I think you can include them in your
>> -mm tree.
>>
>>  The patchset applies on top of "2.6.24-mm1 + previous patches about
>> IPC" sent over the last days (ie Nadia's patches + mine).
> 
> While I have not read everything in detail, from a quick overview
> the whole patch series looks like a nice and valuable cleanup to me.
> 
> I was a bit sceptical about all the interface enhancements your original
> patchkit had, but this one looks just fine.
> 

Thanks, Andi, for spending time on this review.
All kinds of comments (positive or negative) are always welcome to make progress
but, of course, I particularly appreciate such positive feedback ;)

-- 
Pierre


Re: [PATCH 2.6.24-mm1 1/8] (resend) IPC/semaphores: code factorisation

2008-02-13 Thread Pierre PEIFFER
On Feb 13, 2008 9:07 PM, Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> On Tue, Feb 12, 2008 at 05:13:41PM +0100, [EMAIL PROTECTED] wrote:
> > Trivial patch which adds some small locking functions and makes use of them
> > to factorize some parts of the code and to make it cleaner.
>
> What's wrong with consolidation activity in general is that one needs to
> follow tags many times to realise what on earth a function really does.

Funny...
What's right with consolidation in general is that it spares the readers
from reading the same piece of code again and again and helps them focus
on what the code really does.
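
For instance, patch 1/8 turns the recurring three-line sequence

    ipc_lock_by_ptr(&sma->sem_perm);
    ipc_rcu_putref(sma);
    ipc_unlock(&(sma)->sem_perm);

into a single, named call: sem_putref(sma).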

-- 
Pierre


[PATCH 2.6.24-mm1 8/8] (resend) IPC: consolidate all xxxctl_down() functions

2008-02-12 Thread pierre . peiffer
semctl_down(), msgctl_down() and shmctl_down() are used to handle the same
set of commands for each kind of IPC. They all start by doing the same job
(they retrieve the ipc and perform some permission checks) before handling
the commands on their own.

This patch consolidates this by moving these common pieces of code into one
function called ipcctl_pre_down().
It simplifies the xxxctl_down() functions a little and improves their
maintainability.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 ipc/msg.c  |   48 +---
 ipc/sem.c  |   42 --
 ipc/shm.c  |   42 --
 ipc/util.c |   51 +++
 ipc/util.h |2 ++
 5 files changed, 66 insertions(+), 119 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -141,21 +141,6 @@ void __init sem_init (void)
 }
 
 /*
- * This routine is called in the paths where the rw_mutex is held to protect
- * access to the idr tree.
- */
-static inline struct sem_array *sem_lock_check_down(struct ipc_namespace *ns,
-   int id)
-{
-   struct kern_ipc_perm *ipcp = ipc_lock_check_down(&sem_ids(ns), id);
-
-   if (IS_ERR(ipcp))
-   return (struct sem_array *)ipcp;
-
-   return container_of(ipcp, struct sem_array, sem_perm);
-}
-
-/*
  * sem_lock_(check_) routines are called in the paths where the rw_mutex
  * is not held.
  */
@@ -878,31 +863,12 @@ static int semctl_down(struct ipc_namesp
if (copy_semid_from_user(&semid64, arg.buf, version))
return -EFAULT;
}
-   down_write(&sem_ids(ns).rw_mutex);
-   sma = sem_lock_check_down(ns, semid);
-   if (IS_ERR(sma)) {
-   err = PTR_ERR(sma);
-   goto out_up;
-   }
-
-   ipcp = &sma->sem_perm;
 
-   err = audit_ipc_obj(ipcp);
-   if (err)
-   goto out_unlock;
+   ipcp = ipcctl_pre_down(&sem_ids(ns), semid, cmd, &semid64.sem_perm, 0);
+   if (IS_ERR(ipcp))
+   return PTR_ERR(ipcp);
 
-   if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(0, semid64.sem_perm.uid,
-semid64.sem_perm.gid,
-semid64.sem_perm.mode);
-   if (err)
-   goto out_unlock;
-   }
-   if (current->euid != ipcp->cuid && 
-   current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) {
-   err=-EPERM;
-   goto out_unlock;
-   }
+   sma = container_of(ipcp, struct sem_array, sem_perm);
 
err = security_sem_semctl(sma, cmd);
if (err)
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -824,6 +824,57 @@ void ipc_update_perm(struct ipc64_perm *
| (in->mode & S_IRWXUGO);
 }
 
+/**
+ * ipcctl_pre_down - retrieve an ipc and check permissions for some IPC_XXX cmd
+ * @ids:  the table of ids where to look for the ipc
+ * @id:   the id of the ipc to retrieve
+ * @cmd:  the cmd to check
+ * @perm: the permission to set
+ * @extra_perm: one extra permission parameter used by msq
+ *
+ * This function does some common audit and permissions check for some IPC_XXX
+ * cmd and is called from semctl_down, shmctl_down and msgctl_down.
+ * It must be called without any lock held and
+ *  - retrieves the ipc with the given id in the given table.
+ *  - performs some audit and permission check, depending on the given cmd
+ *  - returns the ipc with both ipc and rw_mutex locks held in case of success
+ *or an err-code without any lock held otherwise.
+ */
+struct kern_ipc_perm *ipcctl_pre_down(struct ipc_ids *ids, int id, int cmd,
+ struct ipc64_perm *perm, int extra_perm)
+{
+   struct kern_ipc_perm *ipcp;
+   int err;
+
+   down_write(&ids->rw_mutex);
+   ipcp = ipc_lock_check_down(ids, id);
+   if (IS_ERR(ipcp)) {
+   err = PTR_ERR(ipcp);
+   goto out_up;
+   }
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(extra_perm, perm->uid,
+perm->gid, perm->mode);
+   if (err)
+   goto out_unlock;
+   }
+   if (current->euid == ipcp->cuid ||
+   current->euid == ipcp->uid || capable(CAP_SYS_ADMIN))
+   return ipcp;
+
+   err = -EPERM;
+out_unlock:
+   ipc_unlock(ipcp);
+out_up:
+   up_write(&ids->rw_mutex);
+   return ERR_PTR(err);
+}

[PATCH 2.6.24-mm1 7/8] (resend) IPC: introduce ipc_update_perm()

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

The IPC_SET command performs the same permission setting for all IPCs.
This patch introduces a common ipc_update_perm() function to update these
permissions and makes use of it for all IPCs.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c  |5 +
 ipc/sem.c  |5 +
 ipc/shm.c  |5 +
 ipc/util.c |   13 +
 ipc/util.h |1 +
 5 files changed, 17 insertions(+), 12 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -484,10 +484,7 @@ static int msgctl_down(struct ipc_namesp
 
msq->q_qbytes = msqid64.msg_qbytes;
 
-   ipcp->uid = msqid64.msg_perm.uid;
-   ipcp->gid = msqid64.msg_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
-(S_IRWXUGO & msqid64.msg_perm.mode);
+   ipc_update_perm(&msqid64.msg_perm, ipcp);
msq->q_ctime = get_seconds();
/* sleeping receivers might be excluded by
 * stricter permissions.
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -913,10 +913,7 @@ static int semctl_down(struct ipc_namesp
freeary(ns, ipcp);
goto out_up;
case IPC_SET:
-   ipcp->uid = semid64.sem_perm.uid;
-   ipcp->gid = semid64.sem_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
-   | (semid64.sem_perm.mode & S_IRWXUGO);
+   ipc_update_perm(&semid64.sem_perm, ipcp);
sma->sem_ctime = get_seconds();
break;
default:
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -657,10 +657,7 @@ static int shmctl_down(struct ipc_namesp
do_shm_rmid(ns, ipcp);
goto out_up;
case IPC_SET:
-   ipcp->uid = shmid64.shm_perm.uid;
-   ipcp->gid = shmid64.shm_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
-   | (shmid64.shm_perm.mode & S_IRWXUGO);
+   ipc_update_perm(&shmid64.shm_perm, ipcp);
shp->shm_ctim = get_seconds();
break;
default:
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -811,6 +811,19 @@ int ipcget(struct ipc_namespace *ns, str
return ipcget_public(ns, ids, ops, params);
 }
 
+/**
+ * ipc_update_perm - update the permissions of an IPC.
+ * @in:  the permission given as input.
+ * @out: the permission of the ipc to set.
+ */
+void ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out)
+{
+   out->uid = in->uid;
+   out->gid = in->gid;
+   out->mode = (out->mode & ~S_IRWXUGO)
+   | (in->mode & S_IRWXUGO);
+}
+
 #ifdef __ARCH_WANT_IPC_PARSE_VERSION
 
 
Index: b/ipc/util.h
===
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -112,6 +112,7 @@ struct kern_ipc_perm *ipc_lock(struct ip
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
+void ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
 
 #if defined(__ia64__) || defined(__x86_64__) || defined(__hppa__) || 
defined(__XTENSA__)
   /* On IA-64, we always use the "64-bit version" of the IPC structures.  */ 

-- 
Pierre Peiffer


[PATCH 2.6.24-mm1 6/8] (resend) IPC: get rid of the use of *_setbuf structures.

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

All IPCs make use of an intermediate *_setbuf structure to handle the
IPC_SET command. This is not really needed and, moreover, it complicates
the code a little.

This patch gets rid of it and directly uses the semid64_ds/
msqid64_ds/shmid64_ds structures.

In addition to removing one structure declaration, it also simplifies
and improves the common 64-bit path a little.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c |   51 ++-
 ipc/sem.c |   40 ++--
 ipc/shm.c |   41 ++---
 3 files changed, 46 insertions(+), 86 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -388,31 +388,14 @@ copy_msqid_to_user(void __user *buf, str
}
 }
 
-struct msq_setbuf {
-   unsigned long   qbytes;
-   uid_t   uid;
-   gid_t   gid;
-   mode_t  mode;
-};
-
 static inline unsigned long
-copy_msqid_from_user(struct msq_setbuf *out, void __user *buf, int version)
+copy_msqid_from_user(struct msqid64_ds *out, void __user *buf, int version)
 {
switch(version) {
case IPC_64:
-   {
-   struct msqid64_ds tbuf;
-
-   if (copy_from_user(&tbuf, buf, sizeof(tbuf)))
+   if (copy_from_user(out, buf, sizeof(*out)))
return -EFAULT;
-
-   out->qbytes = tbuf.msg_qbytes;
-   out->uid= tbuf.msg_perm.uid;
-   out->gid= tbuf.msg_perm.gid;
-   out->mode   = tbuf.msg_perm.mode;
-
return 0;
-   }
case IPC_OLD:
{
struct msqid_ds tbuf_old;
@@ -420,14 +403,14 @@ copy_msqid_from_user(struct msq_setbuf *
if (copy_from_user(&tbuf_old, buf, sizeof(tbuf_old)))
return -EFAULT;
 
-   out->uid= tbuf_old.msg_perm.uid;
-   out->gid= tbuf_old.msg_perm.gid;
-   out->mode   = tbuf_old.msg_perm.mode;
+   out->msg_perm.uid   = tbuf_old.msg_perm.uid;
+   out->msg_perm.gid   = tbuf_old.msg_perm.gid;
+   out->msg_perm.mode  = tbuf_old.msg_perm.mode;
 
if (tbuf_old.msg_qbytes == 0)
-   out->qbytes = tbuf_old.msg_lqbytes;
+   out->msg_qbytes = tbuf_old.msg_lqbytes;
else
-   out->qbytes = tbuf_old.msg_qbytes;
+   out->msg_qbytes = tbuf_old.msg_qbytes;
 
return 0;
}
@@ -445,12 +428,12 @@ static int msgctl_down(struct ipc_namesp
   struct msqid_ds __user *buf, int version)
 {
struct kern_ipc_perm *ipcp;
-   struct msq_setbuf setbuf;
+   struct msqid64_ds msqid64;
struct msg_queue *msq;
int err;
 
if (cmd == IPC_SET) {
-   if (copy_msqid_from_user(&setbuf, buf, version))
+   if (copy_msqid_from_user(&msqid64, buf, version))
return -EFAULT;
}
 
@@ -468,8 +451,10 @@ static int msgctl_down(struct ipc_namesp
goto out_unlock;
 
if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
-setbuf.mode);
+   err = audit_ipc_set_perm(msqid64.msg_qbytes,
+msqid64.msg_perm.uid,
+msqid64.msg_perm.gid,
+msqid64.msg_perm.mode);
if (err)
goto out_unlock;
}
@@ -491,18 +476,18 @@ static int msgctl_down(struct ipc_namesp
freeque(ns, ipcp);
goto out_up;
case IPC_SET:
-   if (setbuf.qbytes > ns->msg_ctlmnb &&
+   if (msqid64.msg_qbytes > ns->msg_ctlmnb &&
!capable(CAP_SYS_RESOURCE)) {
err = -EPERM;
goto out_unlock;
}
 
-   msq->q_qbytes = setbuf.qbytes;
+   msq->q_qbytes = msqid64.msg_qbytes;
 
-   ipcp->uid = setbuf.uid;
-   ipcp->gid = setbuf.gid;
+   ipcp->uid = msqid64.msg_perm.uid;
+   ipcp->gid = msqid64.msg_perm.gid;
ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
-(S_IRWXUGO & setbuf.mode);
+(S_IRWXUGO & msqid64.msg_perm.mode);
msq->

[PATCH 2.6.24-mm1 5/8] (resend) IPC/semaphores: remove one unused parameter from semctl_down()

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

semctl_down() takes one unused parameter: semnum.
This patch proposes to get rid of it.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 ipc/sem.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -880,8 +880,8 @@ static inline unsigned long copy_semid_f
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
-   int cmd, int version, union semun arg)
+static int semctl_down(struct ipc_namespace *ns, int semid,
+  int cmd, int version, union semun arg)
 {
struct sem_array *sma;
int err;
@@ -972,7 +972,7 @@ asmlinkage long sys_semctl (int semid, i
return err;
case IPC_RMID:
case IPC_SET:
-   err = semctl_down(ns,semid,semnum,cmd,version,arg);
+   err = semctl_down(ns, semid, cmd, version, arg);
return err;
default:
return -EINVAL;

-- 
Pierre Peiffer


[PATCH 2.6.24-mm1 3/8] (resend) IPC/message queues: introduce msgctl_down

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Currently, sys_msgctl is not easy to read.
This patch tries to improve that by introducing the msgctl_down function
to handle all commands requiring the rw_mutex to be taken in write mode
(ie IPC_SET and IPC_RMID for now). It is the equivalent of semctl_down
for message queues.

This greatly improves the readability of sys_msgctl and also harmonizes
the way these commands are handled among all IPCs.


Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c |  162 ++
 1 file changed, 89 insertions(+), 73 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -436,10 +436,95 @@ copy_msqid_from_user(struct msq_setbuf *
}
 }
 
-asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
+/*
+ * This function handles some msgctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
+static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
+  struct msqid_ds __user *buf, int version)
 {
struct kern_ipc_perm *ipcp;
-   struct msq_setbuf uninitialized_var(setbuf);
+   struct msq_setbuf setbuf;
+   struct msg_queue *msq;
+   int err;
+
+   if (cmd == IPC_SET) {
+   if (copy_msqid_from_user(&setbuf, buf, version))
+   return -EFAULT;
+   }
+
+   down_write(&msg_ids(ns).rw_mutex);
+   msq = msg_lock_check_down(ns, msqid);
+   if (IS_ERR(msq)) {
+   err = PTR_ERR(msq);
+   goto out_up;
+   }
+
+   ipcp = &msq->q_perm;
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
+setbuf.mode);
+   if (err)
+   goto out_unlock;
+   }
+
+   if (current->euid != ipcp->cuid &&
+   current->euid != ipcp->uid &&
+   !capable(CAP_SYS_ADMIN)) {
+   /* We _could_ check for CAP_CHOWN above, but we don't */
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   err = security_msg_queue_msgctl(msq, cmd);
+   if (err)
+   goto out_unlock;
+
+   switch (cmd) {
+   case IPC_RMID:
+   freeque(ns, ipcp);
+   goto out_up;
+   case IPC_SET:
+   if (setbuf.qbytes > ns->msg_ctlmnb &&
+   !capable(CAP_SYS_RESOURCE)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   msq->q_qbytes = setbuf.qbytes;
+
+   ipcp->uid = setbuf.uid;
+   ipcp->gid = setbuf.gid;
+   ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
+(S_IRWXUGO & setbuf.mode);
+   msq->q_ctime = get_seconds();
+   /* sleeping receivers might be excluded by
+* stricter permissions.
+*/
+   expunge_all(msq, -EAGAIN);
+   /* sleeping senders might be able to send
+* due to a larger queue size.
+*/
+   ss_wakeup(&msq->q_senders, 0);
+   break;
+   default:
+   err = -EINVAL;
+   }
+out_unlock:
+   msg_unlock(msq);
+out_up:
+   up_write(&msg_ids(ns).rw_mutex);
+   return err;
+}
+
+asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
+{
struct msg_queue *msq;
int err, version;
struct ipc_namespace *ns;
@@ -535,82 +620,13 @@ asmlinkage long sys_msgctl(int msqid, in
return success_return;
}
case IPC_SET:
-   if (!buf)
-   return -EFAULT;
-   if (copy_msqid_from_user(&setbuf, buf, version))
-   return -EFAULT;
-   break;
case IPC_RMID:
-   break;
+   err = msgctl_down(ns, msqid, cmd, buf, version);
+   return err;
default:
return  -EINVAL;
}
 
-   down_write(&msg_ids(ns).rw_mutex);
-   msq = msg_lock_check_down(ns, msqid);
-   if (IS_ERR(msq)) {
-   err = PTR_ERR(msq);
-   goto out_up;
-   }
-
-   ipcp = &msq->q_perm;
-
-   err = audit_ipc_obj(ipcp);
-   if (err)
-   goto out_unlock_up;
-   if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
-setbuf.mod

[PATCH 2.6.24-mm1 4/8] (resend) IPC/semaphores: move the rwmutex handling inside semctl_down

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

semctl_down is called with the rw_mutex (the one which protects the
list of ipcs) taken in write mode.
This patch moves the write-mode taking of this rw_mutex inside semctl_down.
This has the advantages of slightly reducing the window during which this
rw_mutex is taken, of clarifying sys_semctl, and finally of having a
behaviour coherent with [shm|msg]ctl_down.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/sem.c |   24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -875,6 +875,11 @@ static inline unsigned long copy_semid_f
}
 }
 
+/*
+ * This function handles some semctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
 static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
int cmd, int version, union semun arg)
 {
@@ -887,9 +892,12 @@ static int semctl_down(struct ipc_namesp
if(copy_semid_from_user (&setbuf, arg.buf, version))
return -EFAULT;
}
+   down_write(&sem_ids(ns).rw_mutex);
sma = sem_lock_check_down(ns, semid);
-   if (IS_ERR(sma))
-   return PTR_ERR(sma);
+   if (IS_ERR(sma)) {
+   err = PTR_ERR(sma);
+   goto out_up;
+   }
 
ipcp = &sma->sem_perm;
 
@@ -915,26 +923,22 @@ static int semctl_down(struct ipc_namesp
switch(cmd){
case IPC_RMID:
freeary(ns, ipcp);
-   err = 0;
-   break;
+   goto out_up;
case IPC_SET:
ipcp->uid = setbuf.uid;
ipcp->gid = setbuf.gid;
ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
| (setbuf.mode & S_IRWXUGO);
sma->sem_ctime = get_seconds();
-   sem_unlock(sma);
-   err = 0;
break;
default:
-   sem_unlock(sma);
err = -EINVAL;
-   break;
}
-   return err;
 
 out_unlock:
sem_unlock(sma);
+out_up:
+   up_write(&sem_ids(ns).rw_mutex);
return err;
 }
 
@@ -968,9 +972,7 @@ asmlinkage long sys_semctl (int semid, i
return err;
case IPC_RMID:
case IPC_SET:
-   down_write(&sem_ids(ns).rw_mutex);
err = semctl_down(ns,semid,semnum,cmd,version,arg);
-   up_write(&sem_ids(ns).rw_mutex);
    return err;
default:
return -EINVAL;

-- 
Pierre Peiffer


[PATCH 2.6.24-mm1 2/8] (resend) IPC/shared memory: introduce shmctl_down

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Currently, the way the different commands are handled in sys_shmctl
leads to some duplicated code.
This patch introduces the shmctl_down function to handle all the commands
requiring the rw_mutex to be taken in write mode (ie IPC_SET and IPC_RMID
for now). It is the equivalent of semctl_down for shared memory.

This removes some duplicated code for handling both of these commands
and harmonizes the way they are handled among all IPCs.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/shm.c |  160 +++---
 1 file changed, 72 insertions(+), 88 deletions(-)

Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -617,10 +617,78 @@ static void shm_get_stat(struct ipc_name
}
 }
 
-asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
+/*
+ * This function handles some shmctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
+static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+  struct shmid_ds __user *buf, int version)
 {
+   struct kern_ipc_perm *ipcp;
struct shm_setbuf setbuf;
struct shmid_kernel *shp;
+   int err;
+
+   if (cmd == IPC_SET) {
+   if (copy_shmid_from_user(&setbuf, buf, version))
+   return -EFAULT;
+   }
+
+   down_write(&shm_ids(ns).rw_mutex);
+   shp = shm_lock_check_down(ns, shmid);
+   if (IS_ERR(shp)) {
+   err = PTR_ERR(shp);
+   goto out_up;
+   }
+
+   ipcp = &shp->shm_perm;
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(0, setbuf.uid,
+setbuf.gid, setbuf.mode);
+   if (err)
+   goto out_unlock;
+   }
+
+   if (current->euid != ipcp->uid &&
+   current->euid != ipcp->cuid &&
+   !capable(CAP_SYS_ADMIN)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   err = security_shm_shmctl(shp, cmd);
+   if (err)
+   goto out_unlock;
+   switch (cmd) {
+   case IPC_RMID:
+   do_shm_rmid(ns, ipcp);
+   goto out_up;
+   case IPC_SET:
+   ipcp->uid = setbuf.uid;
+   ipcp->gid = setbuf.gid;
+   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
+   | (setbuf.mode & S_IRWXUGO);
+   shp->shm_ctim = get_seconds();
+   break;
+   default:
+   err = -EINVAL;
+   }
+out_unlock:
+   shm_unlock(shp);
+out_up:
+   up_write(&shm_ids(ns).rw_mutex);
+   return err;
+}
+
+asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf)
+{
+   struct shmid_kernel *shp;
int err, version;
struct ipc_namespace *ns;
 
@@ -776,97 +844,13 @@ asmlinkage long sys_shmctl (int shmid, i
goto out;
}
case IPC_RMID:
-   {
-   /*
-*  We cannot simply remove the file. The SVID states
-*  that the block remains until the last person
-*  detaches from it, then is deleted. A shmat() on
-*  an RMID segment is legal in older Linux and if 
-*  we change it apps break...
-*
-*  Instead we set a destroyed flag, and then blow
-*  the name away when the usage hits zero.
-*/
-   down_write(&shm_ids(ns).rw_mutex);
-   shp = shm_lock_check_down(ns, shmid);
-   if (IS_ERR(shp)) {
-   err = PTR_ERR(shp);
-   goto out_up;
-   }
-
-   err = audit_ipc_obj(&(shp->shm_perm));
-   if (err)
-   goto out_unlock_up;
-
-   if (current->euid != shp->shm_perm.uid &&
-   current->euid != shp->shm_perm.cuid && 
-   !capable(CAP_SYS_ADMIN)) {
-   err=-EPERM;
-   goto out_unlock_up;
-   }
-
-   err = security_shm_shmctl(shp, cmd);
-   if (err)
-   goto out_unlock_up;
-
-   do_shm_rmid(ns, &shp->shm_perm);
-   up_write(&shm_ids(ns).rw_mutex);
-   goto out;
-   }
-
case IPC_SET:
-   {
-   if (!buf) {
-   err = -EFAULT;
-

[PATCH 2.6.24-mm1 1/8] (resend) IPC/semaphores: code factorisation

2008-02-12 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Trivial patch which adds some small locking functions and makes use of them
to factorize some parts of the code and to make it cleaner.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/sem.c |   61 +++--
 1 file changed, 31 insertions(+), 30 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -180,6 +180,25 @@ static inline struct sem_array *sem_lock
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline void sem_lock_and_putref(struct sem_array *sma)
+{
+   ipc_lock_by_ptr(&sma->sem_perm);
+   ipc_rcu_putref(sma);
+}
+
+static inline void sem_getref_and_unlock(struct sem_array *sma)
+{
+   ipc_rcu_getref(sma);
+   ipc_unlock(&(sma)->sem_perm);
+}
+
+static inline void sem_putref(struct sem_array *sma)
+{
+   ipc_lock_by_ptr(&sma->sem_perm);
+   ipc_rcu_putref(sma);
+   ipc_unlock(&(sma)->sem_perm);
+}
+
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
 {
ipc_rmid(&sem_ids(ns), &s->sem_perm);
@@ -698,19 +717,15 @@ static int semctl_main(struct ipc_namesp
int i;
 
if(nsems > SEMMSL_FAST) {
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return -ENOMEM;
}
 
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
+   sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma);
err = -EIDRM;
@@ -731,38 +746,30 @@ static int semctl_main(struct ipc_namesp
int i;
struct sem_undo *un;
 
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
if(nsems > SEMMSL_FAST) {
sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return -ENOMEM;
}
}
 
if (copy_from_user (sem_io, arg.array, nsems*sizeof(ushort))) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
err = -EFAULT;
goto out_free;
}
 
for (i = 0; i < nsems; i++) {
if (sem_io[i] > SEMVMX) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
err = -ERANGE;
goto out_free;
}
}
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
+   sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma);
err = -EIDRM;
@@ -1042,14 +1049,11 @@ static struct sem_undo *find_undo(struct
return ERR_PTR(PTR_ERR(sma));
 
nsems = sma->sem_nsems;
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, 
GFP_KERNEL);
if (!new) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return ERR_PTR(-ENOMEM);
}
new->semadj = (short *) &new[1];
@@ -1060,13 +1064,10 @@ static struct sem_undo *find_undo(struct
if (un) {
spin_unlock(&ulp->lock);
kfree(new);
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_u

[PATCH 2.6.24-mm1 0/8] (resend) IPC: code rewrite

2008-02-12 Thread pierre . peiffer
Hi Andrew,

This is a resend of the first part of the patchset sent two weeks
ago. This is the part about IPC which (again) proposes to consolidate
some parts of the existing code.

It does not change the behavior of the existing code, but
improves it in terms of readability and maintainability as it consolidates it
a little. As there were no objections, I think you can include them in your
-mm tree.

The patchset applies on top of "2.6.24-mm1 + previous patches about
IPC" sent over the last days (ie Nadia's patches + mine).

For information, here is the global diffstat:

 ipc/msg.c  |  184 +++--
 ipc/sem.c  |  156 ++-
 ipc/shm.c  |  176 ++
 ipc/util.c |   64 +
 ipc/util.h |3 
 5 files changed, 249 insertions(+), 334 deletions(-)


and the size of the resulting kernel:

- without the patchset:
$ size obj/vmlinux.ori
   text    data     bss     dec     hex filename
1903257  175820  122880 2201957  219965 obj/vmlinux.ori

- with the patchset:
$ size obj/vmlinux
   text    data     bss     dec     hex filename
1902917  175820  122880 2201617  219811 obj/vmlinux


-- 
Pierre Peiffer


[PATCH 2.6.24-mm1] IPC: use ipc_buildid() directly from ipc_addid()

2008-02-08 Thread Pierre Peiffer

Hi,

Continuing to consolidate the IPC code a little, each id can be built
directly in ipc_addid() instead of being built by each caller of
ipc_addid().

And I also remove shm_addid() in order to have, as much as possible, the
same code for shm/sem/msg.
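
(For reference, ipc_buildid() just combines the idr slot index with the
per-table sequence number; in ipc/util.h it reads roughly as:

    #define SEQ_MULTIPLIER  (IPCMNI)

    static inline int ipc_buildid(int id, int seq)
    {
            return SEQ_MULTIPLIER * seq + id;
    }

so ipc_addid(), where both values are known, is the natural place to do it
once.)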

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 ipc/msg.c  |2 --
 ipc/sem.c  |2 --
 ipc/shm.c  |   10 +-
 ipc/util.c |1 +
 4 files changed, 2 insertions(+), 13 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -70,7 +70,6 @@ struct msg_sender {
 #define msg_ids(ns)((ns)->ids[IPC_MSG_IDS])
 
 #define msg_unlock(msq)ipc_unlock(&(msq)->q_perm)
-#define msg_buildid(id, seq)   ipc_buildid(id, seq)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *);
@@ -186,7 +185,6 @@ static int newque(struct ipc_namespace *
return id;
}
 
-   msq->q_perm.id = msg_buildid(id, msq->q_perm.seq);
msq->q_stime = msq->q_rtime = 0;
msq->q_ctime = get_seconds();
msq->q_cbytes = msq->q_qnum = 0;
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -91,7 +91,6 @@
 
 #define sem_unlock(sma)ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)ipc_checkid(&sma->sem_perm, semid)
-#define sem_buildid(id, seq)   ipc_buildid(id, seq)
 
 static int newary(struct ipc_namespace *, struct ipc_params *);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
@@ -268,7 +267,6 @@ static int newary(struct ipc_namespace *
}
ns->used_sems += nsems;
 
-   sma->sem_perm.id = sem_buildid(id, sma->sem_perm.seq);
sma->sem_base = (struct sem *) &sma[1];
/* sma->sem_pending = NULL; */
sma->sem_pending_last = &sma->sem_pending;
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -60,7 +60,6 @@ static struct vm_operations_struct shm_v
 
 #define shm_unlock(shp)\
ipc_unlock(&(shp)->shm_perm)
-#define shm_buildid(id, seq)   ipc_buildid(id, seq)
 
 static int newseg(struct ipc_namespace *, struct ipc_params *);
 static void shm_open(struct vm_area_struct *vma);
@@ -169,12 +168,6 @@ static inline void shm_rmid(struct ipc_n
ipc_rmid(&shm_ids(ns), &s->shm_perm);
 }
 
-static inline int shm_addid(struct ipc_namespace *ns, struct shmid_kernel *shp)
-{
-   return ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
-}
-
-
 
 /* This is called by fork, once for every shm attach. */
 static void shm_open(struct vm_area_struct *vma)
@@ -417,7 +410,7 @@ static int newseg(struct ipc_namespace *
if (IS_ERR(file))
goto no_file;
 
-   id = shm_addid(ns, shp);
+   id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
if (id < 0) {
error = id;
goto no_id;
@@ -429,7 +422,6 @@ static int newseg(struct ipc_namespace *
shp->shm_ctim = get_seconds();
shp->shm_segsz = size;
shp->shm_nattch = 0;
-   shp->shm_perm.id = shm_buildid(id, shp->shm_perm.seq);
shp->shm_file = file;
/*
 * shmid gets reported as "inode#" in /proc/pid/maps.
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -231,6 +231,7 @@ int ipc_addid(struct ipc_ids* ids, struc
if(ids->seq > ids->seq_max)
    ids->seq = 0;
 
+   new->id =  ipc_buildid(id, new->seq);
spin_lock_init(&new->lock);
new->deleted = 0;
rcu_read_lock();

-- 
Pierre Peiffer


Re: [PATCH 2.6.24-rc8-mm1 09/15] (RFC) IPC: new kernel API to change an ID

2008-02-08 Thread Pierre Peiffer


Serge E. Hallyn wrote:
> 
> But note that in either case we need to deal with a bunch of locking.
> So getting back to Pierre's patchset, IIRC 1-8 are cleanups worth
> doing no matter 1.  9-11 sound like they are contentuous until
> we decide whether we want to go with a create_with_id() type approach
> or a set_id().  12 is IMO a good locking cleanup regardless.  13 and
> 15 are contentous until we decide whether we want userspace-controlled
> checkpoint or a one-shot fs.  14 IMO is useful for both c/r approaches.
> 
> Is that pretty accurate?
> 

OK, so far, the discussion remains open about the new functionalities for
c/r.

As there were no objections about the first patches, which rewrite/enhance the
existing code, Andrew, could you consider them (ie patches 1 to 8 of this
series) for inclusion in -mm? (I mean, as soon as possible, as I guess
you're pretty busy right now with the merge for 2.6.25.)

If you prefer, I can resend them separately.

Thanks,

Pierre


>> It isn't strictly necessary to export a new interface in order to
>> support checkpoint/restart. **. Hence, I think that the speculation
>> "we may need it in the future" is too abstract and isn't a good
>> excuse to commit to a new, currently unneeded, interface.
> 
> OTOH it did succeed in starting some conversation :)
> 
>> Should the
>> need arise in the future, it will be easy to design a new interface
>> (also based on aggregated experience until then).
> 
> What aggregated experience?  We have to start somewhere...
> 
>> ** In fact, the suggested interface may prove problematic (as noted
>> earlier in this thread): if you first create the resource with some
>> arbitrary identifier and then modify the identifier (in our case,
>> IPC id), then the restart procedure is bound to execute sequentially,
>> because of lack of atomicity.
> 
> Hmm?  Lack of atomicity wrt what?  All the tasks being restarted were
> checkpointed at the same time so there will be no conflict in the
> requested IDs, so I don't know what you're referring to.
> 
>> That said, I suggest the following method instead (this is the method
>> we use in Zap to determine the desired resource identifier when a new
>> resource is allocated; I recall that we had discussed it in the past,
>> perhaps the mini-summit in september ?):
>>
>> 1) The process/thread tells the kernel that it wishes to pre-determine
>> the resource identifier of a subsequent call (this can be done via a
>> new syscall, or by writing to /proc/self/...).
>>
>> 2) Each system call that allocates a resource and assigns an identifier
>> is modified to check this per-thread field first; if it is set then
>> it will attempt to allocate that particular value (if already taken,
>> return an error, eg. EBUSY). Otherwise it will proceed as it is today.
> 
> But I thought you were just advocating a one-shot filesystem approach
> for c/r, so we wouldn't be creating the resources piecemeal?
> 
> The /proc/self approach is one way to go, it has been working for LSMs
> this long.  I'd agree that it would be nice if we could have a
> consistent interface to the create_with_id()/set_id() problem.  A first
> shot addressing ipcs and pids would be a great start.
> 
>> (I left out some details - eg. the kernel will keep the desired value
>> in a per-thread field, when it will be reset, whether we want to also
>> tag the field with its type and so on, but the idea is now clear).
>>
>> The main two advantages are that first, we don't need to devise a new
>> method for every syscall that allocates said resources (sigh... just
> 
> Agreed.
> 
>> think of clone() nightmare to add a new argument);
> 
> Yes, and then there will need to be the clone_with_pid() extension on
> top of that.
> 
>> second, the change
>> is incremental: first code the mechanism to set the field, then add
>> support in the IPC subsystem, later in the DEVPTS, then in clone and
>> so forth.
>>
>> Oren.
>>
>> Pierre Peiffer wrote:
>>> Kirill Korotaev wrote:
>>>> Why user space can need this API? for checkpointing only?
>>> I would say "at least for checkpointing"... ;) Maybe someone else will
>>> find an interest in this for something else.
>>> In fact, I'm sure that you have some interest in checkpointing; and thus,
>>> you probably have some ideas in mind; but whatever solution you propose,
>>> I'm pretty sure that I could say the same thing about it.
>>> And what I finally think is: even if it's for "checkpointing only", if
>>> many people are interested in this, it may be sufficient to push it?
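
PS: to make Oren's two-step scheme quoted above more concrete, it could look
roughly like this. This is a sketch only: the per-task field next_id and the
function ipc_addid_hinted() are illustrative names, nothing of this exists
in the posted patchset.

    /* 'next_id' would be set from userspace, eg by writing to
     * /proc/self/next_id, and consumed by the first allocation
     * that follows. */
    static int ipc_addid_hinted(struct ipc_ids *ids,
                                struct kern_ipc_perm *new, int size)
    {
            int wanted = current->next_id;

            if (wanted < 0)         /* no hint: behave as today */
                    return ipc_addid(ids, new, size);

            current->next_id = -1;  /* the hint is one-shot */

            if (idr_find(&ids->ipcs_idr, wanted % SEQ_MULTIPLIER))
                    return -EBUSY;  /* requested slot already taken */

            /* insert at exactly wanted % SEQ_MULTIPLIER and force
             * new->seq to wanted / SEQ_MULTIPLIER, so that
             * ipc_buildid() rebuilds the wanted id */
            ...
            return wanted;
    }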

[PATCH 2.6.24-mm1] error compiling net driver NE2000/NE1000

2008-02-04 Thread Pierre Peiffer
Hi,

When I compile the kernel 2.6.24-mm1 with:
CONFIG_NET_ISA=y
CONFIG_NE2000=y

I have the following compile error:
...
  GEN .version
  CHK include/linux/compile.h
  UPD include/linux/compile.h
  CC  init/version.o
  LD  init/built-in.o
  LD  .tmp_vmlinux1
drivers/built-in.o: In function `ne_block_output':
linux-2.6.24-mm1/drivers/net/ne.c:797: undefined reference to `NS8390_init'
drivers/built-in.o: In function `ne_drv_resume':
linux-2.6.24-mm1/drivers/net/ne.c:858: undefined reference to `NS8390_init'
drivers/built-in.o: In function `ne_probe1':
linux-2.6.24-mm1/drivers/net/ne.c:539: undefined reference to `NS8390_init'
make[1]: *** [.tmp_vmlinux1] Error 1
make: *** [sub-make] Error 2

As I saw that the file 8390p.c is compiled for this driver, but not the file
8390.c, which contains the function NS8390_init(), I fixed this error with
the following patch.

As NS8390p_init() does the same thing as NS8390_init(), I suppose that this
is the right fix?

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 drivers/net/ne.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: b/drivers/net/ne.c
===
--- a/drivers/net/ne.c
+++ b/drivers/net/ne.c
@@ -536,7 +536,7 @@ static int __init ne_probe1(struct net_d
 #ifdef CONFIG_NET_POLL_CONTROLLER
dev->poll_controller = eip_poll;
 #endif
-   NS8390_init(dev, 0);
+   NS8390p_init(dev, 0);
 
ret = register_netdev(dev);
if (ret)
@@ -794,7 +794,7 @@ retry:
if (time_after(jiffies, dma_start + 2*HZ/100)) {
/* 20ms */
printk(KERN_WARNING "%s: timeout waiting for Tx 
RDC.\n", dev->name);
ne_reset_8390(dev);
-   NS8390_init(dev,1);
+   NS8390p_init(dev,1);
break;
}
 
@@ -855,7 +855,7 @@ static int ne_drv_resume(struct platform
 
if (netif_running(dev)) {
ne_reset_8390(dev);
-   NS8390_init(dev, 1);
+   NS8390p_init(dev, 1);
netif_device_attach(dev);
}
return 0;

-- 
Pierre


Re: [PATCH 2.6.24-rc8-mm1 00/15] IPC: code rewrite + new functionalities

2008-02-04 Thread Pierre Peiffer


Pavel Machek wrote:
> Hi!
> 
>> * Patches 9 to 15 propose to add some functionalities, and thus are
>> submitted here for RFC, about both the interest and their implementation.
>> These functionalities are:
>> - Two new control-commands:
>>  . IPC_SETID: to change an IPC's id.
>>  . IPC_SETALL: behaves as IPC_SET, except that it also sets all time
>> and pid values.
>> - add a /proc/<pid>/semundo file to read and write the undo values of
>> some semaphores for a given process.
>>
>>  As the namespaces and the "containers" are being integrated in the
>> kernel, these functionalities may be a first step to implement the
>> checkpoint/restart of an application: in fact the existing API does not allow
>> an ID to be specified or changed when creating an IPC, as needed when
>> restarting an application, and the times/pids values of each IPC are also
>> altered. Maybe someone will find another interest in this?
>>
>> So again, comments are welcome.
> 
> Checkpoint/restart is nice, but... sysV ipc is broken by design, do we
> really want to extend it?

If we want to support all kinds of applications, yes, we must also support
SysV IPC. We must support all kernel subsystems in the end.
I've started with IPC because it's relatively simple and isolated.


-- 
Pierre


Re: [PATCH 2.6.24-rc8-mm1 14/15] (RFC) IPC/semaphores: prepare semundo code to work on another task than current

2008-02-01 Thread Pierre Peiffer


Serge E. Hallyn wrote:
> Quoting Pierre Peiffer ([EMAIL PROTECTED]):
>>
>> Serge E. Hallyn wrote:
>>> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
>>>> From: Pierre Peiffer <[EMAIL PROTECTED]>
>>>>
>>>> In order to modify the semundo-list of a task from procfs, we must be able 
>>>> to
>>>> work on any target task.
>>>> But all the existing code playing with the semundo-list, currently works
>>>> only on the 'current' task, and does not allow to specify any target task.
>>>>
>>>> This patch changes all these routines to allow them to work on a specified
>>>> task, passed in parameter, instead of current.
>>>>
>>>> This is mainly a preparation for the semundo_write() operation, on the
>>>> /proc/<pid>/semundo file, as provided in the next patch.
>>>>
>>>> Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
>>>> ---
>>>>
>>>>  ipc/sem.c |   90 
>>>> ++
>>>>  1 file changed, 68 insertions(+), 22 deletions(-)
>>>>
>>>> Index: b/ipc/sem.c
>>>> ===
>>>> --- a/ipc/sem.c
>>>> +++ b/ipc/sem.c
>>>> @@ -1017,8 +1017,9 @@ asmlinkage long sys_semctl (int semid, i
>>>>  }
>>>>
>>>>  /* If the task doesn't already have a undo_list, then allocate one
>>>> - * here.  We guarantee there is only one thread using this undo list,
>>>> - * and current is THE ONE
>>>> + * here.
>>>> + * The target task (tsk) is current in the general case, except when
>>>> + * accessed from the procfs (ie when writing to /proc/<pid>/semundo)
>>>>   *
>>>>   * If this allocation and assignment succeeds, but later
>>>>   * portions of this code fail, there is no need to free the sem_undo_list.
>>>> @@ -1026,22 +1027,60 @@ asmlinkage long sys_semctl (int semid, i
>>>>   * at exit time.
>>>>   *
>>>>   * This can block, so callers must hold no locks.
>>>> + *
>>>> + * Note: task_lock is used to synchronize 1. several possible concurrent
>>>> + * creations and 2. the free of the undo_list (done when the task using it
>>>> + * exits). In the second case, we check the PF_EXITING flag to not create
>>>> + * an undo_list for a task which has exited.
>>>> + * If there already is an undo_list for this task, there is no need
>>>> + * to hold the task-lock to retrieve it, as the pointer cannot change
>>>> + * afterwards.
>>>>   */
>>>> -static inline int get_undo_list(struct sem_undo_list **undo_listp)
>>>> +static inline int get_undo_list(struct task_struct *tsk,
>>>> +  struct sem_undo_list **ulp)
>>>>  {
>>>> -  struct sem_undo_list *undo_list;
>>>> +  if (tsk->sysvsem.undo_list == NULL) {
>>>> +  struct sem_undo_list *undo_list;
>>> Hmm, this is weird.  If there was no undo_list and
>>> tsk!=current, you set the refcnt to 2.  But if there was an
>>> undo list and tsk!=current, where do you inc the refcnt?
>>>
>> I inc it outside this function, as I don't call get_undo_list() if there
>> is an undo_list.
>> This appears most clearly in the next patch, in semundo_open() for example.
> 
> Ok, so however unlikely, there is a flow that could cause you a problem:
> T2 calls semundo_open() for T1.  T1 does not yet have a semundolist.
> T2.semundo_open() calls get_undo_list, just then T1 creates its own
> semundo_list.  T2 comes to the top of get_undo_list(), sees
> tsk->sysvsem.undo_list != NULL, and simply returns a pointer to the
> undo_list.  Now you never increment the count.
> 
Right.

And yesterday, with more testing in the corners, I found another issue: if I
use /proc/self/semundo, tsk == current and the refcnt is wrong too.
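
One way to close both holes may be to take the reference unconditionally
inside get_undo_list(), under task_lock(), something like this (a sketch of
the idea only, not the final code):

    task_lock(tsk);
    undo_list = tsk->sysvsem.undo_list;
    if (undo_list) {
            /* take the reference here, under task_lock(), whether
             * or not tsk == current, so that neither a concurrent
             * creation nor a task exit can be missed */
            atomic_inc(&undo_list->refcnt);
            task_unlock(tsk);
            goto out;
    }
    task_unlock(tsk);
    /* otherwise fall through to the allocation path, which
     * already rechecks under task_lock() */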

Thanks for finding this !

P.

>>>> -  undo_list = current->sysvsem.undo_list;
>>>> -  if (!undo_list) {
>>>> -  undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
>>>> +  /* we must alloc a new one */
>>>> +  undo_list = kmalloc(sizeof(*undo_list), GFP_KERNEL);
>>>>if (undo_list == NULL)
>>>>return -ENOMEM;
>>>> +
>>>> +  task_lock(tsk);
>

Re: [PATCH 2.6.24-rc8-mm1 09/15] (RFC) IPC: new kernel API to change an ID

2008-01-31 Thread Pierre Peiffer


Kirill Korotaev wrote:
> Why user space can need this API? for checkpointing only?

I would say "at least for checkpointing"... ;) Maybe someone else will find an
interest in this for something else.
In fact, I'm sure that you have some interest in checkpointing; and thus, you
probably have some ideas in mind; but whatever solution you propose,
I'm pretty sure that I could say the same thing about it.
And what I finally think is: even if it's for "checkpointing only", if many
people are interested in this, it may be sufficient to push it?

> Then I would not consider it for inclusion until it is clear how to implement
> checkpointing.
> As for me personally - I'm against exporting such APIs, since they are not
> needed in real-life user space applications and maintaining them forever for
> compatibility isn't worth it.

Maintaining these patches is not a big deal, really, but this is not the main
point; the "need in real life" (1) is in fact the main one, and "is this
solution the best one?" (2) the second.

About (1), as said in my first mail, as the namespaces and containers are being
integrated into the mainline kernel, checkpoint/restart is (or will be) the next
need.
About (2), my solution proposes to do that, as much as possible, from userspace,
to minimize the kernel impact. Of course, this is subject to discussion. My
opinion is that doing a full checkpoint/restart from kernel space will need a
lot of new specific and intrusive code; I'm not sure that this would be
acceptable to the community. But this is my opinion only. Discussion is open.

> Also such APIs allow creation of non-GPL checkpointing in user-space, which 
> can be of concern as well.

Honestly, I don't think this is really a concern at all. I mean: I've never
seen "this allows non-GPL binaries and thus this is bad" as an argument to
reject a functionality, but I may be wrong, and thus it can be discussed as
well. I think points (1) and (2) as stated above are the key ones.

Pierre

> Kirill
> 
> 
> Pierre Peiffer wrote:
>> Hi again,
>>
>>  Thinking more about this, I think I must clarify why I chose this way.
>> In fact, the idea of these patches is to provide the missing user APIs (or
>> extend the existing ones) to allow setting or updating _all_ properties of
>> all IPCs, as needed in the case of the checkpoint/restart of an application
>> (the current user API does not allow an ID to be specified for a created
>> IPC, for example). And this, without changing the existing API of course.
>>
>>  And msgget(), semget() and shmget() do not have any parameter we can
>> use to specify an ID.
>>  That's why I've decided not to change these routines and to add a new
>> control command, IPC_SETID, with which we can change the ID of an IPC.
>> (That looks more straightforward and logical to me.)
>>
>>  Now, this patch is, in fact, only a preparation for patch 10/15,
>> which really completes the user API by adding this IPC_SETID command.
>>
>> (... continuing below ...)
>>
>> Alexey Dobriyan wrote:
>>> On Tue, Jan 29, 2008 at 05:02:38PM +0100, [EMAIL PROTECTED] wrote:
>>>> This patch provides three new API to change the ID of an existing
>>>> System V IPCs.
>>>>
>>>> These APIs are:
>>>>long msg_chid(struct ipc_namespace *ns, int id, int newid);
>>>>long sem_chid(struct ipc_namespace *ns, int id, int newid);
>>>>long shm_chid(struct ipc_namespace *ns, int id, int newid);
>>>>
>>>> They return 0 or an error code in case of failure.
>>>>
>>>> They may be useful for setting a specific ID for an IPC when preparing
>>>> a restart operation.
>>>>
>>>> To be successful, the following rules must be respected:
>>>> - the IPC exists (of course...)
>>>> - the new ID must satisfy the ID computation rule.
>>>> - the entry in the idr corresponding to the new ID must be free.
>>>>  ipc/util.c  |   48 
>>>> 
>>>>  ipc/util.h  |1 +
>>>>  8 files changed, 197 insertions(+)
>>> For the record, OpenVZ uses a "create with predefined ID" method which
>>> leads to less code. For example, the change at the end is all we want from
>>> ipc/util.c .
>> And in fact, you do that from kernel space; you don't have the constraint
>> of fitting the existing user API.
>> Again, this patch, even if it presents a new kernel API, is in fact a
>> preparation for the next patch which introduces a new user API.
>>
>> Do you think that this could fit your need?
>>
> 
> 

-- 
Pierre Peiffer


Re: [PATCH 2.6.24-rc8-mm1 05/15] IPC/semaphores: remove one unused parameter from semctl_down()

2008-01-31 Thread Pierre Peiffer


Nadia Derbey wrote:
> [EMAIL PROTECTED] wrote:
>> From: Pierre Peiffer <[EMAIL PROTECTED]>
>>
>> semctl_down() takes one unused parameter: semnum.
>> This patch proposes to get rid of it.
>>
>> Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
>> Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
>> ---
>>  ipc/sem.c |6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> Index: b/ipc/sem.c
>> ===
>> --- a/ipc/sem.c
>> +++ b/ipc/sem.c
>> @@ -882,8 +882,8 @@ static inline unsigned long copy_semid_f
>>   * to be held in write mode.
>>   * NOTE: no locks must be held, the rw_mutex is taken inside this
>> function.
>>   */
>> -static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
>> -int cmd, int version, union semun arg)
>> +static int semctl_down(struct ipc_namespace *ns, int semid,
>> +   int cmd, int version, union semun arg)
>>  {
>>  struct sem_array *sma;
>>  int err;
>> @@ -974,7 +974,7 @@ asmlinkage long sys_semctl (int semid, i
>>  return err;
>>  case IPC_RMID:
>>  case IPC_SET:
>> -err = semctl_down(ns,semid,semnum,cmd,version,arg);
>> +err = semctl_down(ns, semid, cmd, version, arg);
>>  return err;
>>  default:
>>  return -EINVAL;
>>
> 
> Looks like semnum is only used in semctl_main(). Why not remove it
> from semctl_nolock() too?

Indeed.
In fact, I already fixed that in a previous patch, included in -mm since kernel
2.6.24-rc3-mm2 (patch named ipc-semaphores-consolidate-sem_stat-and.patch).

-- 
Pierre


Re: [PATCH 2.6.24-rc8-mm1 12/15] (RFC) IPC/semaphores: make use of RCU to free the sem_undo_list

2008-01-31 Thread Pierre Peiffer


Serge E. Hallyn wrote:
> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
>> From: Pierre Peiffer <[EMAIL PROTECTED]>
>>
>> Today, the sem_undo_list is freed when the last task using it exits.
>> There is no mechanism in place that allows safe concurrent access to
>> the sem_undo_list of a target task and protects efficiently against a
>> task exit.
>>
>> That is okay for now as we don't need this.
>>
>> As I would like to provide a /proc interface to access this data, I need
>> such a safe access, without blocking the target task if possible. 
>>
>> This patch proposes to introduce the use of RCU to delay the real free of
>> these sem_undo_list structures. They can then be accessed in a safe manner
>> by any task inside a read critical section, this way:
>>
>>  struct sem_undo_list *undo_list;
>>  int ret;
>>  ...
>>  rcu_read_lock();
>>  undo_list = rcu_dereference(task->sysvsem.undo_list);
>>  if (undo_list)
>>  ret = atomic_inc_not_zero(&undo_list->refcnt);
>>  rcu_read_unlock();
>>  ...
>>  if (undo_list && ret) {
>>  /* section where undo_list can be used quietly */
>>  ...
>>  }
>>  ...
> 
> And of course then
> 
>   if (atomic_dec_and_test(&undo_list->refcnt))
>   free_semundo_list(undo_list);
> 
> by that task.
> 

I will make this explicit too.
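
To spell out the complete life cycle, the full get/put sequence would then
look like this (a sketch; free_semundo_list() is the helper that actually
frees the list, as in your snippet above):

    struct sem_undo_list *undo_list;
    int got = 0;

    rcu_read_lock();
    undo_list = rcu_dereference(task->sysvsem.undo_list);
    if (undo_list)
            got = atomic_inc_not_zero(&undo_list->refcnt);
    rcu_read_unlock();

    if (got) {
            /* undo_list can be used safely here, even if the
             * target task exits in the meantime */
            ...
            /* drop the reference; the last user does the free */
            if (atomic_dec_and_test(&undo_list->refcnt))
                    free_semundo_list(undo_list);
    }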

>> Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
> 
> Looks correct in terms of locking/refcounting.
> 
> Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
> 

Thanks !

-- 
Pierre



Re: [PATCH 2.6.24-rc8-mm1 14/15] (RFC) IPC/semaphores: prepare semundo code to work on another task than current

2008-01-31 Thread Pierre Peiffer


Serge E. Hallyn wrote:
> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
>> From: Pierre Peiffer <[EMAIL PROTECTED]>
>>
>> In order to modify the semundo-list of a task from procfs, we must be able to
>> work on any target task.
>> But all the existing code playing with the semundo-list, currently works
>> only on the 'current' task, and does not allow to specify any target task.
>>
>> This patch changes all these routines to allow them to work on a specified
>> task, passed in parameter, instead of current.
>>
>> This is mainly a preparation for the semundo_write() operation, on the
>> /proc//semundo file, as provided in the next patch.
>>
>> Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
>> ---
>>
>>  ipc/sem.c |   90 
>> ++
>>  1 file changed, 68 insertions(+), 22 deletions(-)
>>
>> Index: b/ipc/sem.c
>> ===
>> --- a/ipc/sem.c
>> +++ b/ipc/sem.c
>> @@ -1017,8 +1017,9 @@ asmlinkage long sys_semctl (int semid, i
>>  }
>>
>>  /* If the task doesn't already have a undo_list, then allocate one
>> - * here.  We guarantee there is only one thread using this undo list,
>> - * and current is THE ONE
>> + * here.
>> + * The target task (tsk) is current in the general case, except when
>> + * accessed from the procfs (ie when writing to /proc/<pid>/semundo)
>>   *
>>   * If this allocation and assignment succeeds, but later
>>   * portions of this code fail, there is no need to free the sem_undo_list.
>> @@ -1026,22 +1027,60 @@ asmlinkage long sys_semctl (int semid, i
>>   * at exit time.
>>   *
>>   * This can block, so callers must hold no locks.
>> + *
>> + * Note: task_lock is used to synchronize 1. several possible concurrent
>> + * creations and 2. the free of the undo_list (done when the task using it
>> + * exits). In the second case, we check the PF_EXITING flag to not create
>> + * an undo_list for a task which has exited.
>> + * If there already is an undo_list for this task, there is no need
>> + * to hold the task-lock to retrieve it, as the pointer cannot change
>> + * afterwards.
>>   */
>> -static inline int get_undo_list(struct sem_undo_list **undo_listp)
>> +static inline int get_undo_list(struct task_struct *tsk,
>> +struct sem_undo_list **ulp)
>>  {
>> -struct sem_undo_list *undo_list;
>> +if (tsk->sysvsem.undo_list == NULL) {
>> +struct sem_undo_list *undo_list;
> 
> Hmm, this is weird.  If there was no undo_list and
> tsk!=current, you set the refcnt to 2.  But if there was an
> undo list and tsk!=current, where do you inc the refcnt?
> 

I inc it outside this function, as I don't call get_undo_list() if there
already is an undo_list.
This appears most clearly in the next patch, in semundo_open() for example.

>> -undo_list = current->sysvsem.undo_list;
>> -if (!undo_list) {
>> -undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
>> +/* we must alloc a new one */
>> +undo_list = kmalloc(sizeof(*undo_list), GFP_KERNEL);
>>  if (undo_list == NULL)
>>  return -ENOMEM;
>> +
>> +task_lock(tsk);
>> +
>> +/* check again if there is an undo_list for this task */
>> +if (tsk->sysvsem.undo_list) {
>> +if (tsk != current)
>> +atomic_inc(&tsk->sysvsem.undo_list->refcnt);
>> +task_unlock(tsk);
>> +kfree(undo_list);
>> +goto out;
>> +}
>> +
>>  spin_lock_init(&undo_list->lock);
>> -atomic_set(&undo_list->refcnt, 1);
>> -undo_list->ns = get_ipc_ns(current->nsproxy->ipc_ns);
>> -current->sysvsem.undo_list = undo_list;
>> +/*
>> + * If tsk is not current (meaning that current is creating
>> + * a semundo_list for a target task through procfs), and if
>> + * it's not being exited then refcnt must be 2: the target
>> + * task tsk + current.
>> + */
>> +if (tsk == current)
>> +atomic_set(&undo_list->refcnt, 1);
>> +else if (!(tsk->flags & PF_EXITING))
>> +atomic_set(&undo_list->refcnt, 2);

Re: [PATCH 2.6.24-rc8-mm1 09/15] (RFC) IPC: new kernel API to change an ID

2008-01-31 Thread Pierre Peiffer
Hi again,

Thinking more about this, I think I must clarify why I chose this way.
In fact, the idea of these patches is to provide the missing user APIs (or
extend the existing ones) that allow setting or updating _all_ properties of
all IPCs, as needed for the checkpoint/restart of an application (the
current user API does not allow specifying an ID for a created IPC, for
example). And this without changing the existing API, of course.

And msgget(), semget() and shmget() do not have any parameter we can use to
specify an ID.
That's why I've decided not to change these routines but to add a new
control command, IPC_SETID, with which we can change the ID of an IPC (that
looks more straightforward and logical to me).

Now, this patch is, in fact, only a preparation for patch 10/15, which
really completes the user API by adding this IPC_SETID command.

(... continuing below ...)

Alexey Dobriyan wrote:
> On Tue, Jan 29, 2008 at 05:02:38PM +0100, [EMAIL PROTECTED] wrote:
>> This patch provides three new APIs to change the ID of an existing
>> System V IPC.
>>
>> These APIs are:
>>  long msg_chid(struct ipc_namespace *ns, int id, int newid);
>>  long sem_chid(struct ipc_namespace *ns, int id, int newid);
>>  long shm_chid(struct ipc_namespace *ns, int id, int newid);
>>
>> They return 0 or an error code in case of failure.
>>
>> They may be useful for setting a specific ID for an IPC when preparing
>> a restart operation.
>>
>> To be successful, the following rules must be respected:
>> - the IPC exists (of course...)
>> - the new ID must satisfy the ID computation rule.
>> - the entry in the idr corresponding to the new ID must be free.
> 
>>  ipc/util.c  |   48 
>>  ipc/util.h  |1 +
>>  8 files changed, 197 insertions(+)
> 
> For the record, OpenVZ uses "create with predefined ID" method which
> leads to less code. For example, change at the end is all we want from
> ipc/util.c .

And in fact, you do that from kernel space, so you don't have the constraint
of fitting the existing user API.
Again, this patch, even if it presents a new kernel API, is in fact a
preparation for the next patch, which introduces a new user API.

Do you think that this could fit your need?

-- 
Pierre


Re: [PATCH 2.6.24-rc8-mm1 09/15] (RFC) IPC: new kernel API to change an ID

2008-01-30 Thread Pierre Peiffer


Alexey Dobriyan wrote:
> On Tue, Jan 29, 2008 at 05:02:38PM +0100, [EMAIL PROTECTED] wrote:
>> This patch provides three new APIs to change the ID of an existing
>> System V IPC.
>>
>> These APIs are:
>>  long msg_chid(struct ipc_namespace *ns, int id, int newid);
>>  long sem_chid(struct ipc_namespace *ns, int id, int newid);
>>  long shm_chid(struct ipc_namespace *ns, int id, int newid);
>>
>> They return 0 or an error code in case of failure.
>>
>> They may be useful for setting a specific ID for an IPC when preparing
>> a restart operation.
>>
>> To be successful, the following rules must be respected:
>> - the IPC exists (of course...)
>> - the new ID must satisfy the ID computation rule.
>> - the entry in the idr corresponding to the new ID must be free.
> 
>>  ipc/util.c  |   48 
>>  ipc/util.h  |1 +
>>  8 files changed, 197 insertions(+)
> 
> For the record, OpenVZ uses "create with predefined ID" method which
> leads to less code. For example, change at the end is all we want from
> ipc/util.c .
> 

Yes, indeed, I saw that. The idea here is, in the end, to propose a more
"userspace-oriented" solution.
As we can't use the msgget(), semget() and shmget() APIs to specify an ID, I
think we can at least allow changing it afterwards.

> Also, if ids were A and B at the moment of checkpoint, and during
> restart they became B and A you'll get collision in both ways which you
> technically can avoid by the classic "tmp = A, A = B, B = tmp"

In the general case, yes, you're right.
In the case of checkpoint/restart, this is not necessarily a problem, as we
will probably restart an application in an empty "container"/"namespace"; thus
we can create all needed IPCs in an empty IPC namespace like this:
1. create the first IPC
2. change its ID
3. create the second IPC
4. change its ID
5. etc.

But yes, I agree that if we could directly create an IPC with the right ID, it
would be better; maybe with an IPC_CREATE command or something like that, if
the direction is to do that from userspace.

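As an illustration, here is a sketch (untested, and assuming the IPC_SETID
command from patch 10/15 is available; 'key' and 'saved_id' stand for values
recorded at checkpoint time) of steps 1 and 2 for a message queue, with the
new ID passed through msgctl's third argument:

	#include <sys/msg.h>

	int id = msgget(key, IPC_CREAT | IPC_EXCL | 0600);	/* 1. create the IPC */
	if (id >= 0 &&
	    msgctl(id, IPC_SETID, (struct msqid_ds *)(long)saved_id) != 0)
		perror("IPC_SETID");				/* 2. change its ID */
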
-- 
Pierre Peiffer


[PATCH 2.6.24-rc8-mm1 15/15] (RFC) IPC/semaphores: add write() operation to semundo file in procfs

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch adds the write operation to the semundo file.
This write operation allows root to add or update the semundo entries and
their values for a given process.

The user must provide lines, each containing a semaphore ID followed by the
semaphore values to undo.

The operation fails if the given semaphore ID does not exist or if the
number of values does not match the number of semaphores in the array.

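As a sketch of the expected usage (illustrative values only: 1234 is a
hypothetical PID, 42 the semid of a 3-semaphore array, and "1 0 -1" the
per-semaphore undo values):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int fd = open("/proc/1234/semundo", O_WRONLY);
	if (fd >= 0) {
		static const char line[] = "42 1 0 -1\n";
		if (write(fd, line, sizeof(line) - 1) < 0)
			perror("semundo write");
		close(fd);
	}
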
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 fs/proc/base.c |2 
 ipc/sem.c  |  232 +++--
 2 files changed, 227 insertions(+), 7 deletions(-)

Index: b/fs/proc/base.c
===
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2256,7 +2256,7 @@ static const struct pid_entry tgid_base_
INF("io",   S_IRUGO, pid_io_accounting),
 #endif
 #ifdef CONFIG_SYSVIPC
-   REG("semundo",   S_IRUGO, semundo),
+   REG("semundo",   S_IWUSR|S_IRUGO, semundo),
 #endif
 };
 
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1580,6 +1580,9 @@ static struct seq_operations semundo_op 
 
 /*
  * semundo_open: open operation for /proc//semundo file
+ *
+ * If the file is opened in write mode and no semundo list exists for
+ * this target PID, it is created here.
  */
 static int semundo_open(struct inode *inode, struct file *file)
 {
@@ -1598,18 +1601,31 @@ static int semundo_open(struct inode *in
undo_list = rcu_dereference(task->sysvsem.undo_list);
if (undo_list)
ret = !atomic_inc_not_zero(&undo_list->refcnt);
-   put_task_struct(task);
}
rcu_read_unlock();
 
-   if (!task || ret)
+   if (!task)
return -EINVAL;
 
-   ret = seq_open(file, &semundo_op);
+   if (ret) {
+   put_task_struct(task);
+   return -EINVAL;
+   }
+
+
+   /* Create an undo_list if needed and if file is opened in write mode */
+   if (!undo_list && (file->f_flags & O_WRONLY || file->f_flags & O_RDWR))
+   ret = get_undo_list(task, &undo_list);
+
+   put_task_struct(task);
+
if (!ret) {
-   struct seq_file *m = file->private_data;
-   m->private = undo_list;
-   return 0;
+   ret = seq_open(file, &semundo_op);
+   if (!ret) {
+   struct seq_file *m = file->private_data;
+   m->private = undo_list;
+   return 0;
+   }
}
 
if (undo_list && atomic_dec_and_test(&undo_list->refcnt))
@@ -1617,6 +1633,209 @@ static int semundo_open(struct inode *in
return ret;
 }
 
+/* Skip all spaces at the beginning of the buffer */
+static inline int skip_space(const char __user **buf, size_t *len)
+{
+   char c = 0;
+   while (*len) {
+   if (get_user(c, *buf))
+   return -EFAULT;
+   if (c != '\t' && c != ' ')
+   break;
+   --*len;
+   ++*buf;
+   }
+   return c;
+}
+
+/* Retrieve the first numerical value contained in the string.
+ * Note: The value is supposed to be a 32-bit integer.
+ */
+static inline int get_next_value(const char __user **buf, size_t *len, int *val)
+{
+#define BUFLEN 11
+   int err, neg = 0, left;
+   char s[BUFLEN], *p;
+
+   err = skip_space(buf, len);
+   if (err < 0)
+   return err;
+   if (!*len)
+   return INT_MAX;
+   if (err == '\n') {
+   ++*buf;
+   --*len;
+   return INT_MAX;
+   }
+   if (err == '-') {
+   ++*buf;
+   --*len;
+   neg = 1;
+   }
+
+   left = *len;
+   if (left > sizeof(s) - 1)
+   left = sizeof(s) - 1;
+   if (copy_from_user(s, *buf, left))
+   return -EFAULT;
+
+   s[left] = 0;
+   p = s;
+   if (*p < '0' || *p > '9')
+   return -EINVAL;
+
+   *val = simple_strtoul(p, &p, 0);
+   if (neg)
+   *val = -(*val);
+
+   left = p-s;
+   (*len) -= left;
+   (*buf) += left;
+
+   return 0;
+#undef BUFLEN
+}
+
+/* semundo_readline: read a line of /proc//semundo file
+ * Return the number of value read or an errcode
+ */
+static inline int semundo_readline(const char __user **buf, size_t *left,
+  int *id,  short *array, int array_len)
+{
+   int i, val, err;
+
+   /* Read semid */
+   err = get_next_value(buf, left, id);
+   if (err)
+   return err;
+
+   

[PATCH 2.6.24-rc8-mm1 14/15] (RFC) IPC/semaphores: prepare semundo code to work on another task than current

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

In order to modify the semundo-list of a task from procfs, we must be able to
work on any target task.
But all the existing code playing with the semundo-list, currently works
only on the 'current' task, and does not allow to specify any target task.

This patch changes all these routines to allow them to work on a specified
task, passed in parameter, instead of current.

This is mainly a preparation for the semundo_write() operation, on the
/proc/<pid>/semundo file, as provided in the next patch.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 ipc/sem.c |   90 ++
 1 file changed, 68 insertions(+), 22 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1017,8 +1017,9 @@ asmlinkage long sys_semctl (int semid, i
 }
 
 /* If the task doesn't already have a undo_list, then allocate one
- * here.  We guarantee there is only one thread using this undo list,
- * and current is THE ONE
+ * here.
+ * The target task (tsk) is current in the general case, except when
+ * accessed from the procfs (ie when writing to /proc/<pid>/semundo)
  *
  * If this allocation and assignment succeeds, but later
  * portions of this code fail, there is no need to free the sem_undo_list.
@@ -1026,22 +1027,60 @@ asmlinkage long sys_semctl (int semid, i
  * at exit time.
  *
  * This can block, so callers must hold no locks.
+ *
+ * Note: task_lock is used to synchronize 1. several possible concurrent
+ * creations and 2. the free of the undo_list (done when the task using it
+ * exits). In the second case, we check the PF_EXITING flag to not create
+ * an undo_list for a task which has exited.
+ * If there already is an undo_list for this task, there is no need
+ * to hold the task-lock to retrieve it, as the pointer cannot change
+ * afterwards.
  */
-static inline int get_undo_list(struct sem_undo_list **undo_listp)
+static inline int get_undo_list(struct task_struct *tsk,
+   struct sem_undo_list **ulp)
 {
-   struct sem_undo_list *undo_list;
+   if (tsk->sysvsem.undo_list == NULL) {
+   struct sem_undo_list *undo_list;
 
-   undo_list = current->sysvsem.undo_list;
-   if (!undo_list) {
-   undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+   /* we must alloc a new one */
+   undo_list = kmalloc(sizeof(*undo_list), GFP_KERNEL);
if (undo_list == NULL)
return -ENOMEM;
+
+   task_lock(tsk);
+
+   /* check again if there is an undo_list for this task */
+   if (tsk->sysvsem.undo_list) {
+   if (tsk != current)
+   atomic_inc(&tsk->sysvsem.undo_list->refcnt);
+   task_unlock(tsk);
+   kfree(undo_list);
+   goto out;
+   }
+
spin_lock_init(&undo_list->lock);
-   atomic_set(&undo_list->refcnt, 1);
-   undo_list->ns = get_ipc_ns(current->nsproxy->ipc_ns);
-   current->sysvsem.undo_list = undo_list;
+   /*
+* If tsk is not current (meaning that current is creating
+* a semundo_list for a target task through procfs), and if
+* it's not being exited then refcnt must be 2: the target
+* task tsk + current.
+*/
+   if (tsk == current)
+   atomic_set(&undo_list->refcnt, 1);
+   else if (!(tsk->flags & PF_EXITING))
+   atomic_set(&undo_list->refcnt, 2);
+   else {
+   task_unlock(tsk);
+   kfree(undo_list);
+   return -EINVAL;
+   }
+   undo_list->ns = get_ipc_ns(tsk->nsproxy->ipc_ns);
+   undo_list->proc_list = NULL;
+   tsk->sysvsem.undo_list = undo_list;
+   task_unlock(tsk);
}
-   *undo_listp = undo_list;
+out:
+   *ulp = tsk->sysvsem.undo_list;
return 0;
 }
 
@@ -1065,17 +1104,12 @@ static struct sem_undo *lookup_undo(stru
return un;
 }
 
-static struct sem_undo *find_undo(struct ipc_namespace *ns, int semid)
+static struct sem_undo *find_undo(struct sem_undo_list *ulp, int semid)
 {
struct sem_array *sma;
-   struct sem_undo_list *ulp;
struct sem_undo *un, *new;
+   struct ipc_namespace *ns;
int nsems;
-   int error;
-
-   error = get_undo_list(&ulp);
-   if (error)
-   return ERR_PTR(error);
 
spin_lock(&ulp->lock);
un = lookup_undo(ulp, semid);
@@ -1083,6 +1117,8 @@ static struct sem_undo *find_un

[PATCH 2.6.24-rc8-mm1 13/15] (RFC) IPC/semaphores: per semundo file in procfs

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch adds a new procfs interface to display the per-process semundo
data.

A new per-PID file is added, named "semundo".
It contains one line per semaphore IPC where there is something to undo for
this process.
Then, each line contains the semid followed by the undo values, one per
semaphore of the semaphores array.

This interface will be especially useful to allow a user to access these
data, for example for checkpointing a process.

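For illustration, reading the file of a hypothetical process (PID 1234)
using one 3-semaphore array with semid 42 could show something like this
(the column widths follow the seq_printf formats used below):

	$ cat /proc/1234/semundo
	        42      1      0     -1
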
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 fs/proc/base.c |3 +
 fs/proc/internal.h |1 
 ipc/sem.c  |  153 +
 3 files changed, 157 insertions(+)

Index: b/fs/proc/base.c
===
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2255,6 +2255,9 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io",   S_IRUGO, pid_io_accounting),
 #endif
+#ifdef CONFIG_SYSVIPC
+   REG("semundo",   S_IRUGO, semundo),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file * filp,
Index: b/fs/proc/internal.h
===
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -64,6 +64,7 @@ extern const struct file_operations proc
 extern const struct file_operations proc_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_semundo_operations;
 
 void free_proc_entry(struct proc_dir_entry *de);
 
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1435,4 +1435,157 @@ static int sysvipc_sem_proc_show(struct 
  sma->sem_otime,
  sma->sem_ctime);
 }
+
+
+/* iterator */
+static void *semundo_start(struct seq_file *m, loff_t *ppos)
+{
+   struct sem_undo_list *undo_list = m->private;
+   struct sem_undo *undo;
+   loff_t pos = *ppos;
+
+   if (!undo_list)
+   return NULL;
+
+   if (pos < 0)
+   return NULL;
+
+   /* If undo_list is not NULL, it means that we've successfully grabbed
+* a refcnt in semundo_open. That prevents the undo_list itself and the
+* undo elements from being freed.
+*/
+   spin_lock(&undo_list->lock);
+   undo = undo_list->proc_list;
+   while (undo) {
+   if ((undo->semid != -1) && !(pos--))
+   break;
+   undo = undo->proc_next;
+   }
+   spin_unlock(&undo_list->lock);
+
+   return undo;
+}
+
+static void *semundo_next(struct seq_file *m, void *v, loff_t *ppos)
+{
+   struct sem_undo *undo = v;
+   struct sem_undo_list *undo_list = m->private;
+
+   /*
+* No need to protect against undo_list being NULL, if we are here,
+* it can't be NULL.
+* Moreover, by releasing the lock between each iteration, we allow the
+* list to change between each iteration, but we only want to guarantee
+* to have access to some valid data during the _show, not to have a
+* full coherent view of the whole list.
+*/
+   spin_lock(&undo_list->lock);
+   do {
+   undo = undo->proc_next;
+   } while (undo && (undo->semid == -1));
+   ++*ppos;
+   spin_unlock(&undo_list->lock);
+
+   return undo;
+}
+
+static void semundo_stop(struct seq_file *m, void *v)
+{
+}
+
+static int semundo_show(struct seq_file *m, void *v)
+{
+   struct sem_undo_list *undo_list = m->private;
+   struct sem_undo *u = v;
+   int nsems, i;
+   struct sem_array *sma;
+
+   /*
+* This semid has been deleted, ignore it.
+* Even if we skipped all sem_undo belonging to deleted semid
+* in semundo_next(), some more deletions may have happened.
+*/
+   if (u->semid == -1)
+   return 0;
+
+   seq_printf(m, "%10d", u->semid);
+
+   sma = sem_lock(undo_list->ns, u->semid);
+   if (IS_ERR(sma))
+   goto out;
+
+   nsems = sma->sem_nsems;
+   sem_unlock(sma);
+
+   for (i = 0; i < nsems; i++)
+   seq_printf(m, " %6d", u->semadj[i]);
+
+out:
+   seq_putc(m, '\n');
+   return 0;
+}
+
+static struct seq_operations semundo_op = {
+   .start  = semundo_start,
+   .next   = semundo_next,
+   .stop   = semundo_stop,
+   .show   = semundo_show
+};
+
+/*
+ * semundo_open: open operation for /proc//semundo file
+ */
+static int semundo_open(struct inode *inode, struct file *file)
+{
+   struct task_struct *task;

[PATCH 2.6.24-rc8-mm1 12/15] (RFC) IPC/semaphores: make use of RCU to free the sem_undo_list

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Today, the sem_undo_list is freed when the last task using it exits.
There is no mechanism in place that allows safe concurrent access to
the sem_undo_list of a target task and efficiently protects against a
task exit.

That is okay for now as we don't need this.

As I would like to provide a /proc interface to access this data, I need
such a safe access, without blocking the target task if possible. 

This patch proposes to introduce the use of RCU to delay the real free of
these sem_undo_list structures. They can then be accessed in a safe manner
by any task inside a read-side critical section, this way:

struct sem_undo_list *undo_list;
int ret;
...
rcu_read_lock();
undo_list = rcu_dereference(task->sysvsem.undo_list);
if (undo_list)
ret = atomic_inc_not_zero(&undo_list->refcnt);
rcu_read_unlock();
...
if (undo_list && ret) {
/* section where undo_list can be used quietly */
...
}
    ...

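For completeness, the matching release step for such a user is then:

	if (atomic_dec_and_test(&undo_list->refcnt))
		free_semundo_list(undo_list);
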
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 include/linux/sem.h |7 +--
 ipc/sem.c   |   42 ++
 2 files changed, 31 insertions(+), 18 deletions(-)

Index: b/include/linux/sem.h
===
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -115,7 +115,8 @@ struct sem_queue {
 };
 
 /* Each task has a list of undo requests. They are executed automatically
- * when the process exits.
+ * when the last refcnt of sem_undo_list is released (ie when the process exits
+ * in the general case)
  */
 struct sem_undo {
struct sem_undo *   proc_next;  /* next entry on this process */
@@ -125,12 +126,14 @@ struct sem_undo {
 };
 
 /* sem_undo_list controls shared access to the list of sem_undo structures
- * that may be shared among all a CLONE_SYSVSEM task group.
+ * that may be shared among all a CLONE_SYSVSEM task group or with an external
+ * process which changes the list through procfs.
  */ 
 struct sem_undo_list {
atomic_trefcnt;
spinlock_t  lock;
struct sem_undo *proc_list;
+   struct ipc_namespace *ns;
 };
 
 struct sysv_sem {
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1038,6 +1038,7 @@ static inline int get_undo_list(struct s
return -ENOMEM;
spin_lock_init(&undo_list->lock);
atomic_set(&undo_list->refcnt, 1);
+   undo_list->ns = get_ipc_ns(current->nsproxy->ipc_ns);
current->sysvsem.undo_list = undo_list;
}
*undo_listp = undo_list;
@@ -1316,7 +1317,8 @@ int copy_semundo(unsigned long clone_fla
 }
 
 /*
- * add semadj values to semaphores, free undo structures.
+ * add semadj values to semaphores, free undo structures, if there are no
+ * more users.
  * undo structures are not freed when semaphore arrays are destroyed
  * so some of them may be out of date.
  * IMPLEMENTATION NOTE: There is some confusion over whether the
@@ -1326,23 +1328,17 @@ int copy_semundo(unsigned long clone_fla
  * The original implementation attempted to do this (queue and wait).
  * The current implementation does not do so. The POSIX standard
  * and SVID should be consulted to determine what behavior is mandated.
+ *
+ * Note:
+ * A concurrent task is only allowed to access and go through the list
+ * of sem_undo if it successfully grabs a refcnt.
  */
-void exit_sem(struct task_struct *tsk)
+static void free_semundo_list(struct sem_undo_list *undo_list)
 {
-   struct sem_undo_list *undo_list;
struct sem_undo *u, **up;
-   struct ipc_namespace *ns;
 
-   undo_list = tsk->sysvsem.undo_list;
-   if (!undo_list)
-   return;
-
-   if (!atomic_dec_and_test(&undo_list->refcnt))
-   return;
-
-   ns = tsk->nsproxy->ipc_ns;
-   /* There's no need to hold the semundo list lock, as current
- * is the last task exiting for this undo list.
+   /* There's no need to hold the semundo list lock, as there are
+* no more tasks or possible users for this undo list.
 */
for (up = &undo_list->proc_list; (u = *up); *up = u->proc_next, kfree(u)) {
struct sem_array *sma;
@@ -1354,7 +1350,7 @@ void exit_sem(struct task_struct *tsk)
 
if(semid == -1)
continue;
-   sma = sem_lock(ns, semid);
+   sma = sem_lock(undo_list->ns, semid);
if (IS_ERR(sma))
continue;
 
@@ -1368,7 +1364,8 @@ void exit_sem(struct task_struct *tsk)
if (u == un)
goto found;
}
- 

[PATCH 2.6.24-rc8-mm1 11/15] (RFC) IPC: new IPC_SETALL command to modify all settings

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch adds a new IPC_SETALL command to the System V IPCs set of commands,
which allows changing all the settings of an IPC.

It works exactly the same way as the IPC_SET command, except that it
additionally changes all the time and pid values.

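A minimal usage sketch (assuming a struct msqid_ds filled beforehand, for
example saved at checkpoint time; 'id' and 'saved' are illustrative names):

	struct msqid_ds saved;	/* msg_perm, msg_stime, msg_rtime, msg_ctime,
				 * msg_lspid and msg_lrpid filled in earlier */

	if (msgctl(id, IPC_SETALL, &saved) != 0)
		perror("IPC_SETALL");
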
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 include/linux/ipc.h  |1 +
 ipc/compat.c |3 +++
 ipc/msg.c|   15 +--
 ipc/sem.c|   10 +-
 ipc/shm.c|   13 -
 ipc/util.c   |7 ++-
 security/selinux/hooks.c |3 +++
 7 files changed, 47 insertions(+), 5 deletions(-)

Index: b/include/linux/ipc.h
===
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -40,6 +40,7 @@ struct ipc_perm
 #define IPC_STAT   2 /* get ipc_perm options */
 #define IPC_INFO   3 /* see ipcs */
 #define IPC_SETID  4 /* set ipc ID */
+#define IPC_SETALL 5 /* set all parameters */
 
 /*
  * Version flags for semctl, msgctl, and shmctl commands
Index: b/ipc/compat.c
===
--- a/ipc/compat.c
+++ b/ipc/compat.c
@@ -282,6 +282,7 @@ long compat_sys_semctl(int first, int se
err = -EFAULT;
break;
 
+   case IPC_SETALL:
case IPC_SET:
if (version == IPC_64) {
err = get_compat_semid64_ds(&s64, compat_ptr(pad));
@@ -431,6 +432,7 @@ long compat_sys_msgctl(int first, int se
err = sys_msgctl(first, second, uptr);
break;
 
+   case IPC_SETALL:
case IPC_SET:
if (version == IPC_64) {
err = get_compat_msqid64(&m64, uptr);
@@ -621,6 +623,7 @@ long compat_sys_shmctl(int first, int se
break;
 
 
+   case IPC_SETALL:
case IPC_SET:
if (version == IPC_64) {
err = get_compat_shmid64_ds(&s64, uptr);
Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -426,7 +426,7 @@ static int msgctl_down(struct ipc_namesp
struct msg_queue *msq;
int err;
 
-   if (cmd == IPC_SET) {
+   if (cmd == IPC_SET || cmd == IPC_SETALL) {
if (copy_msqid_from_user(&msqid64, buf, version))
return -EFAULT;
}
@@ -447,6 +447,7 @@ static int msgctl_down(struct ipc_namesp
freeque(ns, ipcp);
goto out_up;
case IPC_SET:
+   case IPC_SETALL:
if (msqid64.msg_qbytes > ns->msg_ctlmnb &&
!capable(CAP_SYS_RESOURCE)) {
err = -EPERM;
@@ -456,7 +457,14 @@ static int msgctl_down(struct ipc_namesp
msq->q_qbytes = msqid64.msg_qbytes;
 
ipc_update_perm(&msqid64.msg_perm, ipcp);
-   msq->q_ctime = get_seconds();
+   if (cmd == IPC_SETALL) {
+   msq->q_stime = msqid64.msg_stime;
+   msq->q_rtime = msqid64.msg_rtime;
+   msq->q_ctime = msqid64.msg_ctime;
+   msq->q_lspid = msqid64.msg_lspid;
+   msq->q_lrpid = msqid64.msg_lrpid;
+   } else
+   msq->q_ctime = get_seconds();
/* sleeping receivers might be excluded by
 * stricter permissions.
 */
@@ -507,6 +515,8 @@ asmlinkage long sys_msgctl(int msqid, in
return -EINVAL;
 
version = ipc_parse_version(&cmd);
+   if (version < 0)
+   return -EINVAL;
ns = current->nsproxy->ipc_ns;
 
switch (cmd) {
@@ -594,6 +604,7 @@ asmlinkage long sys_msgctl(int msqid, in
return success_return;
}
case IPC_SET:
+   case IPC_SETALL:
case IPC_RMID:
err = msgctl_down(ns, msqid, cmd, buf, version);
return err;
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -913,7 +913,7 @@ static int semctl_down(struct ipc_namesp
struct semid64_ds semid64;
struct kern_ipc_perm *ipcp;
 
-   if(cmd == IPC_SET) {
+   if (cmd == IPC_SET || cmd == IPC_SETALL) {
if (copy_semid_from_user(&semid64, arg.buf, version))
return -EFAULT;
}
@@ -936,6 +936,11 @@ static int semctl_down(struct ipc_namesp
ipc_update_perm(&semid64.sem_perm, ipcp);
sma->sem_ctime = get_seconds();
break;
+   case IPC_SETALL:
+   ipc_update_perm(&semid64.sem_perm, ipcp)

[PATCH 2.6.24-rc8-mm1 10/15] (RFC) IPC: new IPC_SETID command to modify an ID

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch adds a new IPC_SETID command to the System V IPCs set of commands,
which allows changing the ID of an existing IPC.

This command can be used through the semctl/shmctl/msgctl API, with the new
ID passed as the third argument for msgctl and shmctl (instead of a pointer)
and through the fourth argument for semctl.

To be successful, the following rules must be respected:
- the IPC exists
- the user must be allowed to change the IPC attributes regarding the IPC
  permissions.
- the new ID must satisfy the ID computation rule.
- the entry (in the kernel internal table of IPCs) corresponding to the new
  ID must be free.

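For semctl, a sketch of the call; passing the new ID as the integer member
of union semun is an assumption based on the calling convention described
above (userspace must define union semun itself, as for semctl(2)):

	union semun {
		int val;
		/* ... other members as documented in semctl(2) ... */
	} arg;

	arg.val = new_id;
	if (semctl(semid, 0, IPC_SETID, arg) != 0)
		perror("IPC_SETID");
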
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 include/linux/ipc.h  |9 +
 ipc/compat.c |3 +++
 ipc/msg.c|   27 ++-
 ipc/sem.c|   27 ++-
 ipc/shm.c|   27 ++-
 security/selinux/hooks.c |3 +++
 6 files changed, 89 insertions(+), 7 deletions(-)

Index: b/include/linux/ipc.h
===
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -35,10 +35,11 @@ struct ipc_perm
  * Control commands used with semctl, msgctl and shmctl 
  * see also specific commands in sem.h, msg.h and shm.h
  */
-#define IPC_RMID 0 /* remove resource */
-#define IPC_SET  1 /* set ipc_perm options */
-#define IPC_STAT 2 /* get ipc_perm options */
-#define IPC_INFO 3 /* see ipcs */
+#define IPC_RMID   0 /* remove resource */
+#define IPC_SET1 /* set ipc_perm options */
+#define IPC_STAT   2 /* get ipc_perm options */
+#define IPC_INFO   3 /* see ipcs */
+#define IPC_SETID  4 /* set ipc ID */
 
 /*
  * Version flags for semctl, msgctl, and shmctl commands
Index: b/ipc/compat.c
===
--- a/ipc/compat.c
+++ b/ipc/compat.c
@@ -253,6 +253,7 @@ long compat_sys_semctl(int first, int se
switch (third & (~IPC_64)) {
case IPC_INFO:
case IPC_RMID:
+   case IPC_SETID:
case SEM_INFO:
case GETVAL:
case GETPID:
@@ -425,6 +426,7 @@ long compat_sys_msgctl(int first, int se
switch (second & (~IPC_64)) {
case IPC_INFO:
case IPC_RMID:
+   case IPC_SETID:
case MSG_INFO:
err = sys_msgctl(first, second, uptr);
break;
@@ -597,6 +599,7 @@ long compat_sys_shmctl(int first, int se
 
switch (second & (~IPC_64)) {
case IPC_RMID:
+   case IPC_SETID:
case SHM_LOCK:
case SHM_UNLOCK:
err = sys_shmctl(first, second, uptr);
Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -329,7 +329,8 @@ retry:
msg_unlock(msq);
up_write(&msg_ids(ns).rw_mutex);
 
-   /* ipc_chid may return -EAGAIN in case of memory requirement */
+   /* msg_chid_nolock may return -EAGAIN if there is no more free idr
+  entry, just go and retry after refilling the idr cache */
if (err == -EAGAIN)
goto retry;
 
@@ -465,6 +466,9 @@ static int msgctl_down(struct ipc_namesp
 */
ss_wakeup(&msq->q_senders, 0);
break;
+   case IPC_SETID:
+   err = msg_chid_nolock(ns, msq, (int)(long)buf);
+   break;
default:
err = -EINVAL;
}
@@ -475,6 +479,24 @@ out_up:
return err;
 }
 
+static int msgctl_setid(struct ipc_namespace *ns, int msqid, int cmd,
+   struct msqid_ds __user *buf, int version)
+{
+   int err;
+retry:
+   err = idr_pre_get(&msg_ids(ns).ipcs_idr, GFP_KERNEL);
+   if (!err)
+   return -ENOMEM;
+
+   err = msgctl_down(ns, msqid, cmd, buf, version);
+
+   /* msgctl_down may return -EAGAIN if there is no more free idr
+  entry, just go and retry after refilling the idr cache */
+   if (err == -EAGAIN)
+   goto retry;
+   return err;
+}
+
 asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
 {
struct msg_queue *msq;
@@ -575,6 +597,9 @@ asmlinkage long sys_msgctl(int msqid, in
case IPC_RMID:
err = msgctl_down(ns, msqid, cmd, buf, version);
return err;
+   case IPC_SETID:
+   err = msgctl_setid(ns, msqid, cmd, buf, version);
+   return err;
default:
return  -EINVAL;
}
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -608,7 +608,8 @@ retry:
sem_unlock(sma);
up_write(&sem_ids(ns).rw_mutex);
 
-   /* ipc_chid may return -EAGAIN in case of memory requirement */

[PATCH 2.6.24-rc8-mm1 09/15] (RFC) IPC: new kernel API to change an ID

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch provides three new APIs to change the ID of an existing
System V IPC.

These APIs are:
long msg_chid(struct ipc_namespace *ns, int id, int newid);
long sem_chid(struct ipc_namespace *ns, int id, int newid);
long shm_chid(struct ipc_namespace *ns, int id, int newid);

They return 0 or an error code in case of failure.

They may be useful for setting a specific ID for an IPC when preparing
a restart operation.

To be successful, the following rules must be respected:
- the IPC exists (of course...)
- the new ID must satisfy the ID computation rule.
- the entry in the idr corresponding to the new ID must be free.

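For reference, the "ID computation rule" mentioned above is the usual one
from ipc/util.h (a sketch; SEQ_MULTIPLIER is IPCMNI):

	/* id = seq * SEQ_MULTIPLIER + index, so for a candidate newid: */
	new_index = newid % SEQ_MULTIPLIER;	/* idr slot, must be free */
	new_seq   = newid / SEQ_MULTIPLIER;	/* becomes the new seq number */
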
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 include/linux/msg.h |2 ++
 include/linux/sem.h |2 ++
 include/linux/shm.h |3 +++
 ipc/msg.c   |   45 +
 ipc/sem.c   |   51 +++
 ipc/shm.c   |   45 +
 ipc/util.c  |   48 
 ipc/util.h  |1 +
 8 files changed, 197 insertions(+)

Index: b/include/linux/msg.h
===
--- a/include/linux/msg.h
+++ b/include/linux/msg.h
@@ -63,6 +63,7 @@ struct msginfo {
 
 #ifdef __KERNEL__
 #include 
+#include 
 
 /* one msg_msg structure for each message */
 struct msg_msg {
@@ -96,6 +97,7 @@ extern long do_msgsnd(int msqid, long mt
size_t msgsz, int msgflg);
 extern long do_msgrcv(int msqid, long *pmtype, void __user *mtext,
size_t msgsz, long msgtyp, int msgflg);
+long msg_chid(struct ipc_namespace *ns, int id, int newid);
 
 #endif /* __KERNEL__ */
 
Index: b/include/linux/sem.h
===
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -138,9 +138,11 @@ struct sysv_sem {
 };
 
 #ifdef CONFIG_SYSVIPC
+#include 
 
 extern int copy_semundo(unsigned long clone_flags, struct task_struct *tsk);
 extern void exit_sem(struct task_struct *tsk);
+long sem_chid(struct ipc_namespace *ns, int id, int newid);
 
 #else
 static inline int copy_semundo(unsigned long clone_flags, struct task_struct *tsk)
Index: b/include/linux/shm.h
===
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -104,8 +104,11 @@ struct shmid_kernel /* private to the ke
 #define SHM_NORESERVE   01  /* don't check for reservations */
 
 #ifdef CONFIG_SYSVIPC
+#include 
+
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr);
 extern int is_file_shm_hugepages(struct file *file);
+long shm_chid(struct ipc_namespace *ns, int id, int newid);
 #else
 static inline long do_shmat(int shmid, char __user *shmaddr,
int shmflg, unsigned long *addr)
Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -291,6 +291,51 @@ asmlinkage long sys_msgget(key_t key, in
return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
 }
 
+/* must be called with mutex and msq locks held */
+static long msg_chid_nolock(struct ipc_namespace *ns, struct msg_queue *msq,
+   int newid)
+{
+   long err;
+
+   err = ipc_chid(&msg_ids(ns), msq->q_perm.id, newid);
+   if (!err)
+   msq->q_ctime = get_seconds();
+
+   return err;
+}
+
+/* API to use for changing an id from kernel space, not from the syscall, as
+   there is no permission check done here */
+long msg_chid(struct ipc_namespace *ns, int id, int newid)
+{
+   long err;
+   struct msg_queue *msq;
+
+retry:
+   err = idr_pre_get(&msg_ids(ns).ipcs_idr, GFP_KERNEL);
+   if (!err)
+   return -ENOMEM;
+
+   down_write(&msg_ids(ns).rw_mutex);
+   msq = msg_lock_check(ns, id);
+
+   if (IS_ERR(msq)) {
+   up_write(&msg_ids(ns).rw_mutex);
+   return PTR_ERR(msq);
+   }
+
+   err = msg_chid_nolock(ns, msq, newid);
+
+   msg_unlock(msq);
+   up_write(&msg_ids(ns).rw_mutex);
+
+   /* ipc_chid may return -EAGAIN in case of memory requirement */
+   if (err == -EAGAIN)
+   goto retry;
+
+   return err;
+}
+
 static inline unsigned long
 copy_msqid_to_user(void __user *buf, struct msqid64_ds *in, int version)
 {
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -564,6 +564,57 @@ static void freeary(struct ipc_namespace
ipc_rcu_putref(sma);
 }
 
+/* must be called with rw_mutex and sma locks held */
+static long sem_chid_nolock(struct ipc_namespace *ns, struct sem_array *sma,
+   int newid)

[PATCH 2.6.24-rc8-mm1 08/15] IPC: consolidate all xxxctl_down() functions

2008-01-29 Thread pierre . peiffer
semctl_down(), msgctl_down() and shmctl_down() are used to handle the same
set of commands for each kind of IPC. They all start to do the same job (they
retrieve the ipc and do some permission checks) before handling the commands
on their own.

This patch proposes to consolidate this by moving these same pieces of code
into one common function called ipcctl_pre_down().
It simplifies these xxxctl_down() functions a little and improves
maintainability.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 ipc/msg.c  |   48 +---
 ipc/sem.c  |   42 --
 ipc/shm.c  |   42 --
 ipc/util.c |   51 +++
 ipc/util.h |2 ++
 5 files changed, 66 insertions(+), 119 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -142,21 +142,6 @@ void __init sem_init (void)
 }
 
 /*
- * This routine is called in the paths where the rw_mutex is held to protect
- * access to the idr tree.
- */
-static inline struct sem_array *sem_lock_check_down(struct ipc_namespace *ns,
-   int id)
-{
-   struct kern_ipc_perm *ipcp = ipc_lock_check_down(&sem_ids(ns), id);
-
-   if (IS_ERR(ipcp))
-   return (struct sem_array *)ipcp;
-
-   return container_of(ipcp, struct sem_array, sem_perm);
-}
-
-/*
  * sem_lock_(check_) routines are called in the paths where the rw_mutex
  * is not held.
  */
@@ -880,31 +865,12 @@ static int semctl_down(struct ipc_namesp
if (copy_semid_from_user(&semid64, arg.buf, version))
return -EFAULT;
}
-   down_write(&sem_ids(ns).rw_mutex);
-   sma = sem_lock_check_down(ns, semid);
-   if (IS_ERR(sma)) {
-   err = PTR_ERR(sma);
-   goto out_up;
-   }
-
-   ipcp = &sma->sem_perm;
 
-   err = audit_ipc_obj(ipcp);
-   if (err)
-   goto out_unlock;
+   ipcp = ipcctl_pre_down(&sem_ids(ns), semid, cmd, &semid64.sem_perm, 0);
+   if (IS_ERR(ipcp))
+   return PTR_ERR(ipcp);
 
-   if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(0, semid64.sem_perm.uid,
-semid64.sem_perm.gid,
-semid64.sem_perm.mode);
-   if (err)
-   goto out_unlock;
-   }
-   if (current->euid != ipcp->cuid && 
-   current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) {
-   err=-EPERM;
-   goto out_unlock;
-   }
+   sma = container_of(ipcp, struct sem_array, sem_perm);
 
err = security_sem_semctl(sma, cmd);
if (err)
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -774,6 +774,57 @@ void ipc_update_perm(struct ipc64_perm *
| (in->mode & S_IRWXUGO);
 }
 
+/**
+ * ipcctl_pre_down - retrieve an ipc and check permissions for some IPC_XXX cmd
+ * @ids:  the table of ids where to look for the ipc
+ * @id:   the id of the ipc to retrieve
+ * @cmd:  the cmd to check
+ * @perm: the permission to set
+ * @extra_perm: one extra permission parameter used by msq
+ *
+ * This function does some common audit and permissions check for some IPC_XXX
+ * cmd and is called from semctl_down, shmctl_down and msgctl_down.
+ * It must be called without any lock held and
+ *  - retrieves the ipc with the given id in the given table.
+ *  - performs some audit and permission check, depending on the given cmd
+ *  - returns the ipc with both ipc and rw_mutex locks held in case of success
+ *or an err-code without any lock held otherwise.
+ */
+struct kern_ipc_perm *ipcctl_pre_down(struct ipc_ids *ids, int id, int cmd,
+ struct ipc64_perm *perm, int extra_perm)
+{
+   struct kern_ipc_perm *ipcp;
+   int err;
+
+   down_write(&ids->rw_mutex);
+   ipcp = ipc_lock_check_down(ids, id);
+   if (IS_ERR(ipcp)) {
+   err = PTR_ERR(ipcp);
+   goto out_up;
+   }
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(extra_perm, perm->uid,
+perm->gid, perm->mode);
+   if (err)
+   goto out_unlock;
+   }
+   if (current->euid == ipcp->cuid ||
+   current->euid == ipcp->uid || capable(CAP_SYS_ADMIN))
+   return ipcp;
+
+   err = -EPERM;
+out_unlock:
+   ipc_unlock(ipcp);
+out_up:
+   up_write(&ids->rw_mutex);

[PATCH 2.6.24-rc8-mm1 07/15] IPC: introduce ipc_update_perm()

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

The IPC_SET command performs the same permission setting for all IPCs.
This patch introduces a common ipc_update_perm() function to update these
permissions and makes use of it for all IPCs.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c  |5 +
 ipc/sem.c  |5 +
 ipc/shm.c  |5 +
 ipc/util.c |   13 +
 ipc/util.h |1 +
 5 files changed, 17 insertions(+), 12 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -447,10 +447,7 @@ static int msgctl_down(struct ipc_namesp
 
msq->q_qbytes = msqid64.msg_qbytes;
 
-   ipcp->uid = msqid64.msg_perm.uid;
-   ipcp->gid = msqid64.msg_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
-(S_IRWXUGO & msqid64.msg_perm.mode);
+   ipc_update_perm(&msqid64.msg_perm, ipcp);
msq->q_ctime = get_seconds();
/* sleeping receivers might be excluded by
 * stricter permissions.
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -915,10 +915,7 @@ static int semctl_down(struct ipc_namesp
freeary(ns, ipcp);
goto out_up;
case IPC_SET:
-   ipcp->uid = semid64.sem_perm.uid;
-   ipcp->gid = semid64.sem_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
-   | (semid64.sem_perm.mode & S_IRWXUGO);
+   ipc_update_perm(&semid64.sem_perm, ipcp);
sma->sem_ctime = get_seconds();
break;
default:
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -665,10 +665,7 @@ static int shmctl_down(struct ipc_namesp
do_shm_rmid(ns, ipcp);
goto out_up;
case IPC_SET:
-   ipcp->uid = shmid64.shm_perm.uid;
-   ipcp->gid = shmid64.shm_perm.gid;
-   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
-   | (shmid64.shm_perm.mode & S_IRWXUGO);
+   ipc_update_perm(&shmid64.shm_perm, ipcp);
shp->shm_ctim = get_seconds();
break;
default:
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -761,6 +761,19 @@ int ipcget(struct ipc_namespace *ns, str
return ipcget_public(ns, ids, ops, params);
 }
 
+/**
+ * ipc_update_perm - update the permissions of an IPC.
+ * @in:  the permission given as input.
+ * @out: the permission of the ipc to set.
+ */
+void ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out)
+{
+   out->uid = in->uid;
+   out->gid = in->gid;
+   out->mode = (out->mode & ~S_IRWXUGO)
+   | (in->mode & S_IRWXUGO);
+}
+
 #ifdef __ARCH_WANT_IPC_PARSE_VERSION
 
 
Index: b/ipc/util.h
===
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -112,6 +112,7 @@ struct kern_ipc_perm *ipc_lock(struct ip
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
+void ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
 
#if defined(__ia64__) || defined(__x86_64__) || defined(__hppa__) || defined(__XTENSA__)
   /* On IA-64, we always use the "64-bit version" of the IPC structures.  */ 

-- 
Pierre Peiffer


[PATCH 2.6.24-rc8-mm1 06/15] IPC: get rid of the use *_setbuf structure.

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

All IPCs make use of an intermediate *_setbuf structure to handle the
IPC_SET command. This is not really needed and, moreover, it complicates
the code a little bit.

This patch gets rid of it and directly uses the semid64_ds/
msqid64_ds/shmid64_ds structures.

In addition to removing one structure declaration, it also simplifies
and improves the common 64-bit path a little bit.
Moreover, this will simplify the code handling the IPC_SETALL
command provided in the next patch.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c |   51 ++-
 ipc/sem.c |   40 ++--
 ipc/shm.c |   41 ++---
 3 files changed, 46 insertions(+), 86 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -351,31 +351,14 @@ copy_msqid_to_user(void __user *buf, str
}
 }
 
-struct msq_setbuf {
-   unsigned long   qbytes;
-   uid_t   uid;
-   gid_t   gid;
-   mode_t  mode;
-};
-
 static inline unsigned long
-copy_msqid_from_user(struct msq_setbuf *out, void __user *buf, int version)
+copy_msqid_from_user(struct msqid64_ds *out, void __user *buf, int version)
 {
switch(version) {
case IPC_64:
-   {
-   struct msqid64_ds tbuf;
-
-   if (copy_from_user(&tbuf, buf, sizeof(tbuf)))
+   if (copy_from_user(out, buf, sizeof(*out)))
return -EFAULT;
-
-   out->qbytes = tbuf.msg_qbytes;
-   out->uid= tbuf.msg_perm.uid;
-   out->gid= tbuf.msg_perm.gid;
-   out->mode   = tbuf.msg_perm.mode;
-
return 0;
-   }
case IPC_OLD:
{
struct msqid_ds tbuf_old;
@@ -383,14 +366,14 @@ copy_msqid_from_user(struct msq_setbuf *
if (copy_from_user(&tbuf_old, buf, sizeof(tbuf_old)))
return -EFAULT;
 
-   out->uid= tbuf_old.msg_perm.uid;
-   out->gid= tbuf_old.msg_perm.gid;
-   out->mode   = tbuf_old.msg_perm.mode;
+   out->msg_perm.uid   = tbuf_old.msg_perm.uid;
+   out->msg_perm.gid   = tbuf_old.msg_perm.gid;
+   out->msg_perm.mode  = tbuf_old.msg_perm.mode;
 
if (tbuf_old.msg_qbytes == 0)
-   out->qbytes = tbuf_old.msg_lqbytes;
+   out->msg_qbytes = tbuf_old.msg_lqbytes;
else
-   out->qbytes = tbuf_old.msg_qbytes;
+   out->msg_qbytes = tbuf_old.msg_qbytes;
 
return 0;
}
@@ -408,12 +391,12 @@ static int msgctl_down(struct ipc_namesp
   struct msqid_ds __user *buf, int version)
 {
struct kern_ipc_perm *ipcp;
-   struct msq_setbuf setbuf;
+   struct msqid64_ds msqid64;
struct msg_queue *msq;
int err;
 
if (cmd == IPC_SET) {
-   if (copy_msqid_from_user(&setbuf, buf, version))
+   if (copy_msqid_from_user(&msqid64, buf, version))
return -EFAULT;
}
 
@@ -431,8 +414,10 @@ static int msgctl_down(struct ipc_namesp
goto out_unlock;
 
if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
-setbuf.mode);
+   err = audit_ipc_set_perm(msqid64.msg_qbytes,
+msqid64.msg_perm.uid,
+msqid64.msg_perm.gid,
+msqid64.msg_perm.mode);
if (err)
goto out_unlock;
}
@@ -454,18 +439,18 @@ static int msgctl_down(struct ipc_namesp
freeque(ns, ipcp);
goto out_up;
case IPC_SET:
-   if (setbuf.qbytes > ns->msg_ctlmnb &&
+   if (msqid64.msg_qbytes > ns->msg_ctlmnb &&
!capable(CAP_SYS_RESOURCE)) {
err = -EPERM;
goto out_unlock;
}
 
-   msq->q_qbytes = setbuf.qbytes;
+   msq->q_qbytes = msqid64.msg_qbytes;
 
-   ipcp->uid = setbuf.uid;
-   ipcp->gid = setbuf.gid;
+   ipcp->uid = msqid64.msg_perm.uid;
+   ipcp->gid = msqid64.msg_perm.gid;
ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
-(S_IRWXUGO & setbuf.mode);
+(S_IRWXUGO & msqid64.msg_perm.mode);

[PATCH 2.6.24-rc8-mm1 04/15] IPC/semaphores: move the rwmutex handling inside semctl_down

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

semctl_down is called with the rwmutex (the one which protects the
list of ipcs) taken in write mode.
This patch moves the write-lock of this rwmutex inside semctl_down.
This has the advantages of slightly reducing the window during which
this rwmutex is held, of clarifying sys_semctl, and of making the
behaviour coherent with [shm|msg]ctl_down.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/sem.c |   24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -877,6 +877,11 @@ static inline unsigned long copy_semid_f
}
 }
 
+/*
+ * This function handles some semctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
 static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
int cmd, int version, union semun arg)
 {
@@ -889,9 +894,12 @@ static int semctl_down(struct ipc_namesp
if(copy_semid_from_user (&setbuf, arg.buf, version))
return -EFAULT;
}
+   down_write(&sem_ids(ns).rw_mutex);
sma = sem_lock_check_down(ns, semid);
-   if (IS_ERR(sma))
-   return PTR_ERR(sma);
+   if (IS_ERR(sma)) {
+   err = PTR_ERR(sma);
+   goto out_up;
+   }
 
ipcp = &sma->sem_perm;
 
@@ -917,26 +925,22 @@ static int semctl_down(struct ipc_namesp
switch(cmd){
case IPC_RMID:
freeary(ns, ipcp);
-   err = 0;
-   break;
+   goto out_up;
case IPC_SET:
ipcp->uid = setbuf.uid;
ipcp->gid = setbuf.gid;
ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
| (setbuf.mode & S_IRWXUGO);
sma->sem_ctime = get_seconds();
-   sem_unlock(sma);
-   err = 0;
break;
default:
-   sem_unlock(sma);
err = -EINVAL;
-   break;
}
-   return err;
 
 out_unlock:
sem_unlock(sma);
+out_up:
+   up_write(&sem_ids(ns).rw_mutex);
return err;
 }
 
@@ -970,9 +974,7 @@ asmlinkage long sys_semctl (int semid, i
return err;
case IPC_RMID:
case IPC_SET:
-   down_write(&sem_ids(ns).rw_mutex);
err = semctl_down(ns,semid,semnum,cmd,version,arg);
-   up_write(&sem_ids(ns).rw_mutex);
    return err;
default:
return -EINVAL;

-- 
Pierre Peiffer


[PATCH 2.6.24-rc8-mm1 05/15] IPC/semaphores: remove one unused parameter from semctl_down()

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

semctl_down() takes one unused parameter: semnum.
This patch proposes to get rid of it.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 ipc/sem.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -882,8 +882,8 @@ static inline unsigned long copy_semid_f
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
-   int cmd, int version, union semun arg)
+static int semctl_down(struct ipc_namespace *ns, int semid,
+  int cmd, int version, union semun arg)
 {
struct sem_array *sma;
int err;
@@ -974,7 +974,7 @@ asmlinkage long sys_semctl (int semid, i
return err;
case IPC_RMID:
case IPC_SET:
-   err = semctl_down(ns,semid,semnum,cmd,version,arg);
+   err = semctl_down(ns, semid, cmd, version, arg);
return err;
default:
return -EINVAL;

-- 
Pierre Peiffer


[PATCH 2.6.24-rc8-mm1 03/15] IPC/message queues: introduce msgctl_down

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Currently, sys_msgctl is not easy to read.
This patch tries to improve that by introducing the msgctl_down function
to handle all commands requiring the rwmutex to be taken in write mode
(ie IPC_SET and IPC_RMID for now). It is the equivalent function of
semctl_down for message queues.

This greatly improves the readability of sys_msgctl and also harmonizes
the way these commands are handled among all IPCs.


Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/msg.c |  162 ++
 1 file changed, 89 insertions(+), 73 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -399,10 +399,95 @@ copy_msqid_from_user(struct msq_setbuf *
}
 }
 
-asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
+/*
+ * This function handles some msgctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
+static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
+  struct msqid_ds __user *buf, int version)
 {
struct kern_ipc_perm *ipcp;
-   struct msq_setbuf uninitialized_var(setbuf);
+   struct msq_setbuf setbuf;
+   struct msg_queue *msq;
+   int err;
+
+   if (cmd == IPC_SET) {
+   if (copy_msqid_from_user(&setbuf, buf, version))
+   return -EFAULT;
+   }
+
+   down_write(&msg_ids(ns).rw_mutex);
+   msq = msg_lock_check_down(ns, msqid);
+   if (IS_ERR(msq)) {
+   err = PTR_ERR(msq);
+   goto out_up;
+   }
+
+   ipcp = &msq->q_perm;
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
+setbuf.mode);
+   if (err)
+   goto out_unlock;
+   }
+
+   if (current->euid != ipcp->cuid &&
+   current->euid != ipcp->uid &&
+   !capable(CAP_SYS_ADMIN)) {
+   /* We _could_ check for CAP_CHOWN above, but we don't */
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   err = security_msg_queue_msgctl(msq, cmd);
+   if (err)
+   goto out_unlock;
+
+   switch (cmd) {
+   case IPC_RMID:
+   freeque(ns, ipcp);
+   goto out_up;
+   case IPC_SET:
+   if (setbuf.qbytes > ns->msg_ctlmnb &&
+   !capable(CAP_SYS_RESOURCE)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   msq->q_qbytes = setbuf.qbytes;
+
+   ipcp->uid = setbuf.uid;
+   ipcp->gid = setbuf.gid;
+   ipcp->mode = (ipcp->mode & ~S_IRWXUGO) |
+(S_IRWXUGO & setbuf.mode);
+   msq->q_ctime = get_seconds();
+   /* sleeping receivers might be excluded by
+* stricter permissions.
+*/
+   expunge_all(msq, -EAGAIN);
+   /* sleeping senders might be able to send
+* due to a larger queue size.
+*/
+   ss_wakeup(&msq->q_senders, 0);
+   break;
+   default:
+   err = -EINVAL;
+   }
+out_unlock:
+   msg_unlock(msq);
+out_up:
+   up_write(&msg_ids(ns).rw_mutex);
+   return err;
+}
+
+asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
+{
struct msg_queue *msq;
int err, version;
struct ipc_namespace *ns;
@@ -498,82 +583,13 @@ asmlinkage long sys_msgctl(int msqid, in
return success_return;
}
case IPC_SET:
-   if (!buf)
-   return -EFAULT;
-   if (copy_msqid_from_user(&setbuf, buf, version))
-   return -EFAULT;
-   break;
case IPC_RMID:
-   break;
+   err = msgctl_down(ns, msqid, cmd, buf, version);
+   return err;
default:
return  -EINVAL;
}
 
-   down_write(&msg_ids(ns).rw_mutex);
-   msq = msg_lock_check_down(ns, msqid);
-   if (IS_ERR(msq)) {
-   err = PTR_ERR(msq);
-   goto out_up;
-   }
-
-   ipcp = &msq->q_perm;
-
-   err = audit_ipc_obj(ipcp);
-   if (err)
-   goto out_unlock_up;
-   if (cmd == IPC_SET) {
-   err = audit_ipc_set_perm(setbuf.qbytes, setbuf.uid, setbuf.gid,
-setbuf.mode);

[PATCH 2.6.24-rc8-mm1 02/15] IPC/shared memory: introduce shmctl_down

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Currently, the way the different commands are handled in sys_shmctl
introduces some duplicated code.
This patch introduces the shmctl_down function to handle all the commands
requiring the rwmutex to be taken in write mode (ie IPC_SET and IPC_RMID
for now). It is the equivalent function of semctl_down for shared
memory.

This removes some duplicated code for handling both of these commands
and harmonizes the way they are handled among all IPCs.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/shm.c |  160 +++---
 1 file changed, 72 insertions(+), 88 deletions(-)

Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -625,10 +625,78 @@ static void shm_get_stat(struct ipc_name
}
 }
 
-asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
+/*
+ * This function handles some shmctl commands which require the rw_mutex
+ * to be held in write mode.
+ * NOTE: no locks must be held, the rw_mutex is taken inside this function.
+ */
+static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+  struct shmid_ds __user *buf, int version)
 {
+   struct kern_ipc_perm *ipcp;
struct shm_setbuf setbuf;
struct shmid_kernel *shp;
+   int err;
+
+   if (cmd == IPC_SET) {
+   if (copy_shmid_from_user(&setbuf, buf, version))
+   return -EFAULT;
+   }
+
+   down_write(&shm_ids(ns).rw_mutex);
+   shp = shm_lock_check_down(ns, shmid);
+   if (IS_ERR(shp)) {
+   err = PTR_ERR(shp);
+   goto out_up;
+   }
+
+   ipcp = &shp->shm_perm;
+
+   err = audit_ipc_obj(ipcp);
+   if (err)
+   goto out_unlock;
+
+   if (cmd == IPC_SET) {
+   err = audit_ipc_set_perm(0, setbuf.uid,
+setbuf.gid, setbuf.mode);
+   if (err)
+   goto out_unlock;
+   }
+
+   if (current->euid != ipcp->uid &&
+   current->euid != ipcp->cuid &&
+   !capable(CAP_SYS_ADMIN)) {
+   err = -EPERM;
+   goto out_unlock;
+   }
+
+   err = security_shm_shmctl(shp, cmd);
+   if (err)
+   goto out_unlock;
+   switch (cmd) {
+   case IPC_RMID:
+   do_shm_rmid(ns, ipcp);
+   goto out_up;
+   case IPC_SET:
+   ipcp->uid = setbuf.uid;
+   ipcp->gid = setbuf.gid;
+   ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
+   | (setbuf.mode & S_IRWXUGO);
+   shp->shm_ctim = get_seconds();
+   break;
+   default:
+   err = -EINVAL;
+   }
+out_unlock:
+   shm_unlock(shp);
+out_up:
+   up_write(&shm_ids(ns).rw_mutex);
+   return err;
+}
+
+asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf)
+{
+   struct shmid_kernel *shp;
int err, version;
struct ipc_namespace *ns;
 
@@ -784,97 +852,13 @@ asmlinkage long sys_shmctl (int shmid, i
goto out;
}
case IPC_RMID:
-   {
-   /*
-*  We cannot simply remove the file. The SVID states
-*  that the block remains until the last person
-*  detaches from it, then is deleted. A shmat() on
-*  an RMID segment is legal in older Linux and if 
-*  we change it apps break...
-*
-*  Instead we set a destroyed flag, and then blow
-*  the name away when the usage hits zero.
-*/
-   down_write(&shm_ids(ns).rw_mutex);
-   shp = shm_lock_check_down(ns, shmid);
-   if (IS_ERR(shp)) {
-   err = PTR_ERR(shp);
-   goto out_up;
-   }
-
-   err = audit_ipc_obj(&(shp->shm_perm));
-   if (err)
-   goto out_unlock_up;
-
-   if (current->euid != shp->shm_perm.uid &&
-   current->euid != shp->shm_perm.cuid && 
-   !capable(CAP_SYS_ADMIN)) {
-   err=-EPERM;
-   goto out_unlock_up;
-   }
-
-   err = security_shm_shmctl(shp, cmd);
-   if (err)
-   goto out_unlock_up;
-
-   do_shm_rmid(ns, &shp->shm_perm);
-   up_write(&shm_ids(ns).rw_mutex);
-   goto out;
-   }
-
case IPC_SET:
-   {
-   if (!buf) {
-   err = -EFAULT;
-

[PATCH 2.6.24-rc8-mm1 01/15] IPC/semaphores: code factorisation

2008-01-29 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Trivial patch which adds some small locking functions and uses them to
factorize parts of the code and make it cleaner.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 ipc/sem.c |   61 +++--
 1 file changed, 31 insertions(+), 30 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -181,6 +181,25 @@ static inline struct sem_array *sem_lock
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline void sem_lock_and_putref(struct sem_array *sma)
+{
+   ipc_lock_by_ptr(&sma->sem_perm);
+   ipc_rcu_putref(sma);
+}
+
+static inline void sem_getref_and_unlock(struct sem_array *sma)
+{
+   ipc_rcu_getref(sma);
+   ipc_unlock(&(sma)->sem_perm);
+}
+
+static inline void sem_putref(struct sem_array *sma)
+{
+   ipc_lock_by_ptr(&sma->sem_perm);
+   ipc_rcu_putref(sma);
+   ipc_unlock(&(sma)->sem_perm);
+}
+
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
 {
ipc_rmid(&sem_ids(ns), &s->sem_perm);
@@ -700,19 +719,15 @@ static int semctl_main(struct ipc_namesp
int i;
 
if(nsems > SEMMSL_FAST) {
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return -ENOMEM;
}
 
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
+   sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma);
err = -EIDRM;
@@ -733,38 +748,30 @@ static int semctl_main(struct ipc_namesp
int i;
struct sem_undo *un;
 
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
if(nsems > SEMMSL_FAST) {
sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return -ENOMEM;
}
}
 
if (copy_from_user (sem_io, arg.array, nsems*sizeof(ushort))) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
err = -EFAULT;
goto out_free;
}
 
for (i = 0; i < nsems; i++) {
if (sem_io[i] > SEMVMX) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
err = -ERANGE;
goto out_free;
}
}
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
+   sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma);
err = -EIDRM;
@@ -1044,14 +1051,11 @@ static struct sem_undo *find_undo(struct
return ERR_PTR(PTR_ERR(sma));
 
nsems = sma->sem_nsems;
-   ipc_rcu_getref(sma);
-   sem_unlock(sma);
+   sem_getref_and_unlock(sma);
 
new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, 
GFP_KERNEL);
if (!new) {
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_unlock(sma);
+   sem_putref(sma);
return ERR_PTR(-ENOMEM);
}
new->semadj = (short *) &new[1];
@@ -1062,13 +1066,10 @@ static struct sem_undo *find_undo(struct
if (un) {
spin_unlock(&ulp->lock);
kfree(new);
-   ipc_lock_by_ptr(&sma->sem_perm);
-   ipc_rcu_putref(sma);
-   sem_u

[PATCH 2.6.24-rc8-mm1 00/15] IPC: code rewrite + new functionalities

2008-01-29 Thread pierre . peiffer
Hi,

Here is a patchset about the IPC, which proposes to consolidate some
parts of the existing code and to add some functionality.

* Patches 1 to 8 don't change the existing behavior, but propose to
rewrite some parts of the existing code. In fact, the three kinds of IPC
(semaphores, message queues and shared memory) have some common commands
(IPC_SET, IPC_RMID, etc...) but they are mainly handled in three
different ways. These patches propose to consolidate this by handling
these commands the same way and trying to use, as much as possible, some
common code. This should increase the readability and maintainability of
the code, probably making these patches good candidates for the -mm tree,
I think.

* Patches 9 to 15 propose to add some functionalities, and thus are
submitted here as an RFC, on both their interest and their
implementation. These functionalities are:
- Two new control-commands:
. IPC_SETID: to change an IPC's id.
. IPC_SETALL: behaves as IPC_SET, except that it also sets all time
  and pid values.
- add a /proc/<pid>/semundo file to read and write the undo values of
some semaphores for a given process.
(A hypothetical usage sketch of the new commands follows below.)
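
As a purely hypothetical usage sketch (IPC_SETID is only proposed in this
series; the command name and the way the new id is passed are
assumptions, not an existing API), a restart tool could renumber a
recreated semaphore set like this:

/* Hypothetical sketch: IPC_SETID does not exist in mainline; its value
 * and calling convention are assumptions made for illustration. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static int restore_sem_id(int tmp_id, int saved_id)
{
	/* Ask the kernel to give the freshly recreated set the id it
	 * had before the checkpoint. */
	if (semctl(tmp_id, 0, IPC_SETID, saved_id) == -1)
		return -1;
	return 0;
}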

As the namespaces and the "containers" are being integrated in the
kernel, these functionalities may be a first step towards implementing
the checkpoint/restart of an application: indeed, the existing API does
not allow an ID to be specified or changed when creating an IPC (which is
needed when restarting an application), and the time/pid values of each
IPC are altered as well. Maybe someone will find another use for this?

So again, comments are welcome.

Thanks.

-- 
Pierre
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm] new warning in ipc/msg.c

2008-01-10 Thread Pierre Peiffer
Andrew Morton wrote:

> Doing this in a piecemeal through-a-pinhole fashion won't work very well
> and is a bit risky.

Yes, I agree, that's also my feeling.

-- 
Pierre Peiffer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-mm] new warning in ipc/msg.c

2008-01-09 Thread Pierre Peiffer
Hi,

This very small patch:
ipc-convert-handmade-min-to-min.patch
introduces a new warning when compiling the -mm kernel:

.../linux-2.6.24-rc6-mm1/ipc/msg.c: In function `do_msgrcv':
.../linux-2.6.24-rc6-mm1/ipc/msg.c:939: warning: comparison of distinct pointer
types lacks a cast

I don't know if doing the following in include/linux/msg.h

struct msg_msg {
struct list_head m_list;
long  m_type;
-   int m_ts;   /* message text size */
+   size_t m_ts;   /* message text size */
struct msg_msgseg* next;
void *security;
/* the actual message follows immediately */
};

is acceptable?

Otherwise, either a cast can be added or this patch can be dropped...
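
For context, the warning comes from the type check built into the
kernel's min() macro, which compares the addresses of two temporaries;
roughly (quoted from memory of the 2.6-era include/linux/kernel.h, so
treat the exact spelling as an assumption):

#define min(x, y) ({				\
	typeof(x) _x = (x);			\
	typeof(y) _y = (y);			\
	(void) (&_x == &_y);	/* warns when x and y differ in type */ \
	_x < _y ? _x : _y; })

So a min() of a size_t length against the int m_ts compares an int * with
a size_t *, hence the warning; widening m_ts to size_t (or casting at the
call site) silences it.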

Thanks,

-- 
Pierre Peiffer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc5-mm1

2007-12-13 Thread Pierre Peiffer
-do-not-stop-start-devices-in-suspend-resume-path.patch
> 
>  PNP fix
> 
> -pnp-request-ioport-and-iomem-resources-used-by-active-devices.patch
> 
>  Dropped for now.
> 
> +ext-fix-comment-for-nonexistent-variable.patch
> +ext-use-ext_get_group_desc.patch
> +ext-remove-unused-argument-for-ext_find_goal.patch
> +ext-cleanup-ext_bg_num_gdb.patch
> 
>  ext2/3/4 cleanups
> 
> +per-zone-and-reclaim-enhancements-for-memory-controller-take-3-modifies-vmscanc-for-isolate-globa-cgroup-lru-activity-fix-accounting-in-vmscanc-for-memory-controller.patch
> +update-documentation-controller-memorytxt.patch
> 
>  memory controller updates
> 
> +drivers-dma-iop-admac-use-list_head-instead-of-list_head_init.patch
> 
>  DMA driver cleanup
> 
> +proc-seqfile-convert-proc_pid_status-to-properly-handle-pid-namespaces-fix-2.patch
> +proc-seqfile-convert-proc_pid_status-to-properly-handle-pid-namespaces-fix-3.patch
> 
>  Fix
>  proc-seqfile-convert-proc_pid_status-to-properly-handle-pid-namespaces.patch
>  even more
> 
> +fix-group-stop-with-exit-race.patch
> +sys_setsid-remove-now-unneeded-session-=-1-check.patch
> +move-the-related-code-from-exit_notify-to-exit_signals.patch
> +pid-sys_wait-fixes-v2.patch
> +pid-sys_wait-fixes-v2-checkpatch-fixes.patch
> +pid-extend-fix-pid_vnr.patch
> +sys_getsid-dont-use-nsproxy-directly.patch
> +pid-fix-mips-irix-emulation-pid-usage.patch
> +pid-fix-solaris_procids.patch
> +uglify-kill_pid_info-to-fix-kill-vs-exec-race.patch
> +uglify-while_each_pid_task-to-make-sure-we-dont-count-the-execing-pricess-twice.patch
> +itimer_real-convert-to-use-struct-pid.patch
> 
>  Core kernel updates
> 
> +rd-support-xip.patch
> 
>  Support XIP in rd.c
> 
> -cramfs-make-cramfs-little-endian-only.patch
> -cramfs-make-cramfs-little-endian-only-update.patch
> -cramfs-make-cramfs-little-endian-only-fix.patch
> 
>  Dropped
> 
> 
> 5041 commits in 1616 patch files
> 
> All patches:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc5/2.6.24-rc5-mm1/patch-list
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Pierre Peiffer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Remove one useless extern declaration

2007-11-29 Thread Pierre Peiffer


The file exit.c contains one useless extern declaration of sem_exit();
moreover, it refers to nothing (no function of that name exists anymore).

This trivial patch removes it.


Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 kernel/exit.c |2 --
 1 file changed, 2 deletions(-)

Index: b/kernel/exit.c
===
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -50,8 +50,6 @@
 #include 
 #include 
 
-extern void sem_exit (void);
-
 static void exit_mm(struct task_struct * tsk);
 
 static void __unhash_process(struct task_struct *p)

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.24-rc3-mm1 2/3] IPC/semaphores: consolidate SEM_STAT and IPC_STAT commands

2007-11-27 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Both commands (SEM_STAT and IPC_STAT) do much the same thing; only the
meaning of the id given as input and the return value differ (SEM_STAT
takes an index into the internal table and returns the ipc id, whereas
IPC_STAT takes the ipc id itself and returns 0 on success). However, for
the semaphores, they are handled in two different places (two different
functions).

This patch consolidates this for clarification by handling both commands
in semctl_nolock(). It also removes one unused parameter from this
function.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 ipc/sem.c |   38 --
 1 file changed, 16 insertions(+), 22 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -599,8 +599,8 @@ static unsigned long copy_semid_to_user(
}
 }
 
-static int semctl_nolock(struct ipc_namespace *ns, int semid, int semnum,
-   int cmd, int version, union semun arg)
+static int semctl_nolock(struct ipc_namespace *ns, int semid,
+int cmd, int version, union semun arg)
 {
int err = -EINVAL;
struct sem_array *sma;
@@ -639,14 +639,23 @@ static int semctl_nolock(struct ipc_name
return -EFAULT;
return (max_id < 0) ? 0: max_id;
}
+   case IPC_STAT:
case SEM_STAT:
{
struct semid64_ds tbuf;
int id;
 
-   sma = sem_lock(ns, semid);
-   if (IS_ERR(sma))
-   return PTR_ERR(sma);
+   if (cmd == SEM_STAT) {
+   sma = sem_lock(ns, semid);
+   if (IS_ERR(sma))
+   return PTR_ERR(sma);
+   id = sma->sem_perm.id;
+   } else {
+   sma = sem_lock_check(ns, semid);
+   if (IS_ERR(sma))
+   return PTR_ERR(sma);
+   id = 0;
+   }
 
err = -EACCES;
if (ipcperms (&sma->sem_perm, S_IRUGO))
@@ -656,8 +665,6 @@ static int semctl_nolock(struct ipc_name
if (err)
goto out_unlock;
 
-   id = sma->sem_perm.id;
-
memset(&tbuf, 0, sizeof(tbuf));
 
kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
@@ -792,19 +799,6 @@ static int semctl_main(struct ipc_namesp
err = 0;
goto out_unlock;
}
-   case IPC_STAT:
-   {
-   struct semid64_ds tbuf;
-   memset(&tbuf,0,sizeof(tbuf));
-   kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-   tbuf.sem_otime  = sma->sem_otime;
-   tbuf.sem_ctime  = sma->sem_ctime;
-   tbuf.sem_nsems  = sma->sem_nsems;
-   sem_unlock(sma);
-   if (copy_semid_to_user (arg.buf, &tbuf, version))
-   return -EFAULT;
-   return 0;
-   }
/* GETVAL, GETPID, GETNCTN, GETZCNT, SETVAL: fall-through */
}
err = -EINVAL;
@@ -971,15 +965,15 @@ asmlinkage long sys_semctl (int semid, i
switch(cmd) {
case IPC_INFO:
case SEM_INFO:
+   case IPC_STAT:
case SEM_STAT:
-   err = semctl_nolock(ns,semid,semnum,cmd,version,arg);
+   err = semctl_nolock(ns, semid, cmd, version, arg);
return err;
case GETALL:
case GETVAL:
case GETPID:
case GETNCNT:
case GETZCNT:
-   case IPC_STAT:
case SETVAL:
case SETALL:
        err = semctl_main(ns,semid,semnum,cmd,version,arg);

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.24-rc3-mm1 3/3] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()

2007-11-27 Thread pierre . peiffer
sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
ipc_namespace is released, to free all ipcs of each type.
But in fact, they do the same thing: they loop over all ipcs and free
them individually by calling a specific routine.

This patch proposes to consolidate this by introducing a common function,
free_ipcs(), that does the job. The specific routine to call on each
individual ipc is passed as a parameter. For this, these ipc-specific
'free' routines are reworked to take a generic 'struct kern_ipc_perm' as
parameter.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 include/linux/ipc_namespace.h |5 -
 ipc/msg.c |   28 +---
 ipc/namespace.c   |   30 ++
 ipc/sem.c |   27 +--
 ipc/shm.c |   27 ++-
 5 files changed, 50 insertions(+), 67 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,7 @@ struct msg_sender {
 #define msg_unlock(msq)ipc_unlock(&(msq)->q_perm)
 #define msg_buildid(id, seq)   ipc_buildid(id, seq)
 
-static void freeque(struct ipc_namespace *, struct msg_queue *);
+static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -91,26 +91,7 @@ void msg_init_ns(struct ipc_namespace *n
 #ifdef CONFIG_IPC_NS
 void msg_exit_ns(struct ipc_namespace *ns)
 {
-   struct msg_queue *msq;
-   struct kern_ipc_perm *perm;
-   int next_id;
-   int total, in_use;
-
-   down_write(&msg_ids(ns).rw_mutex);
-
-   in_use = msg_ids(ns).in_use;
-
-   for (total = 0, next_id = 0; total < in_use; next_id++) {
-   perm = idr_find(&msg_ids(ns).ipcs_idr, next_id);
-   if (perm == NULL)
-   continue;
-   ipc_lock_by_ptr(perm);
-   msq = container_of(perm, struct msg_queue, q_perm);
-   freeque(ns, msq);
-   total++;
-   }
-
-   up_write(&msg_ids(ns).rw_mutex);
+   free_ipcs(ns, &msg_ids(ns), freeque);
 }
 #endif
 
@@ -274,9 +255,10 @@ static void expunge_all(struct msg_queue
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct msg_queue *msq)
+static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
struct list_head *tmp;
+   struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
 
expunge_all(msq, -EIDRM);
ss_wakeup(&msq->q_senders, 1);
@@ -582,7 +564,7 @@ asmlinkage long sys_msgctl(int msqid, in
break;
}
case IPC_RMID:
-   freeque(ns, msq);
+   freeque(ns, &msq->q_perm);
break;
}
err = 0;
Index: b/ipc/namespace.c
===
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -44,6 +44,36 @@ struct ipc_namespace *copy_ipcs(unsigned
return new_ns;
 }
 
+/*
+ * free_ipcs - free all ipcs of one type
+ * @ns:   the namespace to remove the ipcs from
+ * @ids:  the table of ipcs to free
+ * @free: the function called to free each individual ipc
+ *
+ * Called for each kind of ipc when an ipc_namespace exits.
+ */
+void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
+  void (*free)(struct ipc_namespace *, struct kern_ipc_perm *))
+{
+   struct kern_ipc_perm *perm;
+   int next_id;
+   int total, in_use;
+
+   down_write(&ids->rw_mutex);
+
+   in_use = ids->in_use;
+
+   for (total = 0, next_id = 0; total < in_use; next_id++) {
+   perm = idr_find(&ids->ipcs_idr, next_id);
+   if (perm == NULL)
+   continue;
+   ipc_lock_by_ptr(perm);
+   free(ns, perm);
+   total++;
+   }
+   up_write(&ids->rw_mutex);
+}
+
 void free_ipc_ns(struct kref *kref)
 {
struct ipc_namespace *ns;
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -94,7 +94,7 @@
 #define sem_buildid(id, seq)   ipc_buildid(id, seq)
 
 static int newary(struct ipc_namespace *, struct ipc_params *);
-static void freeary(struct ipc_namespace *, struct sem_array *);
+static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -129,25 +129,7 @@ void sem_init_ns(struct ipc_namespace *n
 #ifdef CONFIG_IPC_NS

[PATCH 2.6.24-rc3-mm1 0/3] [resend] IPC: some code consolidation

2007-11-27 Thread pierre . peiffer
Andrew,

Following this discussion http://lkml.org/lkml/2007/11/27/54, I
resend the three patches that I've sent last friday to let you have all of
them in the right order.

Thanks,
-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.24-rc3-mm1 1/3] IPC: make struct ipc_ids static in ipc_namespace

2007-11-27 Thread pierre . peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (one
each for msg, sem and shm; this is the structure used to store the ipcs).
These pointers are dynamically allocated for each ipc_namespace, as is
the ipc_namespace itself (for the init namespace, they are initialized
with pointers to static variables instead).

It is so for historical reasons: before the use of idr to store the ipcs,
the ipcs were stored in tables of variable length, depending on the
maximum number of ipcs allowed.
Now, these 'struct ipc_ids' have a fixed size. As they are allocated in
any case for each new ipc_namespace, there is no memory gain in having
them allocated separately from the struct ipc_namespace.

This patch proposes to make this table static in the struct
ipc_namespace. Thus, we can allocate everything at once and get rid of
all the code needed to allocate and free these ipc_ids separately.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
Acked-by: Cedric Le Goater <[EMAIL PROTECTED]>
Acked-by: Pavel Emelyanov <[EMAIL PROTECTED]>
---
 include/linux/ipc_namespace.h |   13 +++--
 ipc/msg.c |   26 --
 ipc/namespace.c   |   25 -
 ipc/sem.c |   26 --
 ipc/shm.c |   26 --
 ipc/util.c|6 +++---
 ipc/util.h|   16 
 7 files changed, 34 insertions(+), 104 deletions(-)

Index: b/include/linux/ipc_namespace.h
===
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -2,11 +2,20 @@
 #define __IPC_NAMESPACE_H__
 
 #include 
+#include 
+#include 
+
+struct ipc_ids {
+   int in_use;
+   unsigned short seq;
+   unsigned short seq_max;
+   struct rw_semaphore rw_mutex;
+   struct idr ipcs_idr;
+};
 
-struct ipc_ids;
 struct ipc_namespace {
struct kref kref;
-   struct ipc_ids  *ids[3];
+   struct ipc_ids  ids[3];
 
int sem_ctls[4];
int used_sems;
Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -67,9 +67,7 @@ struct msg_sender {
 #define SEARCH_NOTEQUAL3
 #define SEARCH_LESSEQUAL   4
 
-static struct ipc_ids init_msg_ids;
-
-#define msg_ids(ns)(*((ns)->ids[IPC_MSG_IDS]))
+#define msg_ids(ns)((ns)->ids[IPC_MSG_IDS])
 
 #define msg_unlock(msq)ipc_unlock(&(msq)->q_perm)
 #define msg_buildid(id, seq)   ipc_buildid(id, seq)
@@ -80,30 +78,17 @@ static int newque(struct ipc_namespace *
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
 
-static void __msg_init_ns(struct ipc_namespace *ns, struct ipc_ids *ids)
+void msg_init_ns(struct ipc_namespace *ns)
 {
-   ns->ids[IPC_MSG_IDS] = ids;
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
ns->msg_ctlmni = MSGMNI;
atomic_set(&ns->msg_bytes, 0);
atomic_set(&ns->msg_hdrs, 0);
-   ipc_init_ids(ids);
+   ipc_init_ids(&msg_ids(ns));
 }
 
 #ifdef CONFIG_IPC_NS
-int msg_init_ns(struct ipc_namespace *ns)
-{
-   struct ipc_ids *ids;
-
-   ids = kmalloc(sizeof(struct ipc_ids), GFP_KERNEL);
-   if (ids == NULL)
-   return -ENOMEM;
-
-   __msg_init_ns(ns, ids);
-   return 0;
-}
-
 void msg_exit_ns(struct ipc_namespace *ns)
 {
struct msg_queue *msq;
@@ -126,15 +111,12 @@ void msg_exit_ns(struct ipc_namespace *n
}
 
up_write(&msg_ids(ns).rw_mutex);
-
-   kfree(ns->ids[IPC_MSG_IDS]);
-   ns->ids[IPC_MSG_IDS] = NULL;
 }
 #endif
 
 void __init msg_init(void)
 {
-   __msg_init_ns(&init_ipc_ns, &init_msg_ids);
+   msg_init_ns(&init_ipc_ns);
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
Index: b/ipc/namespace.c
===
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,35 +14,18 @@
 
 static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
 {
-   int err;
struct ipc_namespace *ns;
 
-   err = -ENOMEM;
ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
if (ns == NULL)
-   goto err_mem;
+   return ERR_PTR(-ENOMEM);
 
-   err = sem_init_ns(ns);
-   if (err)
-   goto err_sem;
-   err = msg_init_ns(ns);
-   if (err)
-   goto err_msg;
-   err = shm_init_ns(ns);
-   if (err)
-   goto err_shm;
+

Re: [PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()

2007-11-27 Thread Pierre Peiffer


Andrew Morton wrote:
> On Mon, 26 Nov 2007 22:44:38 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
>> On Fri, 23 Nov 2007 17:52:50 +0100 Pierre Peiffer <[EMAIL PROTECTED]> wrote:
>>
>>> sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an 
>>> ipc_namespace is
>>> released to free all ipcs of each type.
>>> But in fact, they do the same thing: they loop around all ipcs to free them
>>> individually by calling a specific routine.
>>>
>>> This patch proposes to consolidate this by introducing a common function, 
>>> free_ipcs(),
>>> that do the job. The specific routine to call on each individual ipcs is 
>>> passed as
>>> parameter. For this, these ipc-specific 'free' routines are reworked to 
>>> take a
>>> generic 'struct ipc_perm' as parameter.
>> This conflicts in more-than-trivial ways with Pavel's
>> move-the-ipc-namespace-under-ipc_ns-option.patch, which was in
>> 2.6.24-rc3-mm1.
>>
> 
> err, no, it wasn't that patch.  For some reason your change assumes that
> msg_exit_ns() (for example) doesn't have these lines:
> 
> kfree(ns->ids[IPC_MSG_IDS]);
> ns->ids[IPC_MSG_IDS] = NULL;
> 
> in it.

Yes, in fact, I've made this patch on top of this one:
http://lkml.org/lkml/2007/11/22/49

As the patch mentioned in that previous thread was acked by Cedric and
Pavel, I assumed that you would take both. But I didn't make this clear,
sorry.

-- 
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.24-rc3-mm1] IPC: consolidate sem_exit_ns(), msg_exit_ns and shm_exit_ns()

2007-11-23 Thread Pierre Peiffer

sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an
ipc_namespace is released, to free all ipcs of each type.
But in fact, they do the same thing: they loop over all ipcs and free
them individually by calling a specific routine.

This patch proposes to consolidate this by introducing a common function,
free_ipcs(), that does the job. The specific routine to call on each
individual ipc is passed as a parameter. For this, these ipc-specific
'free' routines are reworked to take a generic 'struct kern_ipc_perm' as
parameter.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 include/linux/ipc_namespace.h |5 -
 ipc/msg.c |   28 +---
 ipc/namespace.c   |   30 ++
 ipc/sem.c |   27 +--
 ipc/shm.c |   27 ++-
 5 files changed, 50 insertions(+), 67 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,7 @@ struct msg_sender {
 #define msg_unlock(msq)ipc_unlock(&(msq)->q_perm)
 #define msg_buildid(id, seq)   ipc_buildid(id, seq)
 
-static void freeque(struct ipc_namespace *, struct msg_queue *);
+static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -91,26 +91,7 @@ void msg_init_ns(struct ipc_namespace *n
 #ifdef CONFIG_IPC_NS
 void msg_exit_ns(struct ipc_namespace *ns)
 {
-   struct msg_queue *msq;
-   struct kern_ipc_perm *perm;
-   int next_id;
-   int total, in_use;
-
-   down_write(&msg_ids(ns).rw_mutex);
-
-   in_use = msg_ids(ns).in_use;
-
-   for (total = 0, next_id = 0; total < in_use; next_id++) {
-   perm = idr_find(&msg_ids(ns).ipcs_idr, next_id);
-   if (perm == NULL)
-   continue;
-   ipc_lock_by_ptr(perm);
-   msq = container_of(perm, struct msg_queue, q_perm);
-   freeque(ns, msq);
-   total++;
-   }
-
-   up_write(&msg_ids(ns).rw_mutex);
+   free_ipcs(ns, &msg_ids(ns), freeque);
 }
 #endif
 
@@ -274,9 +255,10 @@ static void expunge_all(struct msg_queue
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct msg_queue *msq)
+static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
struct list_head *tmp;
+   struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
 
expunge_all(msq, -EIDRM);
ss_wakeup(&msq->q_senders, 1);
@@ -582,7 +564,7 @@ asmlinkage long sys_msgctl(int msqid, in
break;
}
case IPC_RMID:
-   freeque(ns, msq);
+   freeque(ns, &msq->q_perm);
break;
}
err = 0;
Index: b/ipc/namespace.c
===
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -44,6 +44,36 @@ struct ipc_namespace *copy_ipcs(unsigned
return new_ns;
 }
 
+/*
+ * free_ipcs - free all ipcs of one type
+ * @ns:   the namespace to remove the ipcs from
+ * @ids:  the table of ipcs to free
+ * @free: the function called to free each individual ipc
+ *
+ * Called for each kind of ipc when an ipc_namespace exits.
+ */
+void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
+  void (*free)(struct ipc_namespace *, struct kern_ipc_perm *))
+{
+   struct kern_ipc_perm *perm;
+   int next_id;
+   int total, in_use;
+
+   down_write(&ids->rw_mutex);
+
+   in_use = ids->in_use;
+
+   for (total = 0, next_id = 0; total < in_use; next_id++) {
+   perm = idr_find(&ids->ipcs_idr, next_id);
+   if (perm == NULL)
+   continue;
+   ipc_lock_by_ptr(perm);
+   free(ns, perm);
+   total++;
+   }
+   up_write(&ids->rw_mutex);
+}
+
 void free_ipc_ns(struct kref *kref)
 {
struct ipc_namespace *ns;
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -94,7 +94,7 @@
 #define sem_buildid(id, seq)   ipc_buildid(id, seq)
 
 static int newary(struct ipc_namespace *, struct ipc_params *);
-static void freeary(struct ipc_namespace *, struct sem_array *);
+static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -129,25 +129,7 @@ void sem_init_ns(struct ipc_namespace *n
 #ifdef CONFIG_IPC_NS

[PATCH 2.6.24-rc3-mm1] IPC/semaphores: consolidate SEM_STAT and IPC_STAT commands

2007-11-23 Thread Pierre Peiffer

Both commands (SEM_STAT and IPC_STAT) do much the same thing (only the
meaning of the id given as input and the return value differ). However,
for the semaphores, they are handled in two different places (two
different functions).

This patch consolidates this for clarification by handling both commands
in the same place, in semctl_nolock(). It also removes one unused
parameter from this function.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 ipc/sem.c |   38 --
 1 file changed, 16 insertions(+), 22 deletions(-)

Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -599,8 +599,8 @@ static unsigned long copy_semid_to_user(
}
 }
 
-static int semctl_nolock(struct ipc_namespace *ns, int semid, int semnum,
-   int cmd, int version, union semun arg)
+static int semctl_nolock(struct ipc_namespace *ns, int semid,
+int cmd, int version, union semun arg)
 {
int err = -EINVAL;
struct sem_array *sma;
@@ -639,14 +639,23 @@ static int semctl_nolock(struct ipc_name
return -EFAULT;
return (max_id < 0) ? 0: max_id;
}
+   case IPC_STAT:
case SEM_STAT:
{
struct semid64_ds tbuf;
int id;
 
-   sma = sem_lock(ns, semid);
-   if (IS_ERR(sma))
-   return PTR_ERR(sma);
+   if (cmd == SEM_STAT) {
+   sma = sem_lock(ns, semid);
+   if (IS_ERR(sma))
+   return PTR_ERR(sma);
+   id = sma->sem_perm.id;
+   } else {
+   sma = sem_lock_check(ns, semid);
+   if (IS_ERR(sma))
+   return PTR_ERR(sma);
+   id = 0;
+   }
 
err = -EACCES;
if (ipcperms (&sma->sem_perm, S_IRUGO))
@@ -656,8 +665,6 @@ static int semctl_nolock(struct ipc_name
if (err)
goto out_unlock;
 
-   id = sma->sem_perm.id;
-
memset(&tbuf, 0, sizeof(tbuf));
 
kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
@@ -792,19 +799,6 @@ static int semctl_main(struct ipc_namesp
err = 0;
goto out_unlock;
}
-   case IPC_STAT:
-   {
-   struct semid64_ds tbuf;
-   memset(&tbuf,0,sizeof(tbuf));
-   kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-   tbuf.sem_otime  = sma->sem_otime;
-   tbuf.sem_ctime  = sma->sem_ctime;
-   tbuf.sem_nsems  = sma->sem_nsems;
-   sem_unlock(sma);
-   if (copy_semid_to_user (arg.buf, &tbuf, version))
-   return -EFAULT;
-   return 0;
-   }
/* GETVAL, GETPID, GETNCTN, GETZCNT, SETVAL: fall-through */
}
err = -EINVAL;
@@ -971,15 +965,15 @@ asmlinkage long sys_semctl (int semid, i
switch(cmd) {
case IPC_INFO:
case SEM_INFO:
+   case IPC_STAT:
case SEM_STAT:
-   err = semctl_nolock(ns,semid,semnum,cmd,version,arg);
+   err = semctl_nolock(ns, semid, cmd, version, arg);
return err;
case GETALL:
case GETVAL:
case GETPID:
case GETNCNT:
case GETZCNT:
-   case IPC_STAT:
case SETVAL:
case SETALL:
    err = semctl_main(ns,semid,semnum,cmd,version,arg);

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc3-mm1] IPC: make struct ipc_ids static in ipc_namespace

2007-11-23 Thread Pierre Peiffer
Pavel Emelyanov wrote:
> Well I think you're right. The structure gains 50% in size... Really too
> much to fight for performance in IPC :)
> 
> Thanks for checking this thing.
> 
> You may put my Acked-by in the original patch.
> 

Cool. Thanks !

P.

> Thanks,
> Pavel
> 

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-rc3-mm1] IPC: make struct ipc_ids static in ipc_namespace

2007-11-23 Thread Pierre Peiffer
Ok, I have the patch ready, but before sending it, I worry about the
size of struct ipc_namespace if we mark struct ipc_ids as
____cacheline_aligned.

Of course, we fall into a classical trade-off: performance vs memory
size.

As I don't think that I have the knowledge to decide what we must focus
on, hereafter is, for info, the size reported by pahole (on x86, Intel
Xeon):
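
(Output like the dumps below can be obtained with something along the
lines of 'pahole -C ipc_namespace vmlinux'; the exact invocation is an
assumption, not taken from the thread.)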

With the patch sent at the beginning of this thread we have:

struct ipc_namespace {
struct krefkref; /* 0 4 */
struct ipc_ids ids[3];   /* 4   156 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
intsem_ctls[4];  /*   16016 */
intused_sems;/*   176 4 */
intmsg_ctlmax;   /*   180 4 */
intmsg_ctlmnb;   /*   184 4 */
intmsg_ctlmni;   /*   188 4 */
/* --- cacheline 3 boundary (192 bytes) --- */
atomic_t   msg_bytes;/*   192 4 */
atomic_t   msg_hdrs; /*   196 4 */
size_t shm_ctlmax;   /*   200 4 */
size_t shm_ctlall;   /*   204 4 */
intshm_ctlmni;   /*   208 4 */
intshm_tot;  /*   212 4 */

/* size: 216, cachelines: 4 */
/* last cacheline: 24 bytes */
};  /* definitions: 1 */

With the new patch, if we mark the struct ipc_ids as
____cacheline_aligned, we have (I put kref at the end, to save one more
cacheline):

struct ipc_namespace {
struct ipc_ids sem_ids;  /* 064 */

/* XXX last struct has 12 bytes of padding */

/* --- cacheline 1 boundary (64 bytes) --- */
intsem_ctls[4];  /*6416 */
intused_sems;/*80 4 */

/* XXX 44 bytes hole, try to pack */

/* --- cacheline 2 boundary (128 bytes) --- */
struct ipc_ids msg_ids;  /*   12864 */

/* XXX last struct has 12 bytes of padding */

/* --- cacheline 3 boundary (192 bytes) --- */
intmsg_ctlmax;   /*   192 4 */
intmsg_ctlmnb;   /*   196 4 */
intmsg_ctlmni;   /*   200 4 */
atomic_t   msg_bytes;/*   204 4 */
atomic_t   msg_hdrs; /*   208 4 */

/* XXX 44 bytes hole, try to pack */

/* --- cacheline 4 boundary (256 bytes) --- */
struct ipc_ids shm_ids;  /*   25664 */

/* XXX last struct has 12 bytes of padding */

/* --- cacheline 5 boundary (320 bytes) --- */
size_t shm_ctlmax;   /*   320 4 */
size_t shm_ctlall;   /*   324 4 */
intshm_ctlmni;   /*   328 4 */
intshm_tot;  /*   332 4 */
struct krefkref; /*   336 4 */

/* size: 384, cachelines: 6 */
/* sum members: 252, holes: 2, sum holes: 88 */
/* padding: 44 */
/* paddings: 3, sum paddings: 36 */
};  /* definitions: 1 */

We could put all sysctl-related values together in one cacheline and
keep each ipc_ids cacheline-aligned? But I really wonder about the
performance gain here...
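
Concretely, the variant under discussion could look something like the
following (a sketch only, assuming the standard ____cacheline_aligned
annotation from linux/cache.h; this is not the final patch):

/* Sketch: align every ipc_ids instance to a cacheline boundary and
 * group each one with its sysctl values. */
struct ipc_ids {
	int in_use;
	unsigned short seq;
	unsigned short seq_max;
	struct rw_semaphore rw_mutex;
	struct idr ipcs_idr;
} ____cacheline_aligned;

struct ipc_namespace {
	struct ipc_ids	sem_ids;	/* starts a fresh cacheline */
	int		sem_ctls[4];
	int		used_sems;
	/* ... msg_ids and shm_ids grouped the same way; kref last ... */
};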

Humm humm, comments?

P.

Pavel Emelyanov wrote:
> Pierre Peiffer wrote:
>> Hi,
>>
>>  Thanks for reviewing this !
>>
>> Pavel Emelyanov wrote:
>>> Pavel Emelyanov wrote:
>>>> Cedric Le Goater wrote:
>>>>> Pierre Peiffer wrote:
>>> [snip]
>>>
>>>>> Pavel, what do you think of it ? 
>>>> Looks sane, good catch, Pierre.
>>>>
>>>> But I'd find out whether these three ipc_ids intersect any 
>>>> cache-line. In other words I'd mark the struct ipc_ids as
>>>> cacheline_aligned and checked for any differences.
>>> BTW! It might be also useful to keep ipc_ids closer to their
>>> sysctl parameters.
>>>
>> It makes sense indeed.
>>
>> That would mean to have something like this, right ?
> 
> Yup :)
> 
>> struct ipc_namespace {
>>  struct kref kref;
>>
>>  struct ipc_ids  sem_ids;
>>  int sem_ctls[4];
>>  int used_se

Re: [PATCH 2.6.24-rc3-mm1] IPC: make struct ipc_ids static in ipc_namespace

2007-11-23 Thread Pierre Peiffer
Hi,

Thanks for reviewing this !

Pavel Emelyanov wrote:
> Pavel Emelyanov wrote:
>> Cedric Le Goater wrote:
>>> Pierre Peiffer wrote:
> 
> [snip]
> 
>>> Pavel, what do you think of it ? 
>> Looks sane, good catch, Pierre.
>>
>> But I'd find out whether these three ipc_ids intersect any 
>> cache-line. In other words I'd mark the struct ipc_ids as
>> cacheline_aligned and checked for any differences.
> 
> BTW! It might be also useful to keep ipc_ids closer to their
> sysctl parameters.
> 

It makes sense indeed.

That would mean having something like this, right?

struct ipc_namespace {
struct kref kref;

struct ipc_ids  sem_ids;
int sem_ctls[4];
int used_sems;

struct ipc_ids  msg_ids;
int msg_ctlmax;
int msg_ctlmnb;
int msg_ctlmni;
atomic_tmsg_bytes;
atomic_tmsg_hdrs;

struct ipc_ids  shm_ids;
size_t  shm_ctlmax;
size_t  shm_ctlall;
int shm_ctlmni;
int shm_tot;
};

After a quick look, that implies reworking procfs a little bit...
otherwise, it's not a big deal as far as I can see.

P.

>>> Acked-by: Cedric Le Goater <[EMAIL PROTECTED]>
>>>
>>> Thanks,
>> Thanks,
>> Pavel
>>
>>> C.
> 
> [snip]
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.24-rc3-mm1] IPC: make struct ipc_ids static in ipc_namespace

2007-11-22 Thread Pierre Peiffer


Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (one
each for msg, sem and shm; this is the structure used to store all the
ipcs).
These 'struct ipc_ids' are dynamically allocated for each ipc_namespace,
as is the ipc_namespace itself (for the init namespace, they are
initialized with pointers to static variables instead).

It is so for historical reasons: before the use of idr to store the ipcs,
the ipcs were stored in tables of variable length, depending on the
maximum number of ipcs allowed.
Now, these 'struct ipc_ids' have a fixed size. As they are allocated in
any case for each new ipc_namespace, there is no memory gain in having
them allocated separately from the struct ipc_namespace.

This patch proposes to make this table static in the struct
ipc_namespace. Thus, we can allocate everything at once and get rid of
all the code needed to allocate and free these ipc_ids separately.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---
 include/linux/ipc_namespace.h |   13 +++--
 ipc/msg.c |   26 --
 ipc/namespace.c   |   25 -
 ipc/sem.c |   26 --
 ipc/shm.c |   26 --
 ipc/util.c|6 +++---
 ipc/util.h|   16 
 7 files changed, 34 insertions(+), 104 deletions(-)

Index: b/include/linux/ipc_namespace.h
===
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -2,11 +2,20 @@
 #define __IPC_NAMESPACE_H__
 
 #include 
+#include 
+#include 
+
+struct ipc_ids {
+   int in_use;
+   unsigned short seq;
+   unsigned short seq_max;
+   struct rw_semaphore rw_mutex;
+   struct idr ipcs_idr;
+};
 
-struct ipc_ids;
 struct ipc_namespace {
struct kref kref;
-   struct ipc_ids  *ids[3];
+   struct ipc_ids  ids[3];
 
int sem_ctls[4];
int used_sems;
Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -67,9 +67,7 @@ struct msg_sender {
 #define SEARCH_NOTEQUAL3
 #define SEARCH_LESSEQUAL   4
 
-static struct ipc_ids init_msg_ids;
-
-#define msg_ids(ns)(*((ns)->ids[IPC_MSG_IDS]))
+#define msg_ids(ns)((ns)->ids[IPC_MSG_IDS])
 
 #define msg_unlock(msq)ipc_unlock(&(msq)->q_perm)
 #define msg_buildid(id, seq)   ipc_buildid(id, seq)
@@ -80,30 +78,17 @@ static int newque(struct ipc_namespace *
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
 
-static void __msg_init_ns(struct ipc_namespace *ns, struct ipc_ids *ids)
+void msg_init_ns(struct ipc_namespace *ns)
 {
-   ns->ids[IPC_MSG_IDS] = ids;
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
ns->msg_ctlmni = MSGMNI;
atomic_set(&ns->msg_bytes, 0);
atomic_set(&ns->msg_hdrs, 0);
-   ipc_init_ids(ids);
+   ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
-int msg_init_ns(struct ipc_namespace *ns)
-{
-   struct ipc_ids *ids;
-
-   ids = kmalloc(sizeof(struct ipc_ids), GFP_KERNEL);
-   if (ids == NULL)
-   return -ENOMEM;
-
-   __msg_init_ns(ns, ids);
-   return 0;
-}
-
 void msg_exit_ns(struct ipc_namespace *ns)
 {
struct msg_queue *msq;
@@ -126,15 +111,12 @@ void msg_exit_ns(struct ipc_namespace *n
}
 
up_write(&msg_ids(ns).rw_mutex);
-
-   kfree(ns->ids[IPC_MSG_IDS]);
-   ns->ids[IPC_MSG_IDS] = NULL;
 }
 #endif
 
 void __init msg_init(void)
 {
-   __msg_init_ns(&init_ipc_ns, &init_msg_ids);
+   msg_init_ns(&init_ipc_ns);
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
Index: b/ipc/namespace.c
===
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,35 +14,18 @@
 
 static struct ipc_namespace *clone_ipc_ns(struct ipc_namespace *old_ns)
 {
-   int err;
struct ipc_namespace *ns;
 
-   err = -ENOMEM;
ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
if (ns == NULL)
-   goto err_mem;
+   return ERR_PTR(-ENOMEM);
 
-   err = sem_init_ns(ns);
-   if (err)
-   goto err_sem;
-   err = msg_init_ns(ns);
-   if (err)
-   goto err_msg;
-   err = shm_init_ns(ns);
-   if (err)
-   goto err_shm;
+   sem_init_ns(ns);
+   msg_init_ns(ns);
+   shm_init_ns(ns);
 
kref_init(&ns->kref);
   

Re: [PATCH 2.6.23-mm1] Change the ida/idr_pre_get() return value to follow the kernel convention

2007-10-31 Thread Pierre Peiffer

Andrew Morton wrote:
> On Tue, 30 Oct 2007 17:13:50 +0100
> Pierre Peiffer <[EMAIL PROTECTED]> wrote:
> 
>> ida_pre_get() and idr_pre_get() currently return 0 in case of error, and 1
>> in case of success, which is not the conventional way to handle error cases.
>>
>> This patch makes both of them return 0 in case of success, and an errcode
>> otherwise, and then it changes each caller to return the reported error
>> instead of ENOMEM. This saves the callers from making any assumption
>> about the cause of the error.
>>
> 
> If we're going to do this (and really we should), then we risk quietly
> breaking out-of-tree code and we risk breaking in-tree or
> soon-to-be-in-tree code which we didn't know about.

Indeed...

> So what we should do is to rename these functions when we change their
> interfaces.
> 
> Happily, this means that we then don't need to remove the old functions -
> we can keep them there for a while as we transition everything over to the
> new functions.  It also means that we can sneak the new functions into the
> 2.6.24 stream and then merge these changes via the relevant maintainers
> rather than needing a single atomic megapatch.
> 
> 
> Although I'd much prefer just to remove the pathetic things - idr_pre_get()
> is a truly awful interface.
> 
> It would be slightly better if it took an `id' argument and filled in the
> nodes at the appropriate position in the tree so that the caller is
> guaranteed that the subsequent idr_get_new() will succeed.  Because at
> present there is no guarantee that the nodes which you preallocated with
> idr_pre_get() are still available when you do your idr_get_new(): some
> other CPU/task might have come in and used them all.

Yes, that's what I've understood, indeed.
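
For readers of the archive: this race is exactly why every user of the
old interface needs a retry loop around the two calls. A minimal sketch
(using the old 1-on-success convention of idr_pre_get() and the
idr_get_new() of that era, declared in linux/idr.h):

static int allocate_id(struct idr *idr, void *ptr, spinlock_t *lock)
{
	int id, err;
again:
	if (!idr_pre_get(idr, GFP_KERNEL))
		return -ENOMEM;
	spin_lock(lock);
	err = idr_get_new(idr, ptr, &id);	/* -EAGAIN: preallocation gone */
	spin_unlock(lock);
	if (err == -EAGAIN)
		goto again;
	return err ? err : id;
}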

> 
> Storage classes which need to allocate memory at insertion time are hard:
> radix_tree_preload() gets it right in terms of robustness, but it's an
> awful lot of fuss.
> 
> IDR gets it all wrong and compounds the problem by implementing internal
> locking.  It shouldn't have done that: storage code like this should use
> only caller-provided locking.

Ok, but for that, I prefer to let the IDR maintainers see what they can
do, because I'm not familiar at all with the IDR implementation and
cannot focus on that now.

So do you think that just providing a new API (something like
idr_pre_allocate(), as you suggest above) would be better than nothing?
Or should we just leave the code as is for now?

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.23-mm1] Change the ida/idr_pre_get() return value to follow the kernel convention

2007-10-30 Thread Pierre Peiffer

ida_pre_get() and idr_pre_get() currently return 0 in case of error, and
1 in case of success, which is not the conventional way to handle error
cases.

This patch makes both of them return 0 in case of success, and an errcode
otherwise, and then it changes each caller to return the reported error
instead of ENOMEM. This saves the callers from making any assumption
about the cause of the error.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 arch/powerpc/mm/mmu_context_64.c   |5 +++--
 block/bsg.c|4 +---
 drivers/char/drm/drm_context.c |5 +++--
 drivers/char/drm/drm_drawable.c|5 +++--
 drivers/char/tty_io.c  |5 +++--
 drivers/dca/dca-sysfs.c|7 ---
 drivers/firewire/fw-device.c   |4 ++--
 drivers/hwmon/hwmon.c  |5 +++--
 drivers/i2c/i2c-core.c |   10 ++
 drivers/infiniband/core/cm.c   |3 ++-
 drivers/infiniband/core/cma.c  |4 ++--
 drivers/infiniband/core/sa_query.c |5 +++--
 drivers/infiniband/core/ucm.c  |2 +-
 drivers/infiniband/core/ucma.c |4 ++--
 drivers/infiniband/core/uverbs_cmd.c   |5 +++--
 drivers/infiniband/hw/amso1100/c2_qp.c |2 +-
 drivers/infiniband/hw/cxgb3/iwch.h |7 ---
 drivers/infiniband/hw/ehca/ehca_cq.c   |5 +++--
 drivers/infiniband/hw/ehca/ehca_qp.c   |4 ++--
 drivers/infiniband/hw/ipath/ipath_driver.c |   10 +-
 drivers/md/dm.c|8 
 drivers/misc/tifm_core.c   |5 +++--
 drivers/mmc/core/host.c|5 +++--
 drivers/rtc/class.c|5 ++---
 drivers/scsi/lpfc/lpfc_init.c  |2 +-
 drivers/scsi/sd.c  |2 +-
 drivers/scsi/sg.c  |4 ++--
 drivers/uio/uio.c  |5 +++--
 drivers/usb/core/endpoint.c|5 +++--
 drivers/video/display/display-sysfs.c  |2 +-
 drivers/w1/slaves/w1_ds2760.c  |4 ++--
 fs/dlm/lowcomms.c  |2 +-
 fs/inotify.c   |2 +-
 fs/ocfs2/cluster/tcp.c |6 +++---
 fs/proc/generic.c  |2 +-
 fs/super.c |5 +++--
 fs/sysfs/dir.c |4 ++--
 ipc/util.c |6 ++
 kernel/posix-timers.c  |2 +-
 lib/idr.c  |   20 ++--
 net/9p/util.c  |2 +-
 net/sctp/associola.c   |5 +++--
 42 files changed, 109 insertions(+), 95 deletions(-)

Index: b/arch/powerpc/mm/mmu_context_64.c
===
--- a/arch/powerpc/mm/mmu_context_64.c
+++ b/arch/powerpc/mm/mmu_context_64.c
@@ -30,8 +30,9 @@ int init_new_context(struct task_struct 
int err;
 
 again:
-   if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL))
-   return -ENOMEM;
+   err = idr_pre_get(&mmu_context_idr, GFP_KERNEL);
+   if (err)
+   return err;
 
spin_lock(&mmu_context_lock);
err = idr_get_new_above(&mmu_context_idr, NULL, 1, &index);
Index: b/block/bsg.c
===
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -962,10 +962,8 @@ int bsg_register_queue(struct request_qu
mutex_lock(&bsg_mutex);
 
ret = idr_pre_get(&bsg_minor_idr, GFP_KERNEL);
-   if (!ret) {
-   ret = -ENOMEM;
+   if (ret)
goto unlock;
-   }
 
ret = idr_get_new(&bsg_minor_idr, bcd, &minor);
if (ret < 0)
Index: b/drivers/char/drm/drm_context.c
===
--- a/drivers/char/drm/drm_context.c
+++ b/drivers/char/drm/drm_context.c
@@ -78,9 +78,10 @@ static int drm_ctxbitmap_next(struct drm
int ret;
 
 again:
-   if (idr_pre_get(&dev->ctx_idr, GFP_KERNEL) == 0) {
+   ret = idr_pre_get(&dev->ctx_idr, GFP_KERNEL);
+   if (ret) {
DRM_ERROR("Out of memory expanding drawable idr\n");
-   return -ENOMEM;
+   return ret;
}
mutex_lock(&dev->struct_mutex);
ret = idr_get_new_above(&dev->ctx_idr, NULL,
Index: b/drivers/char/drm/drm_drawable.c
===
--- a/drivers/char/drm/drm_drawable.c
+++ b/drivers/char/drm/drm_drawable.c
@@ -48,9 +48,10 @@ int drm_adddraw(struct drm_device *dev, 
int ret;
 
 again:
-   if (idr_pre_get(&dev->drw_idr,

[RFC][PATCH] IPC: fix error check in all new xxx_lock() and xxx_exit_ns() functions

2007-10-23 Thread Pierre Peiffer
This is a resend of a patch sent a few days (or weeks) ago.
It has been updated with some more corrections.

In the new implementation of the [sem|shm|msg]_lock[_check]() routines,
we use the return value of ipc_lock() in container_of() without any
check. But ipc_lock may return an errcode, and the use of this errcode in
container_of() may alter it, which we don't want.

And in xxx_exit_ns, the pointer returned by idr_find is of type 'struct
kern_ipc_perm'...

Today, the code works as is because the member used in these
container_of() calls is the first member of its container (offset == 0),
so the errcode isn't changed. But in the general case, we can't count on
this assumption, and it may later lead to a real bug if we don't correct
this.
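
To illustrate the problem (a sketch, not part of the patch):
container_of() subtracts the member's offset from the pointer, so an
ERR_PTR value only survives when that offset happens to be zero:

/* Sketch only: what happens when ipc_lock() fails. */
struct kern_ipc_perm *ipcp = ERR_PTR(-EINVAL);	/* errcode, not an object */

/* container_of() computes
 *	(struct msg_queue *)((char *)ipcp - offsetof(struct msg_queue, q_perm))
 * With q_perm as the first member (offset 0) the errcode survives, so
 * IS_ERR() in the caller still works. If q_perm ever moved, the caller
 * would get a shifted value that IS_ERR() no longer recognizes. */
struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);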

Again, the proposed solution is simple and correct. But, as pointed out
by Nadia, with this solution the same check will be done several times
(in all sub-callers...), which is not optimal...

That's why I send this as an RFC.
Comments or other proposals are welcome, but there are some corrections
to do anyway.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 ipc/msg.c |   17 ++---
 ipc/sem.c |   17 ++---
 ipc/shm.c |   20 +---
 3 files changed, 45 insertions(+), 9 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -106,6 +106,7 @@ int msg_init_ns(struct ipc_namespace *ns
 void msg_exit_ns(struct ipc_namespace *ns)
 {
struct msg_queue *msq;
+   struct kern_ipc_perm *perm;
int next_id;
int total, in_use;
 
@@ -114,10 +115,11 @@ void msg_exit_ns(struct ipc_namespace *n
in_use = msg_ids(ns).in_use;
 
for (total = 0, next_id = 0; total < in_use; next_id++) {
-   msq = idr_find(&msg_ids(ns).ipcs_idr, next_id);
-   if (msq == NULL)
+   perm = idr_find(&msg_ids(ns).ipcs_idr, next_id);
+   if (perm == NULL)
continue;
-   ipc_lock_by_ptr(&msq->q_perm);
+   ipc_lock_by_ptr(perm);
+   msq = container_of(perm, struct msg_queue, q_perm);
freeque(ns, msq);
total++;
}
@@ -145,6 +147,9 @@ static inline struct msg_queue *msg_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check_down(&msg_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct msg_queue *)ipcp;
+
return container_of(ipcp, struct msg_queue, q_perm);
 }
 
@@ -156,6 +161,9 @@ static inline struct msg_queue *msg_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock(&msg_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct msg_queue *)ipcp;
+
return container_of(ipcp, struct msg_queue, q_perm);
 }
 
@@ -164,6 +172,9 @@ static inline struct msg_queue *msg_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check(&msg_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct msg_queue *)ipcp;
+
return container_of(ipcp, struct msg_queue, q_perm);
 }
 
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -143,6 +143,7 @@ int sem_init_ns(struct ipc_namespace *ns
 void sem_exit_ns(struct ipc_namespace *ns)
 {
struct sem_array *sma;
+   struct kern_ipc_perm *perm;
int next_id;
int total, in_use;
 
@@ -151,10 +152,11 @@ void sem_exit_ns(struct ipc_namespace *n
in_use = sem_ids(ns).in_use;
 
for (total = 0, next_id = 0; total < in_use; next_id++) {
-   sma = idr_find(&sem_ids(ns).ipcs_idr, next_id);
-   if (sma == NULL)
+   perm = idr_find(&sem_ids(ns).ipcs_idr, next_id);
+   if (perm == NULL)
continue;
-   ipc_lock_by_ptr(&sma->sem_perm);
+   ipc_lock_by_ptr(perm);
+   sma = container_of(perm, struct sem_array, sem_perm);
freeary(ns, sma);
total++;
}
@@ -181,6 +183,9 @@ static inline struct sem_array *sem_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check_down(&sem_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct sem_array *)ipcp;
+
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
@@ -192,6 +197,9 @@ static inline struct sem_array *sem_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock(&sem_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct sem_array *)ipcp;
+
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
@@ -200,6 +208,9 @@ static inline struct sem_array *sem_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check(&sem_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct sem_a

[RFC][PATCH -mm] IPC: fix error checking in all new xxx_lock() functions

2007-10-11 Thread Pierre Peiffer

In the new implementation of the [sem|shm|msg]_lock[_check]() routines,
we use the return value of ipc_lock() in container_of() without any
check. But ipc_lock may return an errcode, and the use of this errcode in
container_of() may alter it, which we don't want.

Today, there is no problem because the member used in these
container_of() calls is the first member of its container (offset == 0),
so the errcode isn't changed. But in the general case, we can't count on
this assumption, and it may later lead to a real bug if we don't correct
this.

In fact, the proposed solution is simple and correct. But it has the
drawback of adding one more check ('if' statement) in the chain: we do a
first check in ipc_lock(), now another in xxx_lock(), and then one later
in the caller of xxx_lock(). That's why I send this as an RFC; maybe
another approach could be considered.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 ipc/msg.c |6 ++
 ipc/sem.c |6 ++
 ipc/shm.c |6 ++
 3 files changed, 18 insertions(+)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -140,6 +140,9 @@ static inline struct msg_queue *msg_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock(&msg_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct msg_queue *)ipcp;
+
return container_of(ipcp, struct msg_queue, q_perm);
 }
 
@@ -148,6 +151,9 @@ static inline struct msg_queue *msg_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check(&msg_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct msg_queue *)ipcp;
+
return container_of(ipcp, struct msg_queue, q_perm);
 }
 
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -178,6 +178,9 @@ static inline struct sem_array *sem_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock(&sem_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct sem_array *)ipcp;
+
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
@@ -186,6 +189,9 @@ static inline struct sem_array *sem_lock
 {
struct kern_ipc_perm *ipcp = ipc_lock_check(&sem_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct sem_array *)ipcp;
+
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -145,6 +145,9 @@ static inline struct shmid_kernel *shm_l
 {
struct kern_ipc_perm *ipcp = ipc_lock(&shm_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct shmid_kernel *)ipcp;
+
return container_of(ipcp, struct shmid_kernel, shm_perm);
 }
 
@@ -153,6 +156,9 @@ static inline struct shmid_kernel *shm_l
 {
struct kern_ipc_perm *ipcp = ipc_lock_check(&shm_ids(ns), id);
 
+   if (IS_ERR(ipcp))
+   return (struct shmid_kernel *)ipcp;
+
return container_of(ipcp, struct shmid_kernel, shm_perm);
 }
 


Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] IPC: fix error case when idr-cache is empty in ipcget()

2007-10-11 Thread Pierre Peiffer
I resend this patch, by taking into account Nadia's remarks.

With the use of idr to store the IPCs, the case where the idr cache is
empty when idr_get_new() is called (this may happen even if we call
idr_pre_get() beforehand) is not well handled: it lets
semget()/shmget()/msgget() return ENOSPC when this cache is empty, which
1. does not reflect the actual failure and 2. does not conform to the man
pages.

This patch fixes this by retrying the whole allocation process in this
case; a sketch of the resulting idiom follows.
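For reference, the resulting allocation idiom looks roughly like this (a
minimal sketch against the idr API of that era; ids_mutex stands in for the
ipc_ids mutex held by the real callers):

	static DEFINE_MUTEX(ids_mutex);	/* stands in for ipc_ids.mutex */

	int alloc_ipc_id(struct idr *idr, void *ptr, int *id)
	{
		int err;
	retry:
		/* refill the per-idr node cache; may sleep */
		if (!idr_pre_get(idr, GFP_KERNEL))
			return -ENOMEM;

		mutex_lock(&ids_mutex);
		err = idr_get_new(idr, ptr, id);
		mutex_unlock(&ids_mutex);

		/* another allocator consumed the preloaded node: try again */
		if (err == -EAGAIN)
			goto retry;

		return err;
	}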

This patch applies on top of 2.6.23-rc8-mm2 and should probably be merged
in 2.6.24 if Nadia's patches are included.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 ipc/msg.c  |4 ++--
 ipc/sem.c  |4 ++--
 ipc/shm.c  |5 +++--
 ipc/util.c |   16 +++-
 4 files changed, 18 insertions(+), 11 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -188,10 +188,10 @@ static int newque(struct ipc_namespace *
 * ipc_addid() locks msq
 */
id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
-   if (id == -1) {
+   if (id < 0) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
-   return -ENOSPC;
+   return id;
}
 
msq->q_perm.id = msg_buildid(ns, id, msq->q_perm.seq);
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -269,10 +269,10 @@ static int newary(struct ipc_namespace *
}
 
id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
-   if(id == -1) {
+   if (id < 0) {
security_sem_free(sma);
ipc_rcu_putref(sma);
-   return -ENOSPC;
+   return id;
}
ns->used_sems += nsems;
 
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -409,10 +409,11 @@ static int newseg(struct ipc_namespace *
if (IS_ERR(file))
goto no_file;
 
-   error = -ENOSPC;
id = shm_addid(ns, shp);
-   if(id == -1) 
+   if (id < 0) {
+   error = id;
goto no_id;
+   }
 
shp->shm_cprid = task_tgid_vnr(current);
shp->shm_lprid = 0;
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -261,7 +261,7 @@ int ipc_get_maxid(struct ipc_ids *ids)
  * Add an entry 'new' to the IPC ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
- * On failure the entry is not locked and -1 is returned.
+ * On failure the entry is not locked and a negative err-code is returned.
  *
  * Called with ipc_ids.mutex held.
  */
@@ -274,11 +274,11 @@ int ipc_addid(struct ipc_ids* ids, struc
size = IPCMNI;
 
if (ids->in_use >= size)
-   return -1;
+   return -ENOSPC;
 
err = idr_get_new(&ids->ipcs_idr, new, &id);
if (err)
-   return -1;
+   return err;
 
ids->in_use++;
 
@@ -310,7 +310,7 @@ int ipcget_new(struct ipc_namespace *ns,
struct ipc_ops *ops, struct ipc_params *params)
 {
int err;
-
+retry:
err = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
 
if (!err)
@@ -320,6 +320,9 @@ int ipcget_new(struct ipc_namespace *ns,
err = ops->getnew(ns, params);
mutex_unlock(&ids->mutex);
 
+   if (err == -EAGAIN)
+   goto retry;
+
return err;
 }
 
@@ -373,7 +376,7 @@ int ipcget_public(struct ipc_namespace *
struct kern_ipc_perm *ipcp;
int flg = params->flg;
int err;
-
+retry:
err = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
 
mutex_lock(&ids->mutex);
@@ -406,6 +409,9 @@ int ipcget_public(struct ipc_namespace *
    }
mutex_unlock(&ids->mutex);
 
+   if (err == -EAGAIN)
+   goto retry;
+
return err;
 }

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] IPC: fix error case when idr-cache is empty in ipcget()

2007-10-10 Thread Pierre Peiffer
With the use of idr to store the IPCs, the case where the idr cache is
empty when idr_get_new() is called (this may happen even if we call
idr_pre_get() beforehand) is not well handled: it lets
semget()/shmget()/msgget() return ENOSPC when this cache is empty, which
1. does not reflect the actual failure and 2. does not conform to the man
pages.

This patch fixes this by retrying the whole allocation process in this case.

Note: we could directly return ENOMEM when idr_pre_get() fails, but a failed
idr_pre_get() does not necessarily mean that the cache is still empty by the
time idr_get_new() runs, so we still attempt the allocation and only give up
when it fails after an unsuccessful preload...

This patch applies on top of 2.6.23-rc8-mm2 and should probably be merged
in 2.6.24 if Nadia's patches are included.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 ipc/msg.c  |4 ++--
 ipc/sem.c  |4 ++--
 ipc/shm.c  |5 +++--
 ipc/util.c |   35 +--
 4 files changed, 28 insertions(+), 20 deletions(-)

Index: b/ipc/msg.c
===
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -188,10 +188,10 @@ static int newque(struct ipc_namespace *
 * ipc_addid() locks msq
 */
id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
-   if (id == -1) {
+   if (id < 0) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
-   return -ENOSPC;
+   return id;
}
 
msq->q_perm.id = msg_buildid(ns, id, msq->q_perm.seq);
Index: b/ipc/sem.c
===
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -269,10 +269,10 @@ static int newary(struct ipc_namespace *
}
 
id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
-   if(id == -1) {
+   if (id < 0) {
security_sem_free(sma);
ipc_rcu_putref(sma);
-   return -ENOSPC;
+   return id;
}
ns->used_sems += nsems;
 
Index: b/ipc/shm.c
===
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -409,10 +409,11 @@ static int newseg(struct ipc_namespace *
if (IS_ERR(file))
goto no_file;
 
-   error = -ENOSPC;
id = shm_addid(ns, shp);
-   if(id == -1) 
+   if (id < 0) {
+   error = id;
goto no_id;
+   }
 
shp->shm_cprid = task_tgid_vnr(current);
shp->shm_lprid = 0;
Index: b/ipc/util.c
===
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -261,7 +261,7 @@ int ipc_get_maxid(struct ipc_ids *ids)
  * Add an entry 'new' to the IPC ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
- * On failure the entry is not locked and -1 is returned.
+ * On failure the entry is not locked and a negative err-code is returned.
  *
  * Called with ipc_ids.mutex held.
  */
@@ -274,11 +274,11 @@ int ipc_addid(struct ipc_ids* ids, struc
size = IPCMNI;
 
if (ids->in_use >= size)
-   return -1;
+   return -ENOSPC;
 
err = idr_get_new(&ids->ipcs_idr, new, &id);
if (err)
-   return -1;
+   return err;
 
ids->in_use++;
 
@@ -309,17 +309,20 @@ int ipc_addid(struct ipc_ids* ids, struc
 int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
struct ipc_ops *ops, struct ipc_params *params)
 {
-   int err;
-
-   err = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
-
-   if (!err)
-   return -ENOMEM;
+   int err, alloc;
+retry:
+   alloc = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
 
mutex_lock(&ids->mutex);
err = ops->getnew(ns, params);
mutex_unlock(&ids->mutex);
 
+   if (err == -EAGAIN) {
+   if (alloc)
+   goto retry;
+   else
+   err = -ENOMEM;
+   }
return err;
 }
 
@@ -372,9 +375,9 @@ int ipcget_public(struct ipc_namespace *
 {
struct kern_ipc_perm *ipcp;
int flg = params->flg;
-   int err;
-
-   err = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
+   int err, alloc;
+retry:
+   alloc = idr_pre_get(&ids->ipcs_idr, GFP_KERNEL);
 
mutex_lock(&ids->mutex);
ipcp = ipc_findkey(ids, params->key);
@@ -382,8 +385,6 @@ int ipcget_public(struct ipc_namespace *
/* key not used */
if (!(flg & IPC_CREAT))
err = -ENOENT;
-   else if (!err)
-   err = -ENOMEM;
else
err = ops->getnew(ns, params);
} else {
@@ -406,6 +407,12 @@ int ipcget_public(struct ipc_name

[PATCH] IPC: cleanup some code and wrong comments about semundo list management

2007-10-05 Thread Pierre Peiffer
From: Pierre Peiffer <[EMAIL PROTECTED]>

Some comments about sem_undo_list seem wrong.
About the comment above unlock_semundo():
"... If task2 now exits before task1 releases the lock (by calling
unlock_semundo()), then task1 will never call spin_unlock(). ..."

This is just wrong: I see no reason why task1 would not call
spin_unlock()... The rest of this comment is also wrong, unless I'm
missing something (of course).

Finally, the (un)lock_semundo() functions are useless: their only caller,
find_undo(), runs after get_undo_list() has guaranteed that
current->sysvsem.undo_list is non-NULL, so the wrappers' NULL check never
triggers. Remove them for simplification (this avoids a useless 'if'
statement).

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---

 ipc/sem.c |   46 ++
 1 files changed, 6 insertions(+), 40 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index b676fef..5585817 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -957,36 +957,6 @@ asmlinkage long sys_semctl (int semid, int semnum, int cmd, union semun arg)
}
 }
 
-static inline void lock_semundo(void)
-{
-   struct sem_undo_list *undo_list;
-
-   undo_list = current->sysvsem.undo_list;
-   if (undo_list)
-   spin_lock(&undo_list->lock);
-}
-
-/* This code has an interaction with copy_semundo().
- * Consider; two tasks are sharing the undo_list. task1
- * acquires the undo_list lock in lock_semundo().  If task2 now
- * exits before task1 releases the lock (by calling
- * unlock_semundo()), then task1 will never call spin_unlock().
- * This leave the sem_undo_list in a locked state.  If task1 now creats task3
- * and once again shares the sem_undo_list, the sem_undo_list will still be
- * locked, and future SEM_UNDO operations will deadlock.  This case is
- * dealt with in copy_semundo() by having it reinitialize the spin lock when 
- * the refcnt goes from 1 to 2.
- */
-static inline void unlock_semundo(void)
-{
-   struct sem_undo_list *undo_list;
-
-   undo_list = current->sysvsem.undo_list;
-   if (undo_list)
-   spin_unlock(&undo_list->lock);
-}
-
-
 /* If the task doesn't already have a undo_list, then allocate one
  * here.  We guarantee there is only one thread using this undo list,
  * and current is THE ONE
@@ -1047,9 +1017,9 @@ static struct sem_undo *find_undo(struct ipc_namespace *ns, int semid)
if (error)
return ERR_PTR(error);
 
-   lock_semundo();
+   spin_lock(&ulp->lock);
un = lookup_undo(ulp, semid);
-   unlock_semundo();
+   spin_unlock(&ulp->lock);
if (likely(un!=NULL))
goto out;
 
@@ -1077,10 +1047,10 @@ static struct sem_undo *find_undo(struct ipc_namespace *ns, int semid)
new->semadj = (short *) &new[1];
new->semid = semid;
 
-   lock_semundo();
+   spin_lock(&ulp->lock);
un = lookup_undo(ulp, semid);
if (un) {
-   unlock_semundo();
+   spin_unlock(&ulp->lock);
kfree(new);
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
@@ -1091,7 +1061,7 @@ static struct sem_undo *find_undo(struct ipc_namespace *ns, int semid)
ipc_rcu_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma);
-   unlock_semundo();
+   spin_unlock(&ulp->lock);
kfree(new);
un = ERR_PTR(-EIDRM);
goto out;
@@ -1102,7 +1072,7 @@ static struct sem_undo *find_undo(struct ipc_namespace *ns, int semid)
sma->undo = new;
sem_unlock(sma);
un = new;
-   unlock_semundo();
+   spin_unlock(&ulp->lock);
 out:
return un;
 }
@@ -1279,10 +1249,6 @@ asmlinkage long sys_semop (int semid, struct sembuf __user *tsops, unsigned nsop
 
 /* If CLONE_SYSVSEM is set, establish sharing of SEM_UNDO state between
  * parent and child tasks.
- *
- * See the notes above unlock_semundo() regarding the spin_lock_init()
- * in this code.  Initialize the undo_list->lock here instead of get_undo_list()
- * because of the reasoning in the comment above unlock_semundo.
  */
 
 int copy_semundo(unsigned long clone_flags, struct task_struct *tsk)


Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/2] New API to change the IDs of an existing IPC

2007-10-01 Thread Pierre Peiffer


Michael Kerrisk wrote:
> Hi Pierre,
> 
>> As I'm seeing some discussion/interest about IPC, I would like to 
>> propose
>>  these patches, which provide an easy way to change the ID of an existing IPC.
>> This work is done around the checkpoint/restart of applications. In the case 
>> of
>> the IPCs, we need (among others) this functionality.
> 
> Can you give some more detailed explanation of why this
> functionality is needed.

Sure; in the case of checkpoint/restart, when you restart an application,
what you want is to recreate all system resources with the same properties
they had when the application was checkpointed.
For IPCs, this means that you need to recreate all the IPCs with the same IDs 
(at least).
For now, this ID is computed by the system when an IPC is created and you can't 
specify or modify it.

These patches give you the possibility of changing this ID once the IPC is 
created.

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 2/2] System V IPC: new IPC_SETID command to modify an ID

2007-09-28 Thread Pierre Peiffer


From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch adds a new IPC_SETID command to the System V IPC set of commands,
which allows changing the ID of an existing IPC.

This command can be used through the semctl/shmctl/msgctl API, with the new
ID passed as the third argument for msgctl and shmctl (instead of a pointer)
and through the fourth argument for semctl; a usage sketch follows the rules
below.

To be successful, the following rules must be respected:
- the IPC exists
- the user must be allowed to change the IPC attributes regarding the IPC
  permissions.
- the new ID must satisfy the ID computation rule.
- the entry (in the kernel internal table of IPCs) corresponding to the new
  ID must be free.
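For illustration, user-space usage would look something like this
(hypothetical, since IPC_SETID and this calling convention exist only in
this RFC; new_id is a made-up variable):

	/* msgctl()/shmctl(): the new ID replaces the usual pointer argument */
	if (msgctl(msqid, IPC_SETID, (struct msqid_ds *)(long)new_id) == -1)
		perror("msgctl(IPC_SETID)");

	/* semctl(): the new ID goes through the fourth argument */
	union semun { int val; } arg = { .val = new_id };	/* caller-defined, as usual */
	if (semctl(semid, 0, IPC_SETID, arg) == -1)
		perror("semctl(IPC_SETID)");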

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 include/linux/ipc.h  |9 
 ipc/msg.c|   31 +++---
 ipc/sem.c|   31 +++---
 ipc/shm.c|   55 ++
 security/selinux/hooks.c |3 +++
 5 files changed, 100 insertions(+), 29 deletions(-)

diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 3fd3ddd..f1edef5 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -35,10 +35,11 @@ struct ipc_perm
  * Control commands used with semctl, msgctl and shmctl 
  * see also specific commands in sem.h, msg.h and shm.h
  */
-#define IPC_RMID 0 /* remove resource */
-#define IPC_SET  1 /* set ipc_perm options */
-#define IPC_STAT 2 /* get ipc_perm options */
-#define IPC_INFO 3 /* see ipcs */
+#define IPC_RMID  0 /* remove resource */
+#define IPC_SET   1 /* set ipc_perm options */
+#define IPC_STAT  2 /* get ipc_perm options */
+#define IPC_INFO  3 /* see ipcs */
+#define IPC_SETID 4 /* set ipc ID */
 
 /*
  * Version flags for semctl, msgctl, and shmctl commands
diff --git a/ipc/msg.c b/ipc/msg.c
index d9d4093..9671156 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -81,6 +81,8 @@ static struct ipc_ids init_msg_ids;
 #define msg_buildid(ns, id, seq) \
ipc_buildid(&msg_ids(ns), id, seq)
 
+static long msg_chid_nolock(struct ipc_namespace *ns, struct msg_queue *msq,
+   int newid);
 static void freeque (struct ipc_namespace *ns, struct msg_queue *msq, int id);
 static int newque (struct ipc_namespace *ns, key_t key, int msgflg);
 #ifdef CONFIG_PROC_FS
@@ -382,6 +384,21 @@ copy_msqid_from_user(struct msq_setbuf *out, void __user 
*buf, int version)
}
 }
 
+static long msg_chid_nolock(struct ipc_namespace *ns, struct msg_queue *msq,
+   int newid)
+{
+   long err;
+   err = ipc_mvid(&msg_ids(ns), msq->q_id,
+  newid, ns->msg_ctlmni);
+
+   if (err)
+   return err;
+
+   msq->q_id = newid;
+   msq->q_ctime = get_seconds();
+   return 0;
+}
+
 long msg_mvid(struct ipc_namespace *ns, int id, int newid)
 {
long err;
@@ -398,14 +415,7 @@ long msg_mvid(struct ipc_namespace *ns, int id, int newid)
if (err)
goto out_unlock_up;
 
-   err = ipc_mvid(&msg_ids(ns), id,
-  newid, ns->msg_ctlmni);
-
-   if (err)
-   goto out_unlock_up;
-
-   msq->q_id = newid;
-   msq->q_ctime = get_seconds();
+   err = msg_chid_nolock(ns, msq, newid);
 
 out_unlock_up:
msg_unlock(msq);
@@ -521,6 +531,7 @@ asmlinkage long sys_msgctl(int msqid, int cmd, struct 
msqid_ds __user *buf)
if (copy_msqid_from_user(&setbuf, buf, version))
return -EFAULT;
break;
+   case IPC_SETID:
case IPC_RMID:
break;
default:
@@ -583,6 +594,10 @@ asmlinkage long sys_msgctl(int msqid, int cmd, struct 
msqid_ds __user *buf)
msg_unlock(msq);
break;
}
+   case IPC_SETID:
+   err = msg_chid_nolock(ns, msq, (int)buf);
+   msg_unlock(msq);
+   break;
case IPC_RMID:
freeque(ns, msq, msqid);
break;
diff --git a/ipc/sem.c b/ipc/sem.c
index 606f2e9..b78b433 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -98,6 +98,8 @@
 
 static struct ipc_ids init_sem_ids;
 
+static long sem_chid_nolock(struct ipc_namespace *ns, struct sem_array *sma,
+   int newid);
 static int newary(struct ipc_namespace *, key_t, int, int);
 static void freeary(struct ipc_namespace *ns, struct sem_array *sma, int id);
 #ifdef CONFIG_PROC_FS
@@ -906,6 +908,10 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid, int semnum,
sem_unlock(sma);
err = 0;
break;
+   case IPC_SETID:
+   err = sem_chid_nolock(ns, sma, (int)arg.val);
+   sem_unlock(sma);
+   break;
default:
sem_unlock(sma);
err = -EINVAL;
@@ -918,6 +924,21 @@ out_unlock:
ret

[RFC][PATCH 1/2] System V IPC: new kernel API to change an ID

2007-09-28 Thread Pierre Peiffer


From: Pierre Peiffer <[EMAIL PROTECTED]>

This patch provides three new APIs for changing the ID of an existing
System V IPC.

These APIs are:
long msg_mvid(struct ipc_namespace *ns, int id, int newid);
long sem_mvid(struct ipc_namespace *ns, int id, int newid);
long shm_mvid(struct ipc_namespace *ns, int id, int newid);

They return 0 or an error code in case of failure.

They may be useful for setting a specific ID for an IPC when preparing
a restart operation.

To be successful, the following rules must be respected:
- the IPC exists (of course...)
- the new ID must satisfy the ID computation rule (sketched below).
- the entry (in the kernel internal table of IPCs) corresponding to the new
  ID must be free.
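For reference, the "ID computation rule" refers to how visible IPC ids were
built at that time (paraphrased from the ipc/util.h of that era; treat the
exact constants as an assumption):

	#define IPCMNI		32768	/* max number of ipc slots */
	#define SEQ_MULTIPLIER	(IPCMNI)

	/* visible id = sequence counter * SEQ_MULTIPLIER + idr slot index */
	#define ipc_buildid(ids, id, seq)	(SEQ_MULTIPLIER * (seq) + (id))

So a proposed newid is acceptable only if newid % SEQ_MULTIPLIER names a
free slot below the per-namespace limit, with the sequence part stored as
newid / SEQ_MULTIPLIER.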

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 include/linux/msg.h |1 +
 include/linux/sem.h |1 +
 include/linux/shm.h |1 +
 ipc/msg.c   |   32 +++
 ipc/sem.c   |   32 +++
 ipc/shm.c   |   30 ++
 ipc/util.c  |   60 +++
 ipc/util.h  |1 +
 8 files changed, 158 insertions(+), 0 deletions(-)

diff --git a/include/linux/msg.h b/include/linux/msg.h
index f1b6074..5a1db95 100644
--- a/include/linux/msg.h
+++ b/include/linux/msg.h
@@ -97,6 +97,7 @@ extern long do_msgsnd(int msqid, long mtype, void __user 
*mtext,
size_t msgsz, int msgflg);
 extern long do_msgrcv(int msqid, long *pmtype, void __user *mtext,
size_t msgsz, long msgtyp, int msgflg);
+long msg_mvid(struct ipc_namespace *ns, int id, int newid);
 
 #endif /* __KERNEL__ */
 
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 9aaffb0..b5989fb 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -142,6 +142,7 @@ struct sysv_sem {
 
 extern int copy_semundo(unsigned long clone_flags, struct task_struct *tsk);
 extern void exit_sem(struct task_struct *tsk);
+long sem_mvid(struct ipc_namespace *ns, int id, int newid);
 
 #else
 static inline int copy_semundo(unsigned long clone_flags, struct task_struct 
*tsk)
diff --git a/include/linux/shm.h b/include/linux/shm.h
index ad2e3af..f4ae995 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -97,6 +97,7 @@ struct shmid_kernel /* private to the kernel */
 #ifdef CONFIG_SYSVIPC
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long 
*addr);
 extern int is_file_shm_hugepages(struct file *file);
+long shm_mvid(struct ipc_namespace *ns, int id, int newid);
 #else
 static inline long do_shmat(int shmid, char __user *shmaddr,
int shmflg, unsigned long *addr)
diff --git a/ipc/msg.c b/ipc/msg.c
index a03fcb5..d9d4093 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -382,6 +382,38 @@ copy_msqid_from_user(struct msq_setbuf *out, void __user 
*buf, int version)
}
 }
 
+long msg_mvid(struct ipc_namespace *ns, int id, int newid)
+{
+   long err;
+   struct msg_queue *msq;
+
+   mutex_lock(&msg_ids(ns).mutex);
+   msq = msg_lock(ns, id);
+
+   err = -EINVAL;
+   if (msq == NULL)
+   goto out_up;
+
+   err = msg_checkid(ns, msq, id);
+   if (err)
+   goto out_unlock_up;
+
+   err = ipc_mvid(&msg_ids(ns), id,
+  newid, ns->msg_ctlmni);
+
+   if (err)
+   goto out_unlock_up;
+
+   msq->q_id = newid;
+   msq->q_ctime = get_seconds();
+
+out_unlock_up:
+   msg_unlock(msq);
+out_up:
+   mutex_unlock(&msg_ids(ns).mutex);
+   return err;
+}
+
 asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
 {
struct kern_ipc_perm *ipcp;
diff --git a/ipc/sem.c b/ipc/sem.c
index b676fef..606f2e9 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -918,6 +918,38 @@ out_unlock:
return err;
 }
 
+long sem_mvid(struct ipc_namespace *ns, int id, int newid)
+{
+   long err;
+   struct sem_array *sma;
+
+   mutex_lock(&sem_ids(ns).mutex);
+   sma = sem_lock(ns, id);
+
+   err = -EINVAL;
+   if (sma == NULL)
+   goto out_up;
+
+   err = sem_checkid(ns, sma, id);
+   if (err)
+   goto out_unlock_up;
+
+   err = ipc_mvid(&sem_ids(ns), id,
+  newid, ns->sc_semmni);
+
+   if (err)
+   goto out_unlock_up;
+
+   sma->sem_id = newid;
+   sma->sem_ctime = get_seconds();
+
+out_unlock_up:
+   sem_unlock(sma);
+out_up:
+   mutex_unlock(&sem_ids(ns).mutex);
+   return err;
+}
+
 asmlinkage long sys_semctl (int semid, int semnum, int cmd, union semun arg)
 {
int err = -EINVAL;
diff --git a/ipc/shm.c b/ipc/shm.c
index a86a3a5..5f4bca6 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -156,7 +156,37 @@ static inline int shm_addid(struct ipc_namespace *ns, 
struct shmid_kernel *shp)
return ipc_addid(&shm_ids

[RFC][PATCH 0/2] New API to change the IDs of an existing IPC

2007-09-28 Thread Pierre Peiffer
Hi,

As I'm seeing some discussion/interest about IPC, I would like to propose
these patches, which provide an easy way to change the ID of an existing
IPC.
This work is done around the checkpoint/restart of applications. In the case
of the IPCs, we need (among others) this functionality.

Maybe there is some other interest in this, so the RFC concerns both the
code and its usefulness.

The first patch provides the functionality on the kernel side, the second
exports it to user space. They apply on top of the latest -rc8 kernel. I can
adapt them to the latest -mm.

Thanks,

-- 
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [resend][PATCH] Remove duplicated declarations in procfs

2007-09-28 Thread Pierre Peiffer


Andrew Morton wrote:
> 
> yup, thanks, the maps2 patches in -mm already accidentally fixed this so I 
> haven't
> bothered merging it as a standalone thing.
> 
Ok, good.

Thanks,

-- 
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[resend][PATCH] Remove duplicated declarations in procfs

2007-09-28 Thread Pierre Peiffer

From: Pierre Peiffer <[EMAIL PROTECTED]>

This is a trivial patch that removes some duplicated declarations of
extern variables.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 fs/proc/internal.h |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index b215c35..d812816 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -50,10 +50,6 @@ extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;
 
-extern const struct file_operations proc_maps_operations;
-extern const struct file_operations proc_numa_maps_operations;
-extern const struct file_operations proc_smaps_operations;
-
 
 void free_proc_entry(struct proc_dir_entry *de);
 

-- 
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] fs/proc/internal.h cleanup.

2007-09-14 Thread Pierre Peiffer

From: Pierre Peiffer <[EMAIL PROTECTED]>

These extern variables are declared twice, so this patch removes one of the declarations.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>
---

 fs/proc/internal.h |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index b215c35..d812816 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -50,10 +50,6 @@ extern const struct file_operations proc_maps_operations;
 extern const struct file_operations proc_numa_maps_operations;
 extern const struct file_operations proc_smaps_operations;
 
-extern const struct file_operations proc_maps_operations;
-extern const struct file_operations proc_numa_maps_operations;
-extern const struct file_operations proc_smaps_operations;
-
 
 void free_proc_entry(struct proc_dir_entry *de);
 

-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Futex: Revert the non-functional REQUEUE_PI

2007-06-18 Thread Pierre Peiffer

Thomas Gleixner wrote:
Patch d0aa7a70bf03b9de9e995ab272293be1f7937822 titled 


"futex_requeue_pi optimization"

introduced user space visible changes to the futex syscall.

The patch is non-functional and there is no way to fix it proper before
the 2.6.22 release. 


The breakage report ( http://lkml.org/lkml/2007/5/12/17 ) went
unanswered,


Sorry, but I spent a lot of time on this last year without any answers or
comments from the community when I sent my patches.
Now I'm working on something else and I can't spend more time on this for
now... Futexes are so complex that it is always difficult to investigate a
problem in only one or two hours...



and unfortunately it turned out that the concept is not
feasible at all. 


Without robust futexes, the concept works well, and the performance gain
has been proven.


It violates the rtmutex semantics badly by introducing

a virtual owner, which hacks around the coupling of the user-space
pi_futex and the kernel internal rt_mutex representation.


What you call a hack is for me a design point. And if this is a hack, then
I think we can say that futexes are built on a series of hacks...

But, okay, this is not very constructive, so...

Ulrich Drepper wrote :
>
> Indeed.  A lot more discussion is needed to handle this correctly.  No
> committed code in glibc so far uses the function so removal is no problem.

Once again, it's a pity that people didn't spend time commenting on this
when I was working on it.

Now, the work will probably just be lost...

--
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, take6] FUTEX : new PRIVATE futexes

2007-04-26 Thread Pierre Peiffer

Eric Dumazet wrote:

Hi Andrew

Not sure if you prefer to wait for Pierre's work on futex64, so just in
case, I prepared this patch.

Update on this take6 :

- Rebased on linux-2.6.21-rc7-mm2 , since futex64 were droped from mm


Pierre, I can resubmit another patch on top of your next patch, so please
do as you prefer (ignoring this patch or not).


Thank you for taking care of this.
But I think your patch is more mature than the futex64 patch; so it's OK for
me, and it's probably simpler and better for "the community" (and for
Andrew's work) to merge this patch first, because futex64 may still need
several reworks...


--
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -mm take2] 64bit-futex - provide new commands instead of new syscall

2007-04-24 Thread Pierre Peiffer

Ulrich Drepper wrote:


It looks mostly good.  I wouldn't use the high bit to differentiate
the 64-bit operations, though.  Since we do not allow to apply it to
all operations the only effect will be that the compiler has a harder
time generating the code for the switch statement.  If you use
continuous values a simple jump table can be used and no conditionals.
Smaller and faster.



Something like that may be...

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>


--
Pierre
---
 include/asm-ia64/futex.h|8 -
 include/asm-powerpc/futex.h |6 -
 include/asm-s390/futex.h|8 -
 include/asm-sparc64/futex.h |8 -
 include/asm-um/futex.h  |9 -
 include/asm-x86_64/futex.h  |   86 --
 include/asm-x86_64/unistd.h |2 
 include/linux/futex.h   |6 +
 include/linux/syscalls.h|3 
 kernel/futex.c  |  203 ++--
 kernel/futex_compat.c   |2 
 kernel/sys_ni.c |1 
 12 files changed, 95 insertions(+), 247 deletions(-)

Index: b/include/asm-ia64/futex.h
===
--- a/include/asm-ia64/futex.h
+++ b/include/asm-ia64/futex.h
@@ -124,13 +124,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* _ASM_FUTEX_H */
Index: b/include/asm-powerpc/futex.h
===
--- a/include/asm-powerpc/futex.h
+++ b/include/asm-powerpc/futex.h
@@ -119,11 +119,5 @@ futex_atomic_cmpxchg_inatomic64(u64 __us
 	return 0;
 }
 
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
-}
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_FUTEX_H */
Index: b/include/asm-s390/futex.h
===
--- a/include/asm-s390/futex.h
+++ b/include/asm-s390/futex.h
@@ -51,13 +51,7 @@ static inline int futex_atomic_cmpxchg_i
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* __KERNEL__ */
Index: b/include/asm-sparc64/futex.h
===
--- a/include/asm-sparc64/futex.h
+++ b/include/asm-sparc64/futex.h
@@ -108,13 +108,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* !(_SPARC64_FUTEX_H) */
Index: b/include/asm-um/futex.h
===
--- a/include/asm-um/futex.h
+++ b/include/asm-um/futex.h
@@ -6,14 +6,7 @@
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
+	return -ENOSYS;
 }
 
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
-}
-
-
 #endif
Index: b/include/asm-x86_64/futex.h
===
--- a/include/asm-x86_64/futex.h
+++ b/include/asm-x86_64/futex.h
@@ -41,38 +41,6 @@
 	  "=&r" (tem)		\
 	: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
-#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
-  __asm__ __volatile (		\
-"1:	" insn "\n"		\
-"2:	.section .fixup,\"ax\"\n\
-3:	movq	%3, %1\n\
-	jmp	2b\n\
-	.previous\n\
-	.section __ex_table,\"a\"\n\
-	.align	8\n\
-	.quad	1b,3b\n\
-	.previous"		\
-	: "=r" (oldval), "=r" (ret), "=m" (*uaddr)		\
-	: "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
-
-#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
-  __asm__ __volatile (		\
-"1:	movq	%2, %0\n\
-	movq	%0, %3\n"	\
-	insn "\n"		\
-"2:	" LOCK_PREFIX "cmpxchgq %3, %2\n\
-	jnz	1b\n\
-3:	.section .fixup,\"ax\"\n\
-4:	movq	%5, %1\n\
-	jmp	3b\n\
-	.previous\n\
-	.section __ex_table,\"a\"\n\
-	.align	8\n\
-	.quad	1b,4b,2b,4b\n\
-	.previous"		\
-	: "=&a" (oldval), "=&r" (ret), "=m" (*uaddr),		\
-	  "=&r" (tem)		\
-	: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
@@ -128,60 +96,6 @@ futex_atomic_op_inuser (int encoded_

[PATCH -mm] 64bit-futex - provide new commands instead of new syscall

2007-04-23 Thread Pierre Peiffer

Hi,

Jakub Jelinek wrote:


I don't think you should blindly add all operations to sys_futex64 without
thinking what they really do.
E.g. FUTEX_{{,UN,TRY}LOCK,CMP_REQUEUE}_PI doesn't really make any sense for 
64-bit
futexes, the format of PI futexes is hardcoded in the kernel and is always
32-bit, see FUTEX_TID_MASK, FUTEX_WAITERS, FUTEX_OWNER_DIED definitions.
exit_robust_list/handle_futex_death will handle 32-bit PI futexes anyway.
Similarly, sys_futex64 shouldn't support the obsolete operations that
are there solely for compatibility (e.g. FUTEX_REQUEUE or FUTEX_FD).

When you just -ENOSYS on the PI ops, there is no need to implement
futex_atomic_cmpxchg_inatomic64.

FUTEX_WAKE_OP is questionable for 64-bit, IMHO it is better to just
-ENOSYS on it and only if anyone ever finds actual uses for it, add it.

For 64-bit futexes the only needed operations are actually
FUTEX_WAIT and perhaps FUTEX_CMP_REQUEUE, so I wonder if it isn't
better to just add FUTEX_WAIT64 and FUTEX_CMP_REQUEUE64 ops to sys_futex
instead of adding a new syscall.

But the FUTEX_{{,UN,TRY}LOCK,CMP_REQUEUE}_PI removal for 64-bit futexes
is IMHO the most important part of my complain.



Following this mail sent a few weeks ago, here is a patch which should meet
your requirements.
I've quickly done it on top of the latest -mm (2.6.21-rc6-mm2) and tested it
a little.
To be honest, as I'm not really aware of your exact needs and I don't know
exactly how 64-bit futexes will be used, I can't really maintain it. So I'll
let you take/modify/adapt this patch to your needs.


Thanks,

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>



--
Pierre
---
 include/asm-ia64/futex.h|8 -
 include/asm-powerpc/futex.h |6 -
 include/asm-s390/futex.h|8 -
 include/asm-sparc64/futex.h |8 -
 include/asm-um/futex.h  |9 -
 include/asm-x86_64/futex.h  |   86 ---
 include/asm-x86_64/unistd.h |2 
 include/linux/futex.h   |8 +
 include/linux/syscalls.h|3 
 kernel/futex.c  |  199 +---
 kernel/futex_compat.c   |2 
 kernel/sys_ni.c |1 
 12 files changed, 93 insertions(+), 247 deletions(-)

Index: linux-2.6.21-rc6-mm2/include/asm-ia64/futex.h
===
--- linux-2.6.21-rc6-mm2.orig/include/asm-ia64/futex.h	2007-04-20 14:01:25.0 +0200
+++ linux-2.6.21-rc6-mm2/include/asm-ia64/futex.h	2007-04-20 13:50:00.0 +0200
@@ -124,13 +124,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* _ASM_FUTEX_H */
Index: linux-2.6.21-rc6-mm2/include/asm-powerpc/futex.h
===
--- linux-2.6.21-rc6-mm2.orig/include/asm-powerpc/futex.h	2007-04-20 14:01:25.0 +0200
+++ linux-2.6.21-rc6-mm2/include/asm-powerpc/futex.h	2007-04-20 13:51:49.0 +0200
@@ -119,11 +119,5 @@ futex_atomic_cmpxchg_inatomic64(u64 __us
 	return 0;
 }
 
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
-}
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_FUTEX_H */
Index: linux-2.6.21-rc6-mm2/include/asm-s390/futex.h
===
--- linux-2.6.21-rc6-mm2.orig/include/asm-s390/futex.h	2007-04-20 14:01:24.0 +0200
+++ linux-2.6.21-rc6-mm2/include/asm-s390/futex.h	2007-04-20 13:47:30.0 +0200
@@ -51,13 +51,7 @@ static inline int futex_atomic_cmpxchg_i
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* __KERNEL__ */
Index: linux-2.6.21-rc6-mm2/include/asm-sparc64/futex.h
===
--- linux-2.6.21-rc6-mm2.orig/include/asm-sparc64/futex.h	2007-04-20 14:01:25.0 +0200
+++ linux-2.6.21-rc6-mm2/include/asm-sparc64/futex.h	2007-04-20 13:48:48.0 +0200
@@ -108,13 +108,7 @@ futex_atomic_cmpxchg_inatomic(int __user
 static inline u64
 futex_atomic_cmpxchg_inatomic64(u64 __user *uaddr, u64 oldval, u64 newval)
 {
-	return 0;
-}
-
-static inline int
-futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
-{
-	return 0;
+	return -ENOSYS;
 }
 
 #endif /* !(_SPARC64_FUTEX_H) */
Index: linux-2.6.21-rc6-mm2/include/asm-um/futex.h
===
--- linux-2.6.21-rc6-mm2.orig/include/asm-um/futex.h	2007-04-20 14:01:25.0 +0200
+++ linux-2.6.21-rc6-mm2/include/asm-um/futex.h	2007-04-20 13:51:42.

Re: [PATCH -mm] fix undefined symbol if CONFIG_PAGE_GROUP_BY_MOBILITY not set

2007-04-20 Thread Pierre Peiffer



Hi,

Your fix looks correct but the compile-time option 
CONFIG_PAGE_GROUP_BY_MOBILITY was removed in a patch I sent to Andrew 
two days ago. It went into mm-commits last night so this problem should 
no longer exist.


Ah ok, fine. Just forget it in this case :)
Thanks,

--
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -mm] fix undefined symbol if CONFIG_PAGE_GROUP_BY_MOBILITY not set

2007-04-20 Thread Pierre Peiffer
Hi,

This is a fix against the patch
do-not-group-pages-by-mobility-type-on-low-memory-systems.patch (included in
the -mm tree): the error "page_group_by_mobility_disabled undefined"
occurred if CONFIG_PAGE_GROUP_BY_MOBILITY is not set.

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |5 +
 1 file changed, 5 insertions(+)

Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2297,6 +2297,7 @@ void __meminit build_all_zonelists(void)
}
vm_total_pages = nr_free_pagecache_pages();
 
+#ifdef CONFIG_PAGE_GROUP_BY_MOBILITY
/*
 * Disable grouping by mobility if the number of pages in the
 * system is too low to allow the mechanism to work. It would be
@@ -2313,6 +2314,10 @@ void __meminit build_all_zonelists(void)
num_online_nodes(),
page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
+#else
+   printk("Built %i zonelists.  Total pages: %ld\n",
+  num_online_nodes(), vm_total_pages);
+#endif
 }
 
 /*


-- 
Pierre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-11 Thread Pierre Peiffer

Nick Piggin wrote:


But... that isn't there in mainline. Why is it in -mm?


This was introduced by the lguest code; I did not follow exactly why.

Pierre

At any rate, that makes it a no-brainer to change.



As this external thing certainly is not doing the check itself, to be
on the safe side we should enforce it in get_futex_key(). I agree with
you: if we want to maximize performance, we could say that the check
*must* be done by the caller.


Well we _control_ the API, so let's make it as clean and performant as
possible from the start.



--
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -mm] Fix: timeout not passed anymore to futex_lock_pi

2007-03-26 Thread Pierre Peiffer

This is a fix for a bug introduced by the patch
make-futex_wait-use-an-hrtimer-for-timeout.patch: the timeout value
is no longer passed to futex_lock_pi().

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 kernel/futex.c|8 ++--
 kernel/futex_compat.c |4 +++-
 2 files changed, 9 insertions(+), 3 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2383,8 +2383,10 @@ sys_futex64(u64 __user *uaddr, int op, u
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
+
+   t = timespec_to_ktime(ts);
if (op == FUTEX_WAIT)
-   t = ktime_add(ktime_get(), timespec_to_ktime(ts));
+   t = ktime_add(ktime_get(), t);
tp = &t;
}
/*
@@ -2413,8 +2415,10 @@ asmlinkage long sys_futex(u32 __user *ua
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
+
+   t = timespec_to_ktime(ts);
if (op == FUTEX_WAIT)
-   t = ktime_add(ktime_get(), timespec_to_ktime(ts));
+   t = ktime_add(ktime_get(), t);
tp = &t;
}
/*
Index: b/kernel/futex_compat.c
===
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -150,8 +150,10 @@ asmlinkage long compat_sys_futex(u32 __u
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
+
+   t = timespec_to_ktime(ts);
if (op == FUTEX_WAIT)
-   t = ktime_add(ktime_get(), timespec_to_ktime(ts));
+   t = ktime_add(ktime_get(), t);
tp = &t;
}
if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE


-- 
Pierre Peiffer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.21-rc4-mm1 2/4] Make futex_wait() use an hrtimer for timeout

2007-03-21 Thread Pierre . Peiffer
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

  schedule_timeout() is tick based, therefore the timeout granularity is
the tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution
timer for timeout wakeup, we can attain a much finer timeout granularity
(in the microsecond range). This parallels what is already done for
futex_lock_pi().

  The timeout passed to the syscall is no longer converted to jiffies
and is therefore passed to do_futex() and futex_wait() as an absolute
ktime_t, thereby keeping nanosecond resolution.

  Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.
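The syscall glue then handles the user timespec roughly as follows (a
sketch mirroring the sys_futex paths in this series; utime and op come from
the syscall arguments):

	struct timespec ts;
	ktime_t t, *tp = NULL;

	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
		if (copy_from_user(&ts, utime, sizeof(ts)))
			return -EFAULT;
		if (!timespec_valid(&ts))
			return -EINVAL;

		t = timespec_to_ktime(ts);
		if (op == FUTEX_WAIT)	/* FUTEX_WAIT takes a relative timeout */
			t = ktime_add(ktime_get(), t);
		tp = &t;
	}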

  In futex_wait(), if there is no timeout then a regular schedule() is
performed. Otherwise, an hrtimer is fired before schedule() is called.

Signed-off-by: Sebastien Dugue <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/linux/futex.h |3 +
 kernel/futex.c|   85 --
 kernel/futex_compat.c |   17 --
 3 files changed, 51 insertions(+), 54 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1001,16 +1001,16 @@ static void unqueue_me_pi(struct futex_q
 }
 
 static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait_abstime(u32 __user *uaddr, u32 val,
-   int timed, unsigned long abs_time)
+static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *abs_time)
 {
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
struct futex_hash_bucket *hb;
struct futex_q q;
-   unsigned long time_left = 0;
u32 uval;
int ret;
+   struct hrtimer_sleeper t;
+   int rem = 0;
 
q.pi_state = NULL;
  retry:
@@ -1088,20 +1088,29 @@ static int futex_wait_abstime(u32 __user
 * !plist_node_empty() is safe here without any lock.
 * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
 */
-   time_left = 0;
if (likely(!plist_node_empty(&q.list))) {
-   unsigned long rel_time;
+   if (!abs_time)
+   schedule();
+   else {
+   hrtimer_init(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+   hrtimer_init_sleeper(&t, current);
+   t.timer.expires = *abs_time;
+
+   hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
+
+   /*
+* the timer could have already expired, in which
+* case current would be flagged for rescheduling.
+* Don't bother calling schedule.
+*/
+   if (likely(t.task))
+   schedule();
 
-   if (timed) {
-   unsigned long now = jiffies;
-   if (time_after(now, abs_time))
-   rel_time = 0;
-   else
-   rel_time = abs_time - now;
-   } else
-   rel_time = MAX_SCHEDULE_TIMEOUT;
+   hrtimer_cancel(&t.timer);
 
-   time_left = schedule_timeout(rel_time);
+   /* Flag if a timeout occurred */
+   rem = (t.task == NULL);
+   }
}
__set_current_state(TASK_RUNNING);
 
@@ -1113,14 +1122,14 @@ static int futex_wait_abstime(u32 __user
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!unqueue_me(&q))
return 0;
-   if (time_left == 0)
+   if (rem)
return -ETIMEDOUT;
 
/*
 * We expect signal_pending(current), but another thread may
 * have handled it for us already.
 */
-   if (time_left == MAX_SCHEDULE_TIMEOUT)
+   if (!abs_time)
return -ERESTARTSYS;
else {
struct restart_block *restart;
@@ -1128,8 +1137,7 @@ static int futex_wait_abstime(u32 __user
restart->fn = futex_wait_restart;
restart->arg0 = (unsigned long)uaddr;
restart->arg1 = (unsigned long)val;
-   restart->arg2 = (unsigned long)timed;
-   restart->arg3 = abs_time;
+   restart->arg2 = (unsigned long)abs_time;
return -ERESTART_RESTARTBLOCK;
}
 
@@ -1141,21 +1149,15 @@ static int futex_wait_abstime(u32 __user
return ret;
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long rel_time)
-{
-   int timed = (rel_time != MAX_SCHEDULE_TIMEOUT);
-   return futex_wait_abstime(uaddr, val, timed, jiffies+rel_time);
-}
 
 static long futex

[PATCH 2.6.21-rc4-mm1 3/4] futex_requeue_pi optimization

2007-03-21 Thread Pierre . Peiffer
This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of
a PI-futex.

This provides an optimization, already used for (normal) futexes, to be
available for the PI-futexes as well.

This optimization is currently used by glibc in pthread_cond_broadcast(),
when using "normal" mutexes. With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too; a hypothetical call sketch follows.
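A hypothetical broadcast path could then issue something like this (sketch
only: FUTEX_CMP_REQUEUE_PI comes from this patch, cond_word/pi_mutex_word
are made-up futex words, and the argument order is assumed to mirror the
existing FUTEX_CMP_REQUEUE convention):

	/* wake at most one waiter on the condvar word,
	 * requeue the rest onto the PI-mutex word */
	syscall(SYS_futex, &cond_word, FUTEX_CMP_REQUEUE_PI,
		1,			/* nr_wake */
		INT_MAX,		/* nr_requeue */
		&pi_mutex_word,		/* uaddr2: the PI futex */
		cond_val);		/* expected value of cond_word */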

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/linux/futex.h   |9 
 kernel/futex.c  |  541 +++-
 kernel/futex_compat.c   |3 
 kernel/rtmutex.c|   41 ---
 kernel/rtmutex_common.h |   34 +++
 5 files changed, 540 insertions(+), 88 deletions(-)

Index: b/include/linux/futex.h
===
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -16,6 +16,7 @@
 #define FUTEX_LOCK_PI  6
 #define FUTEX_UNLOCK_PI7
 #define FUTEX_TRYLOCK_PI   8
+#define FUTEX_CMP_REQUEUE_PI   9
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -84,9 +85,14 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED   0x4000
 
 /*
+ * Some processes have been requeued on this PI-futex
+ */
+#define FUTEX_WAITER_REQUEUED  0x2000
+
+/*
  * The rest of the robust-futex field is for the TID:
  */
-#define FUTEX_TID_MASK 0x3fff
+#define FUTEX_TID_MASK 0x0fff
 
 /*
  * This limit protects against a deliberately circular list.
@@ -110,6 +116,7 @@ handle_futex_death(u32 __user *uaddr, st
  * We set bit 0 to indicate if it's an inode-based key.
  */
 union futex_key {
+   u32 __user *uaddr;
struct {
unsigned long pgoff;
struct inode *inode;
Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -53,6 +53,12 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# include "rtmutex-debug.h"
+#else
+# include "rtmutex.h"
+#endif
+
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
@@ -102,6 +108,12 @@ struct futex_q {
/* Optional priority inheritance state: */
struct futex_pi_state *pi_state;
struct task_struct *task;
+
+   /*
+* This waiter is used in case of requeue from a
+* normal futex to a PI-futex
+*/
+   struct rt_mutex_waiter waiter;
 };
 
 /*
@@ -180,6 +192,9 @@ int get_futex_key(u32 __user *uaddr, uni
if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
 
+   /* Save the user address in the key */
+   key->uaddr = uaddr;
+
/*
 * Private mappings are handled in a simple way.
 *
@@ -439,7 +454,8 @@ void exit_pi_state_list(struct task_stru
 }
 
 static int
-lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, struct futex_q *me)
+lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+   union futex_key *key, struct futex_pi_state **ps)
 {
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
@@ -450,7 +466,7 @@ lookup_pi_state(u32 uval, struct futex_h
head = &hb->chain;
 
plist_for_each_entry_safe(this, next, head, list) {
-   if (match_futex(&this->key, &me->key)) {
+   if (match_futex(&this->key, key)) {
/*
 * Another waiter already exists - bump up
 * the refcount and return its pi_state:
@@ -465,7 +481,7 @@ lookup_pi_state(u32 uval, struct futex_h
WARN_ON(!atomic_read(&pi_state->refcount));
 
atomic_inc(&pi_state->refcount);
-   me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
}
@@ -492,7 +508,7 @@ lookup_pi_state(u32 uval, struct futex_h
rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);
 
/* Store the key for possible exit cleanups: */
-   pi_state->key = me->key;
+   pi_state->key = *key;
 
spin_lock_irq(&p->pi_lock);
WARN_ON(!list_empty(&pi_state->list));
@@ -502,7 +518,7 @@ lookup_pi_state(u32 uval, struct futex_h
 
put_task_struct(p);
 
-   me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
 }
@@ -562,6 +578,8 @@ static int wake_futex_pi(u32 __user *uad
 */
if (!(uval & FUTEX_OWNER_DIED)) {
newval = FUTEX_WAITERS | new_owner->pid;
+   /* Keep the FUTEX_WAITER_REQUEUED flag if it was set */
+   newval |= (uval & FUTEX_WAITER_REQUEUED);
 
pagefault_disable();
   

[PATCH 2.6.21-rc4-mm1 4/4] sys_futex64 : allows 64bit futexes

2007-03-21 Thread Pierre . Peiffer
This last patch is an adaptation of the sys_futex64 syscall provided in the
-rt patch (originally written by Ingo Molnar). It allows the use of 64-bit
futexes.

I have re-worked most of the code to avoid code duplication.

It does not provide the functionality for all architectures (only for x86_64
for now). A hypothetical user-space sketch follows.
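For completeness, a user-space call would look something like this
(hypothetical: __NR_futex64 and the 64-bit semantics exist only in this
series):

	#include <stdint.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/futex.h>

	static long futex64(uint64_t *uaddr, int op, uint64_t val,
			    const struct timespec *timeout)
	{
		return syscall(__NR_futex64, uaddr, op, val, timeout, NULL, 0);
	}

	/* e.g. block until *f is no longer 'expected':
	 *	futex64(&f, FUTEX_WAIT, expected, NULL);
	 */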

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/asm-x86_64/futex.h  |  113 ++
 include/asm-x86_64/unistd.h |4 
 include/linux/futex.h   |9 -
 include/linux/syscalls.h|3 
 kernel/futex.c  |  264 +++-
 kernel/futex_compat.c   |3 
 kernel/sys_ni.c |1 
 7 files changed, 313 insertions(+), 84 deletions(-)

Index: b/include/asm-x86_64/futex.h
===
--- a/include/asm-x86_64/futex.h
+++ b/include/asm-x86_64/futex.h
@@ -41,6 +41,39 @@
  "=&r" (tem)   \
: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
+#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:" insn "\n" \
+"2:.section .fixup,\"ax\"\n\
+3: movq%3, %1\n\
+   jmp 2b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,3b\n\
+   .previous"  \
+   : "=r" (oldval), "=r" (ret), "=m" (*uaddr)  \
+   : "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
+
+#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:movq%2, %0\n\
+   movq%0, %3\n"   \
+   insn "\n"   \
+"2:" LOCK_PREFIX "cmpxchgq %3, %2\n\
+   jnz 1b\n\
+3: .section .fixup,\"ax\"\n\
+4: movq%5, %1\n\
+   jmp 3b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,4b,2b,4b\n\
+   .previous"  \
+   : "=&a" (oldval), "=&r" (ret), "=m" (*uaddr),   \
+ "=&r" (tem)   \
+   : "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
 {
@@ -95,6 +128,60 @@ futex_atomic_op_inuser (int encoded_op, 
 }
 
 static inline int
+futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
+{
+   int op = (encoded_op >> 28) & 7;
+   int cmp = (encoded_op >> 24) & 15;
+   u64 oparg = (encoded_op << 8) >> 20;
+   u64 cmparg = (encoded_op << 20) >> 20;
+   u64 oldval = 0, ret, tem;
+
+   if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
+   oparg = 1 << oparg;
+
+   if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
+   return -EFAULT;
+
+   inc_preempt_count();
+
+   switch (op) {
+   case FUTEX_OP_SET:
+   __futex_atomic_op1_64("xchgq %0, %2", ret, oldval, uaddr, oparg);
+   break;
+   case FUTEX_OP_ADD:
+   __futex_atomic_op1_64(LOCK_PREFIX "xaddq %0, %2", ret, oldval,
+  uaddr, oparg);
+   break;
+   case FUTEX_OP_OR:
+   __futex_atomic_op2_64("orq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   case FUTEX_OP_ANDN:
+   __futex_atomic_op2_64("andq %4, %3", ret, oldval, uaddr, ~oparg);
+   break;
+   case FUTEX_OP_XOR:
+   __futex_atomic_op2_64("xorq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   default:
+   ret = -ENOSYS;
+   }
+
+   dec_preempt_count();
+
+   if (!ret) {
+   switch (cmp) {
+   case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
+   case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
+   case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
+   case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
+   case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
+   case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
+   default: ret = -ENOSYS;
+   }
+   }
+   return ret;
+}
+
+static inline int
 futex_at

[PATCH 2.6.21-rc4-mm1 0/4] Futexes functionalities and improvements

2007-03-21 Thread Pierre . Peiffer
Hi Andrew,

This is a re-send of a series of patches concerning futexes (a short
description follows).
I have reworked the patches to take into account the last changes
about futex, and this series should apply cleanly on -mm tree (the changes
mostly affect patch 2 "futex_wait uses hrtimer")
I also took into account the remark of Peter Zijlstra in patch 3 
concerning futex_requeue_pi.

Could you consider (again) them for inclusion in -mm tree ?

All of them have already been discussed in January and have already 
been included in -rt for a while. I think that we agreed to potentially 
include them in the -mm tree.

And, again, Ulrich is especially interested in sys_futex64.

They are:
* futex uses prio list : allows RT-threads to be woken in priority order
instead of FIFO order.
* futex_wait uses hrtimer : allows the use of finer timer resolution.
* futex_requeue_pi functionality : allows use of requeue optimization for
PI-mutexes/PI-futexes.
* futex64 syscall : allows use of 64-bit futexes instead of 32-bit. 


Thanks,


-- 
Pierre P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.21-rc4-mm1 1/4] futex priority based wakeup

2007-03-21 Thread Pierre . Peiffer
Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plists (priority-ordered lists) instead of a simple
list in futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT threads are woken first, in priority order);
see the enqueue sketch below.
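The enqueue side of the change amounts to the following (a sketch of the
queue_me() logic in the patch below):

	int prio;

	/*
	 * RT threads sort by their normal_prio; every SCHED_OTHER thread is
	 * clamped to MAX_RT_PRIO, so they all share one priority level and
	 * keep FIFO order among themselves, behind every RT waiter.
	 */
	prio = min(current->normal_prio, MAX_RT_PRIO);

	plist_node_init(&q->list, prio);
	plist_add(&q->list, &hb->chain);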

Signed-off-by: Sebastien Dugue <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 kernel/futex.c |   78 +++--
 1 file changed, 49 insertions(+), 29 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -81,12 +81,12 @@ struct futex_pi_state {
  * we can wake only the relevant ones (hashed queues may be shared).
  *
  * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when list_empty(&q->list) || q->lock_ptr == 0.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
  * The order of wakup is always to make the first condition true, then
  * wake up q->waiters, then make the second condition true.
  */
 struct futex_q {
-   struct list_head list;
+   struct plist_node list;
wait_queue_head_t waiters;
 
/* Which hash list lock to use: */
@@ -108,8 +108,8 @@ struct futex_q {
  * Split the global futex_lock into every hash list lock.
  */
 struct futex_hash_bucket {
-   spinlock_t  lock;
-   struct list_head   chain;
+   spinlock_t lock;
+   struct plist_head chain;
 };
 
static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
[...]
	head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex(&this->key, &me->key)) {
/*
 * Another waiter already exists - bump up
@@ -513,12 +513,12 @@ lookup_pi_state(u32 uval, struct futex_h
  */
 static void wake_futex(struct futex_q *q)
 {
-   list_del_init(&q->list);
+   plist_del(&q->list, &q->list.plist);
if (q->filp)
send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
/*
 * The lock in wake_up_all() is a crucial memory barrier after the
-* list_del_init() and also before assigning to q->lock_ptr.
+* plist_del() and also before assigning to q->lock_ptr.
 */
wake_up_all(&q->waiters);
/*
@@ -633,7 +633,7 @@ static int futex_wake(u32 __user *uaddr,
 {
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
-   struct list_head *head;
+   struct plist_head *head;
union futex_key key;
int ret;
 
@@ -647,7 +647,7 @@ static int futex_wake(u32 __user *uaddr,
spin_lock(&hb->lock);
head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state) {
ret = -EINVAL;
@@ -675,7 +675,7 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head;
+   struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;
 
@@ -748,7 +748,7 @@ retry:
 
head = &hb1->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
wake_futex(this);
if (++ret >= nr_wake)
@@ -760,7 +760,7 @@ retry:
head = &hb2->chain;
 
op_ret = 0;
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key2)) {
wake_futex(this);
if (++op_ret >= nr_wake2)
@@ -787,7 +787,7 @@ static int futex_requeue(u32 __user *uad
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head1;
+   struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
 
@@ -836,7 +836,7 @@ static int futex_requeue(u32 __user *uad
}
 
head1 = &hb1->chain;
-   list_for_each_entry_safe(this, next, head1, list) {
+   plist_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
continue;
if (++re

Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-20 Thread Pierre Peiffer

Peter Zijlstra wrote:



Unfortunately not, nonlinear vmas don't have a linear relation between
address and offset. What you would need to do is do a linear walk of the
page tables. But even that might not suffice if nonlinear vmas may form
a non-injective, surjective mapping.

/me checks..

Hmm, yes that seems valid, so in general, this reverse mapping does not
uniquely exist for non-linear vmas. :-(

What to do... disallow futexes in nonlinear mappings, 



store the address in the key?   <<


That seems to be the only solution... :-/



the vma_prio_tree would be able to give all vmas associated with a
mapping.



Thanks for your help.

--
Pierre


Re: [PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-20 Thread Pierre Peiffer

Peter Zijlstra wrote:

+static void *get_futex_address(union futex_key *key)
+{
+   void *uaddr;
+
+   if (key->both.offset & 1) {
+   /* shared mapping */
+   uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+   + key->shared.offset - 1);
+   } else {
+   /* private mapping */
+   uaddr = (void*)(key->private.address + key->private.offset);
+   }
+
+   return uaddr;
+}


This will not work for nonlinear vmas, granted, not a lot of ppl stick
futexes in nonlinear vmas, but the futex_key stuff handles it, this
doesn't.


Indeed! Thanks for pointing me to this.

Since I'm not familiar with the VM code, does this look more correct to you?

static void *get_futex_address(union futex_key *key)
{
void *uaddr;
struct vm_area_struct *vma = current->mm->mmap;

if (key->both.offset & 1) {
/* shared mapping */
struct file * vmf;

do {
if ((vmf = vma->vm_file)
&& (key->shared.inode == vmf->f_dentry->d_inode))
break;
vma = vma->vm_next;
} while (vma);

if (likely(!(vma->vm_flags & VM_NONLINEAR)))
uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+ key->shared.offset - 1);
else
uaddr = (void*) vma->vm_start
+ ((key->shared.pgoff - vma->vm_pgoff)
   << PAGE_SHIFT)
+ key->shared.offset - 1;
} else {
/* private mapping */
uaddr = (void*)(key->private.address + key->private.offset);
}

return uaddr;
}

Or is there a more direct way to retrieve the vma corresponding to the given
inode?
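
(For context, a userspace sketch of the non-injective case discussed above,
with error handling omitted: after remap_file_pages(), two different virtual
pages can refer to the same file page, so a futex key built from
(inode, pgoff, offset) maps back to more than one user address.)

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/futex-demo", O_RDWR | O_CREAT, 0600);
	char *m;

	ftruncate(fd, 2 * pg);
	m = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Point the second virtual page at file page 0 as well: the
	 * pgoff -> vaddr relation is now one-to-many. */
	remap_file_pages(m + pg, pg, 0, 0, 0);

	printf("file page 0 is visible at %p and %p\n",
	       (void *)m, (void *)(m + pg));
	return 0;
}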

Thanks,

--
Pierre


[PATCH 2.6.21-rc3-mm2 3/4] futex_requeue_pi optimization

2007-03-13 Thread Pierre . Peiffer
This patch provides the futex_requeue_pi functionality.

This provides an optimization, already used for (normal) futexes, to be used for
PI-futexes.

This optimization is currently used by glibc in pthread_cond_broadcast, when
using "normal" mutexes. With futex_requeue_pi, it can be used with PRIO_INHERIT
mutexes too.
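
For illustration, a sketch of the userspace pattern this optimization targets.
These are standard POSIX calls; nothing below is taken from the patch:

#include <pthread.h>

static pthread_mutex_t m;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void init_pi_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* PRIO_INHERIT mutexes are the ones backed by FUTEX_LOCK_PI */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);
}

static void wake_all_waiters(void)
{
	pthread_mutex_lock(&m);
	/* Without requeue-PI, every waiter is woken here and they all
	 * contend on m at once; with FUTEX_CMP_REQUEUE_PI, one waiter
	 * is woken and the rest are requeued onto the mutex. */
	pthread_cond_broadcast(&c);
	pthread_mutex_unlock(&m);
}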

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/linux/futex.h   |8 
 kernel/futex.c  |  557 +++-
 kernel/futex_compat.c   |3 
 kernel/rtmutex.c|   41 ---
 kernel/rtmutex_common.h |   34 ++
 5 files changed, 555 insertions(+), 88 deletions(-)

Index: b/include/linux/futex.h
===
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -15,6 +15,7 @@
 #define FUTEX_LOCK_PI  6
 #define FUTEX_UNLOCK_PI7
 #define FUTEX_TRYLOCK_PI   8
+#define FUTEX_CMP_REQUEUE_PI   9
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -83,9 +84,14 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED   0x4000
 
 /*
+ * Some processes have been requeued on this PI-futex
+ */
+#define FUTEX_WAITER_REQUEUED  0x2000
+
+/*
  * The rest of the robust-futex field is for the TID:
  */
-#define FUTEX_TID_MASK 0x3fff
+#define FUTEX_TID_MASK 0x0fff
 
 /*
  * This limit protects against a deliberately circular list.
Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -53,6 +53,12 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# include "rtmutex-debug.h"
+#else
+# include "rtmutex.h"
+#endif
+
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
@@ -102,6 +108,12 @@ struct futex_q {
/* Optional priority inheritance state: */
struct futex_pi_state *pi_state;
struct task_struct *task;
+
+   /*
+* This waiter is used in case of requeue from a
+* normal futex to a PI-futex
+*/
+   struct rt_mutex_waiter waiter;
 };
 
 /*
@@ -224,6 +236,25 @@ int get_futex_key(u32 __user *uaddr, uni
 EXPORT_SYMBOL_GPL(get_futex_key);
 
 /*
+ * Retrieve the original address used to compute this key
+ */
+static void *get_futex_address(union futex_key *key)
+{
+   void *uaddr;
+
+   if (key->both.offset & 1) {
+   /* shared mapping */
+   uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+   + key->shared.offset - 1);
+   } else {
+   /* private mapping */
+   uaddr = (void*)(key->private.address + key->private.offset);
+   }
+
+   return uaddr;
+}
+
+/*
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
@@ -439,7 +470,8 @@ void exit_pi_state_list(struct task_stru
 }
 
 static int
-lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, struct futex_q *me)
+lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+   union futex_key *key, struct futex_pi_state **ps)
 {
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
@@ -450,7 +482,7 @@ lookup_pi_state(u32 uval, struct futex_h
head = &hb->chain;
 
plist_for_each_entry_safe(this, next, head, list) {
-   if (match_futex(&this->key, &me->key)) {
+   if (match_futex(&this->key, key)) {
/*
 * Another waiter already exists - bump up
 * the refcount and return its pi_state:
@@ -465,7 +497,7 @@ lookup_pi_state(u32 uval, struct futex_h
WARN_ON(!atomic_read(&pi_state->refcount));
 
atomic_inc(&pi_state->refcount);
-   me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
}
@@ -492,7 +524,7 @@ lookup_pi_state(u32 uval, struct futex_h
rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);
 
/* Store the key for possible exit cleanups: */
-   pi_state->key = me->key;
+   pi_state->key = *key;
 
spin_lock_irq(&p->pi_lock);
WARN_ON(!list_empty(&pi_state->list));
@@ -502,7 +534,7 @@ lookup_pi_state(u32 uval, struct futex_h
 
put_task_struct(p);
 
-   me->pi_state = pi_state;
+   *ps = pi_state;
 
return 0;
 }
@@ -561,6 +593,8 @@ static int wake_futex_pi(u32 __user *uad
 */
if (!(uval & FUTEX_OWNER_DIED)) {
newval = FUTEX_WAITERS | new_owner->pid;
+   /* Keep the FUTEX_WAITER_REQUEUED flag if it was set */
+   newval |= (uval & FUTEX_WAITER_REQUEUED);
 
pagefault_disable();

[PATCH 2.6.21-rc3-mm2 4/4] sys_futex64 : allows 64bit futexes

2007-03-13 Thread Pierre . Peiffer
This last patch is an adaptation of the sys_futex64 syscall provided in the -rt
patch (originally written by Ingo). It allows the use of 64-bit futexes.

I have reworked most of the code to avoid code duplication.

It does not provide the functionality for all architectures (only for x86_64
for now).
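
For illustration, a hypothetical userspace wrapper. __NR_futex64 and the value
used below are stand-ins; the real number is assigned by the patched
include/asm-x86_64/unistd.h:

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_futex64
#define __NR_futex64 280	/* placeholder, see the unistd.h hunk */
#endif

/* Same calling convention as sys_futex, but uaddr and val are 64-bit. */
static long futex64(uint64_t *uaddr, int op, uint64_t val,
		    const struct timespec *timeout)
{
	return syscall(__NR_futex64, uaddr, op, val, timeout, NULL, 0);
}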

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/asm-x86_64/futex.h  |  113 
 include/asm-x86_64/unistd.h |4 
 include/linux/futex.h   |7 -
 include/linux/syscalls.h|3 
 kernel/futex.c  |  248 +++-
 kernel/futex_compat.c   |3 
 kernel/sys_ni.c |1 
 7 files changed, 301 insertions(+), 78 deletions(-)

Index: b/include/asm-x86_64/futex.h
===
--- a/include/asm-x86_64/futex.h
+++ b/include/asm-x86_64/futex.h
@@ -41,6 +41,39 @@
  "=&r" (tem)   \
: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
 
+#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:" insn "\n" \
+"2:.section .fixup,\"ax\"\n\
+3: movq%3, %1\n\
+   jmp 2b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,3b\n\
+   .previous"  \
+   : "=r" (oldval), "=r" (ret), "=m" (*uaddr)  \
+   : "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
+
+#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:movq%2, %0\n\
+   movq%0, %3\n"   \
+   insn "\n"   \
+"2:" LOCK_PREFIX "cmpxchgq %3, %2\n\
+   jnz 1b\n\
+3: .section .fixup,\"ax\"\n\
+4: movq%5, %1\n\
+   jmp 3b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,4b,2b,4b\n\
+   .previous"  \
+   : "=&a" (oldval), "=&r" (ret), "=m" (*uaddr),   \
+ "=&r" (tem)   \
+   : "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
 {
@@ -95,6 +128,60 @@ futex_atomic_op_inuser (int encoded_op, 
 }
 
 static inline int
+futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
+{
+   int op = (encoded_op >> 28) & 7;
+   int cmp = (encoded_op >> 24) & 15;
+   u64 oparg = (encoded_op << 8) >> 20;
+   u64 cmparg = (encoded_op << 20) >> 20;
+   u64 oldval = 0, ret, tem;
+
+   if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
+   oparg = 1 << oparg;
+
+   if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
+   return -EFAULT;
+
+   inc_preempt_count();
+
+   switch (op) {
+   case FUTEX_OP_SET:
+   __futex_atomic_op1_64("xchgq %0, %2", ret, oldval, uaddr, 
oparg);
+   break;
+   case FUTEX_OP_ADD:
+   __futex_atomic_op1_64(LOCK_PREFIX "xaddq %0, %2", ret, oldval,
+  uaddr, oparg);
+   break;
+   case FUTEX_OP_OR:
+   __futex_atomic_op2_64("orq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   case FUTEX_OP_ANDN:
+   __futex_atomic_op2_64("andq %4, %3", ret, oldval, uaddr, 
~oparg);
+   break;
+   case FUTEX_OP_XOR:
+   __futex_atomic_op2_64("xorq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   default:
+   ret = -ENOSYS;
+   }
+
+   dec_preempt_count();
+
+   if (!ret) {
+   switch (cmp) {
+   case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
+   case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
+   case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
+   case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
+   case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
+   case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
+   default: ret = -ENOSYS;
+   }
+   }
+   return ret;
+}
+
+static inline int
 futex_atomic_cmpxchg_inatom

[PATCH 2.6.21-rc3-mm2 0/4] Futexes functionalities and improvements

2007-03-13 Thread Pierre . Peiffer
Hi Andrew,

This is a resend of a series of patches concerning futexes (a short
description follows).

Could you consider them for inclusion in the -mm tree?

All of them have already been discussed in January and have been included in
-rt for a while. I think we agreed to potentially include them in the -mm tree.

Ulrich is especially interested in sys_futex64.

They are:
* futex uses prio list : allows RT-threads to be woken in priority order
instead of FIFO order.
* futex_wait uses hrtimer : allows the use of finer timer resolution.
* futex_requeue_pi functionality : allows use of requeue optimization for
PI-mutexes/PI-futexes.
* futex64 syscall : allows use of 64-bit futexes instead of 32-bit. 


Note: it does not include the fix "PI state locking fix" sent yesterday by Ingo.

Thanks,

-- 
Pierre


[PATCH 2.6.21-rc3-mm2 1/4] futex priority based wakeup

2007-03-13 Thread Pierre . Peiffer
Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plist (priority-ordered lists) instead of a simple list in
futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT-threads are woken first, in priority order).

Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 kernel/futex.c |   78 +++--
 1 file changed, 49 insertions(+), 29 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -81,12 +81,12 @@ struct futex_pi_state {
  * we can wake only the relevant ones (hashed queues may be shared).
  *
  * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when list_empty(&q->list) || q->lock_ptr == 0.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
  * The order of wakup is always to make the first condition true, then
  * wake up q->waiters, then make the second condition true.
  */
 struct futex_q {
-   struct list_head list;
+   struct plist_node list;
wait_queue_head_t waiters;
 
/* Which hash list lock to use: */
@@ -108,8 +108,8 @@ struct futex_q {
  * Split the global futex_lock into every hash list lock.
  */
 struct futex_hash_bucket {
-   spinlock_t  lock;
-   struct list_head   chain;
+   spinlock_t lock;
+   struct plist_head chain;
 };
 
 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
[...]
 	head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex(&this->key, &me->key)) {
/*
 * Another waiter already exists - bump up
@@ -513,12 +513,12 @@ lookup_pi_state(u32 uval, struct futex_h
  */
 static void wake_futex(struct futex_q *q)
 {
-   list_del_init(&q->list);
+   plist_del(&q->list, &q->list.plist);
if (q->filp)
send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
/*
 * The lock in wake_up_all() is a crucial memory barrier after the
-* list_del_init() and also before assigning to q->lock_ptr.
+* plist_del() and also before assigning to q->lock_ptr.
 */
wake_up_all(&q->waiters);
/*
@@ -631,7 +631,7 @@ static int futex_wake(u32 __user *uaddr,
 {
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
-   struct list_head *head;
+   struct plist_head *head;
union futex_key key;
int ret;
 
@@ -645,7 +645,7 @@ static int futex_wake(u32 __user *uaddr,
spin_lock(&hb->lock);
head = &hb->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state) {
ret = -EINVAL;
@@ -673,7 +673,7 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head;
+   struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;
 
@@ -746,7 +746,7 @@ retry:
 
head = &hb1->chain;
 
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
wake_futex(this);
if (++ret >= nr_wake)
@@ -758,7 +758,7 @@ retry:
head = &hb2->chain;
 
op_ret = 0;
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key2)) {
wake_futex(this);
if (++op_ret >= nr_wake2)
@@ -785,7 +785,7 @@ static int futex_requeue(u32 __user *uad
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head1;
+   struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
 
@@ -834,7 +834,7 @@ static int futex_requeue(u32 __user *uad
}
 
head1 = &hb1->chain;
-   list_for_each_entry_safe(this, next, head1, list) {
+   plist_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (&this->key, &key1))
continue;
if (++re

[PATCH 2.6.21-rc3-mm2 2/4] Make futex_wait() use an hrtimer for timeout

2007-03-13 Thread Pierre . Peiffer
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

  schedule_timeout() is tick based, therefore the timeout granularity is
the tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution
timer for timeout wakeup, we can attain a much finer timeout granularity
(in the microsecond range). This parallels what is already done for
futex_lock_pi().

  The timeout passed to the syscall is no longer converted to jiffies
and is therefore passed to do_futex() and futex_wait() as a timespec
therefore keeping nanosecond resolution.

  Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

  In futex_wait(), if there is no timeout then a regular schedule() is
performed. Otherwise, an hrtimer is fired before schedule() is called.
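
For illustration, a hedged userspace check of the granularity claim above:
time a 100 us FUTEX_WAIT timeout. On a tick-based kernel the sleep rounds up
to roughly one jiffy (1, 4 or 10 ms); with the hrtimer path it should come
back close to 100 us:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

int main(void)
{
	int futex_word = 0;
	struct timespec t0, t1, timeout = { 0, 100 * 1000 }; /* 100 us */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* the futex value matches (0 == 0), so we sleep until timeout */
	syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, &timeout, NULL, 0);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("slept %ld ns\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec));
	return 0;
}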

Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
 include/linux/futex.h |2 -
 kernel/futex.c|   59 +-
 kernel/futex_compat.c |   12 ++
 3 files changed, 43 insertions(+), 30 deletions(-)

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -998,7 +998,7 @@ static void unqueue_me_pi(struct futex_q
drop_futex_key_refs(&q->key);
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, struct timespec *time)
 {
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1006,6 +1006,8 @@ static int futex_wait(u32 __user *uaddr,
struct futex_q q;
u32 uval;
int ret;
+   struct hrtimer_sleeper t;
+   int rem = 0;
 
q.pi_state = NULL;
  retry:
@@ -1083,8 +1085,31 @@ static int futex_wait(u32 __user *uaddr,
 * !plist_node_empty() is safe here without any lock.
 * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
 */
-   if (likely(!plist_node_empty(&q.list)))
-   time = schedule_timeout(time);
+   if (likely(!plist_node_empty(&q.list))) {
+   if (!time)
+   schedule();
+   else {
+   hrtimer_init(&t.timer, CLOCK_MONOTONIC, 
HRTIMER_MODE_REL);
+   hrtimer_init_sleeper(&t, current);
+   t.timer.expires = timespec_to_ktime(*time);
+
+   hrtimer_start(&t.timer, t.timer.expires, 
HRTIMER_MODE_REL);
+
+   /*
+* the timer could have already expired, in which
+* case current would be flagged for rescheduling.
+* Don't bother calling schedule.
+*/
+   if (likely(t.task))
+   schedule();
+
+   hrtimer_cancel(&t.timer);
+
+   /* Flag if a timeout occured */
+   rem = (t.task == NULL);
+   }
+   }
+
__set_current_state(TASK_RUNNING);
 
/*
@@ -1095,7 +1120,7 @@ static int futex_wait(u32 __user *uaddr,
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!unqueue_me(&q))
return 0;
-   if (time == 0)
+   if (rem)
return -ETIMEDOUT;
/*
 * We expect signal_pending(current), but another thread may
@@ -1117,8 +1142,8 @@ static int futex_wait(u32 __user *uaddr,
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(u32 __user *uaddr, int detect, unsigned long sec,
-long nsec, int trylock)
+static int futex_lock_pi(u32 __user *uaddr, int detect, struct timespec *time,
+int trylock)
 {
struct hrtimer_sleeper timeout, *to = NULL;
struct task_struct *curr = current;
@@ -1130,11 +1155,11 @@ static int futex_lock_pi(u32 __user *uad
if (refill_pi_state_cache())
return -ENOMEM;
 
-   if (sec != MAX_SCHEDULE_TIMEOUT) {
+   if (time) {
to = &timeout;
hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
-   to->timer.expires = ktime_set(sec, nsec);
+   to->timer.expires = timespec_to_ktime(*time);
}
 
q.pi_state = NULL;
@@ -1770,7 +1795,7 @@ void exit_robust_list(struct task_struct
}
 }
 
-long do_futex(u32 __user *uaddr, int op, u32 val, unsigned long timeout,
+long do_futex(u32 __user *uaddr, int op, u32 val, struct timespec *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
 {
int ret;
@@ -1796,13 +1821,13 @@ long do_futex(u32 __user *uaddr,

Re: How to distinguish original kernel vs -rt kernel

2007-03-07 Thread Pierre Peiffer

Thomas Gleixner wrote:



It is HRTIMER_MODE_xx in mainline as of 2.6.21-rc1. -rt kernels are
always a bit ahead of time. :)


Great!
Thanks.

--
Pierre


How to distinguish original kernel vs -rt kernel

2007-03-07 Thread Pierre Peiffer

Hi,

Supposing I have an external kernel module which I would like to compile against
both the original kernel and the -rt kernel, what is the proper/most elegant way
to know which kernel I'm compiling with?

I've only found the EXTRAVERSION define; am I missing a better way?

In fact, I'm facing the problem of HRTIMER_ABS/REL being renamed to
HRTIMER_MODE_ABS/REL by the -rt patch. Is there a reason for this?


Does anyone have an objection to keeping the name the same (let's say
HRTIMER_ABS/REL) in the -rt kernel?
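
For illustration, one possible compat shim, under two assumptions: that the
-rt patch defines CONFIG_PREEMPT_RT, and that mainline carries the new names
from 2.6.21-rc1 on. The COMPAT_* names are invented. #ifndef cannot help here,
because HRTIMER_ABS/REL and HRTIMER_MODE_ABS/REL are enum values, not macros:

#include <linux/version.h>
#include <linux/hrtimer.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 21) || defined(CONFIG_PREEMPT_RT)
# define COMPAT_HRTIMER_ABS	HRTIMER_MODE_ABS
# define COMPAT_HRTIMER_REL	HRTIMER_MODE_REL
#else
# define COMPAT_HRTIMER_ABS	HRTIMER_ABS
# define COMPAT_HRTIMER_REL	HRTIMER_REL
#endif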


Thanks,

--
Pierre


Re: [PATCH] futex null pointer timeout

2007-01-18 Thread Pierre Peiffer

Ingo Molnar wrote:

* Daniel Walker <[EMAIL PROTECTED]> wrote:


[...]
The patch reworks do_futex, and futex_wait* so a NULL pointer in the 
timeout position is infinite, and anything else is evaluated as a real 
timeout.


thanks, applied.



On top of this patch, you will need the following patch: futex_lock_pi is also 
involved.
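
For illustration, the userspace view of the convention both fixes implement
(a minimal sketch using the standard syscall interface): a NULL timespec
means "block forever", a non-NULL one is a real timeout:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wait_forever(int *uaddr, int val)
{
	/* NULL timeout: wait indefinitely for a FUTEX_WAKE */
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}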


---
 futex.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
---

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
Index: linux-2.6/kernel/futex.c
===
--- linux-2.6.orig/kernel/futex.c   2007-01-18 13:16:32.0 +0100
+++ linux-2.6/kernel/futex.c2007-01-18 13:19:32.0 +0100
@@ -1644,7 +1644,7 @@ static int futex_lock_pi(u32 __user *uad
if (refill_pi_state_cache())
return -ENOMEM;

-   if (time->tv_sec || time->tv_nsec) {
+   if (time) {
to = &timeout;
hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
@@ -3197,7 +3197,7 @@ static int futex_lock_pi64(u64 __user *u
if (refill_pi_state_cache())
return -ENOMEM;

-   if (time->tv_sec || time->tv_nsec) {
+   if (time) {
to = &timeout;
hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);

--
Pierre


[PATCH 2.6.20-rc5 4/4] sys_futex64 : allows 64bit futexes

2007-01-17 Thread Pierre Peiffer

Hi,

This latest patch is an adaptation of the sys_futex64 syscall provided in the -rt
patch (originally written by Ingo). It allows the use of 64-bit futexes.

I have reworked most of the code to avoid code duplication.

It does not provide the functionality for all architectures, and thus it cannot
be applied "as is".
Feedback and comments are welcome.

---
 include/asm-x86_64/futex.h  |  113 
 include/asm-x86_64/unistd.h |4
 include/linux/futex.h   |5
 include/linux/syscalls.h|3
 kernel/futex.c  |  247 ++--
 kernel/futex_compat.c   |3
 kernel/sys_ni.c |1
 7 files changed, 299 insertions(+), 77 deletions(-)

---

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-x86_64/futex.h
===
--- linux-2.6.orig/include/asm-x86_64/futex.h   2007-01-17 09:39:54.0 
+0100
+++ linux-2.6/include/asm-x86_64/futex.h2007-01-17 09:44:57.0 
+0100
@@ -41,6 +41,39 @@
  "=&r" (tem) \
: "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))

+#define __futex_atomic_op1_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:   " insn "\n"  \
+"2:   .section .fixup,\"ax\"\n\
+3: movq%3, %1\n\
+   jmp 2b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,3b\n\
+   .previous" \
+   : "=r" (oldval), "=r" (ret), "=m" (*uaddr)\
+   : "i" (-EFAULT), "m" (*uaddr), "0" (oparg), "1" (0))
+
+#define __futex_atomic_op2_64(insn, ret, oldval, uaddr, oparg) \
+  __asm__ __volatile ( \
+"1:   movq%2, %0\n\
+   movq%0, %3\n"  \
+   insn "\n" \
+"2:   " LOCK_PREFIX "cmpxchgq %3, %2\n\
+   jnz 1b\n\
+3: .section .fixup,\"ax\"\n\
+4: movq%5, %1\n\
+   jmp 3b\n\
+   .previous\n\
+   .section __ex_table,\"a\"\n\
+   .align  8\n\
+   .quad   1b,4b,2b,4b\n\
+   .previous" \
+   : "=&a" (oldval), "=&r" (ret), "=m" (*uaddr), \
+ "=&r" (tem) \
+   : "r" (oparg), "i" (-EFAULT), "m" (*uaddr), "1" (0))
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, int __user *uaddr)
 {
@@ -95,6 +128,60 @@ futex_atomic_op_inuser (int encoded_op,
 }

 static inline int
+futex_atomic_op_inuser64 (int encoded_op, u64 __user *uaddr)
+{
+   int op = (encoded_op >> 28) & 7;
+   int cmp = (encoded_op >> 24) & 15;
+   u64 oparg = (encoded_op << 8) >> 20;
+   u64 cmparg = (encoded_op << 20) >> 20;
+   u64 oldval = 0, ret, tem;
+
+   if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
+   oparg = 1 << oparg;
+
+   if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u64)))
+   return -EFAULT;
+
+   inc_preempt_count();
+
+   switch (op) {
+   case FUTEX_OP_SET:
+   __futex_atomic_op1_64("xchgq %0, %2", ret, oldval, uaddr, 
oparg);
+   break;
+   case FUTEX_OP_ADD:
+   __futex_atomic_op1_64(LOCK_PREFIX "xaddq %0, %2", ret, oldval,
+  uaddr, oparg);
+   break;
+   case FUTEX_OP_OR:
+   __futex_atomic_op2_64("orq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   case FUTEX_OP_ANDN:
+   __futex_atomic_op2_64("andq %4, %3", ret, oldval, uaddr, 
~oparg);
+   break;
+   case FUTEX_OP_XOR:
+   __futex_atomic_op2_64("xorq %4, %3", ret, oldval, uaddr, oparg);
+   break;
+   default:
+   ret = -ENOSYS;
+   }
+
+   dec_preempt_count();
+
+   if (!ret) {
+   switch (cmp) {
+   case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
+   case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
+   case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
+   case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
+   case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
+   case FUTEX_OP_C

[PATCH 2.6.20-rc5 3/4] futex_requeue_pi optimization

2007-01-17 Thread Pierre Peiffer

Hi,

This patch provides the futex_requeue_pi functionality.

This provides an optimization, already used for (normal) futexes, to be used for
PI-futexes.

This optimization is currently used by glibc in pthread_cond_broadcast, when
using "normal" mutexes. With futex_requeue_pi, it can be used with PRIO_INHERIT
mutexes too.

---

 include/linux/futex.h   |8
 kernel/futex.c  |  557 +++-
 kernel/futex_compat.c   |3
 kernel/rtmutex.c|   41 ---
 kernel/rtmutex_common.h |   34 ++
 5 files changed, 555 insertions(+), 88 deletions(-)

---

Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/linux/futex.h
===
--- linux-2.6.orig/include/linux/futex.h2007-01-17 09:44:42.0 
+0100
+++ linux-2.6/include/linux/futex.h 2007-01-17 09:44:47.0 +0100
@@ -15,6 +15,7 @@
 #define FUTEX_LOCK_PI  6
 #define FUTEX_UNLOCK_PI7
 #define FUTEX_TRYLOCK_PI   8
+#define FUTEX_CMP_REQUEUE_PI   9

 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -83,9 +84,14 @@ struct robust_list_head {
 #define FUTEX_OWNER_DIED   0x4000

 /*
+ * Some processes have been requeued on this PI-futex
+ */
+#define FUTEX_WAITER_REQUEUED  0x2000
+
+/*
  * The rest of the robust-futex field is for the TID:
  */
-#define FUTEX_TID_MASK 0x3fff
+#define FUTEX_TID_MASK 0x0fff

 /*
  * This limit protects against a deliberately circular list.
Index: linux-2.6/kernel/futex.c
===
--- linux-2.6.orig/kernel/futex.c   2007-01-17 09:44:42.0 +0100
+++ linux-2.6/kernel/futex.c2007-01-17 09:44:47.0 +0100
@@ -52,6 +52,12 @@

 #include "rtmutex_common.h"

+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# include "rtmutex-debug.h"
+#else
+# include "rtmutex.h"
+#endif
+
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)

 /*
@@ -127,6 +133,12 @@ struct futex_q {
/* Optional priority inheritance state: */
struct futex_pi_state *pi_state;
struct task_struct *task;
+
+   /*
+* This waiter is used in case of requeue from a
+* normal futex to a PI-futex
+*/
+   struct rt_mutex_waiter waiter;
 };

 /*
@@ -248,6 +260,25 @@ static int get_futex_key(u32 __user *uad
 }

 /*
+ * Retrieve the original address used to compute this key
+ */
+static void *get_futex_address(union futex_key *key)
+{
+   void *uaddr;
+
+   if (key->both.offset & 1) {
+   /* shared mapping */
+   uaddr = (void*)((key->shared.pgoff << PAGE_SHIFT)
+   + key->shared.offset - 1);
+   } else {
+   /* private mapping */
+   uaddr = (void*)(key->private.address + key->private.offset);
+   }
+
+   return uaddr;
+}
+
+/*
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
@@ -461,7 +492,8 @@ void exit_pi_state_list(struct task_stru
 }

 static int
-lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, struct futex_q *me)
+lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+   union futex_key *key, struct futex_pi_state **ps)
 {
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
@@ -472,7 +504,7 @@ lookup_pi_state(u32 uval, struct futex_h
head = &hb->chain;

plist_for_each_entry_safe(this, next, head, list) {
-   if (match_futex(&this->key, &me->key)) {
+   if (match_futex(&this->key, key)) {
/*
 * Another waiter already exists - bump up
 * the refcount and return its pi_state:
@@ -487,7 +519,7 @@ lookup_pi_state(u32 uval, struct futex_h
WARN_ON(!atomic_read(&pi_state->refcount));

atomic_inc(&pi_state->refcount);
-   me->pi_state = pi_state;
+   *ps = pi_state;

return 0;
}
@@ -514,7 +546,7 @@ lookup_pi_state(u32 uval, struct futex_h
rt_mutex_init_proxy_locked(&pi_state->pi_mutex, p);

/* Store the key for possible exit cleanups: */
-   pi_state->key = me->key;
+   pi_state->key = *key;

spin_lock_irq(&p->pi_lock);
WARN_ON(!list_empty(&pi_state->list));
@@ -524,7 +556,7 @@ lookup_pi_state(u32 uval, struct futex_h

put_task_struct(p);

-   me->pi_state = pi_state;
+   *ps = pi_state;

return 0;
 }
@@ -583,6 +615,8 @@ static int wake_futex_pi(u32 __user *uad
 */
if (!(uval & FUTEX_OWNER_DIED)) {
newval = FUTEX_WAITERS | new_owner->pid;
+   

[PATCH 2.6.20-rc5 2/4] Make futex_wait() use an hrtimer for timeout

2007-01-17 Thread Pierre Peiffer

  Hi,

  This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout() in an RT kernel.

  More details in the patch header.




  This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

  schedule_timeout() is tick based, therefore the timeout granularity is
the tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution
timer for timeout wakeup, we can attain a much finer timeout granularity
(in the microsecond range). This parallels what is already done for
futex_lock_pi().

  The timeout passed to the syscall is no longer converted to jiffies
and is therefore passed to do_futex() and futex_wait() as a timespec
therefore keeping nanosecond resolution.

  Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

  In futex_wait(), if the timeout is zero then a regular schedule() is
performed. Otherwise, an hrtimer is fired before schedule() is called.

---
 include/linux/futex.h |2 -
 kernel/futex.c|   58 --
 kernel/futex_compat.c |   11 +
 3 files changed, 41 insertions(+), 30 deletions(-)

---

Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
Index: linux-2.6/kernel/futex.c
===
--- linux-2.6.orig/kernel/futex.c   2007-01-17 09:44:25.0 +0100
+++ linux-2.6/kernel/futex.c2007-01-17 09:44:42.0 +0100
@@ -1020,7 +1020,7 @@ static void unqueue_me_pi(struct futex_q
drop_key_refs(&q->key);
 }

-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, struct timespec *time)
 {
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1028,6 +1028,8 @@ static int futex_wait(u32 __user *uaddr,
struct futex_q q;
u32 uval;
int ret;
+   struct hrtimer_sleeper t;
+   int rem = 0;

q.pi_state = NULL;
  retry:
@@ -1105,8 +1107,31 @@ static int futex_wait(u32 __user *uaddr,
 * !plist_node_empty() is safe here without any lock.
 * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
 */
-   if (likely(!plist_node_empty(&q.list)))
-   time = schedule_timeout(time);
+   if (likely(!plist_node_empty(&q.list))) {
+   if (time->tv_sec == 0 && time->tv_nsec == 0)
+   schedule();
+   else {
+   hrtimer_init(&t.timer, CLOCK_MONOTONIC, HRTIMER_REL);
+   hrtimer_init_sleeper(&t, current);
+   t.timer.expires = timespec_to_ktime(*time);
+
+   hrtimer_start(&t.timer, t.timer.expires, HRTIMER_REL);
+
+   /*
+* the timer could have already expired, in which
+* case current would be flagged for rescheduling.
+* Don't bother calling schedule.
+*/
+   if (likely(t.task))
+   schedule();
+
+   hrtimer_cancel(&t.timer);
+
+   /* Flag if a timeout occured */
+   rem = (t.task == NULL);
+   }
+   }
+
__set_current_state(TASK_RUNNING);

/*
@@ -1117,7 +1142,7 @@ static int futex_wait(u32 __user *uaddr,
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!unqueue_me(&q))
return 0;
-   if (time == 0)
+   if (rem)
return -ETIMEDOUT;
/*
 * We expect signal_pending(current), but another thread may
@@ -1139,8 +1164,8 @@ static int futex_wait(u32 __user *uaddr,
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(u32 __user *uaddr, int detect, unsigned long sec,
-long nsec, int trylock)
+static int futex_lock_pi(u32 __user *uaddr, int detect, struct timespec *time,
+int trylock)
 {
struct hrtimer_sleeper timeout, *to = NULL;
struct task_struct *curr = current;
@@ -1152,11 +1177,11 @@ static int futex_lock_pi(u32 __user *uad
if (refill_pi_state_cache())
return -ENOMEM;

-   if (sec != MAX_SCHEDULE_TIMEOUT) {
+   if (time->tv_sec || time->tv_nsec) {
to = &timeout;
hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_ABS);
hrtimer_init_sleeper(to, current);
-   to->timer.expires = ktime_set(sec, nsec);
+   to->timer.expires = t

[PATCH 2.6.20-rc5 1/4] futex priority based wakeup

2007-01-17 Thread Pierre Peiffer

Hi,

Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plist (priority-ordered lists) instead of a simple list in
futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT-threads are woken first, in priority order).

---

 futex.c |   78 
 1 file changed, 49 insertions(+), 29 deletions(-)

---

Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]>
Signed-off-by: Pierre Peiffer <[EMAIL PROTECTED]>

---
Index: linux-2.6/kernel/futex.c
===
--- linux-2.6.orig/kernel/futex.c   2007-01-17 09:39:57.0 +0100
+++ linux-2.6/kernel/futex.c2007-01-17 09:44:25.0 +0100
@@ -106,12 +106,12 @@ struct futex_pi_state {
  * we can wake only the relevant ones (hashed queues may be shared).
  *
  * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when list_empty(&q->list) || q->lock_ptr == 0.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
  * The order of wakup is always to make the first condition true, then
  * wake up q->waiters, then make the second condition true.
  */
 struct futex_q {
-   struct list_head list;
+   struct plist_node list;
wait_queue_head_t waiters;

/* Which hash list lock to use: */
@@ -133,8 +133,8 @@ struct futex_q {
  * Split the global futex_lock into every hash list lock.
  */
 struct futex_hash_bucket {
-   spinlock_t  lock;
-   struct list_head   chain;
+   spinlock_t lock;
+   struct plist_head chain;
 };

 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
[...]
 	head = &hb->chain;

-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex(&this->key, &me->key)) {
/*
 * Another waiter already exists - bump up
@@ -535,12 +535,12 @@ lookup_pi_state(u32 uval, struct futex_h
  */
 static void wake_futex(struct futex_q *q)
 {
-   list_del_init(&q->list);
+   plist_del(&q->list, &q->list.plist);
if (q->filp)
send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
/*
 * The lock in wake_up_all() is a crucial memory barrier after the
-* list_del_init() and also before assigning to q->lock_ptr.
+* plist_del() and also before assigning to q->lock_ptr.
 */
wake_up_all(&q->waiters);
/*
@@ -653,7 +653,7 @@ static int futex_wake(u32 __user *uaddr,
 {
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
-   struct list_head *head;
+   struct plist_head *head;
union futex_key key;
int ret;

@@ -667,7 +667,7 @@ static int futex_wake(u32 __user *uaddr,
spin_lock(&hb->lock);
head = &hb->chain;

-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state) {
ret = -EINVAL;
@@ -695,7 +695,7 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head;
+   struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;

@@ -768,7 +768,7 @@ retry:

head = &hb1->chain;

-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
wake_futex(this);
if (++ret >= nr_wake)
@@ -780,7 +780,7 @@ retry:
head = &hb2->chain;

op_ret = 0;
-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key2)) {
wake_futex(this);
if (++op_ret >= nr_wake2)
@@ -807,7 +807,7 @@ static int futex_requeue(u32 __user *uad
 {
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
-   struct list_head *head1;
+   struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;

@@ -856,7 +856,7 @@ static int futex_requeue(u32 __user *uad
}

head1 = &hb1->chain;
-   list_for_each_entry_safe(this, next, head1, list) {
+   plist_for_each_entry_safe(this, next, head1, list) {
  

[PATCH 2.6.20-rc5 0/4] futexes functionalities and improvements

2007-01-17 Thread Pierre Peiffer

Hi,

Today, there are several functionalities or improvements about futexes included
in the -rt kernel tree which, I think, make sense to have in mainline.

Among them, there are:
* futex use prio list : allows RT-threads to be woken in priority order
instead of FIFO order.
* futex_wait use hrtimer : allows the use of finer timer resolution.
* futex_requeue_pi functionality : allows use of requeue optimization for
PI-mutexes/PI-futexes.
* futex64 syscall : allows use of 64-bit futexes instead of 32-bit.

The following mails provide the corresponding patches.


I re-send this series for kernel 2.6.20-rc5 with these small modifications:

 - the futex_use_prio_list patch now stores all non-real-time threads with the same
priority (MAX_RT_PRIO, which is a lower priority than real-time priorities),
causing them to be stored in FIFO order. RT-threads are still woken first in
priority order.
 - futex_requeue_pi: I've found (and corrected of course) a bug causing a
memory leak.

plist (patch 1/4) is still under discussion: I think it should be taken into
account, because it addresses a correctness issue with a very low cost as a
drawback (I would even say "without noticeable cost" ;-) but that's my opinion
of course).
Anyway, I can still provide the same series without patch 1/4 if needed.

Comments and feedback are still welcome, as usual.

--
Pierre



Re: [PATCH 2.6.20-rc4 0/4] futexes functionalities and improvements

2007-01-16 Thread Pierre Peiffer

Ingo Molnar wrote:

* Ulrich Drepper <[EMAIL PROTECTED]> wrote:


what do you mean by that - which is this same resource?
From what has been said here before, all futexes are stored in the 
same list or hash table or whatever it was.  I want to see how that 
code behaves if many separate processes concurrently use futexes.


futexes are stored in the bucket hash, and these patches do not change 
that. The pi-list that was talked about is per-futex. So there's no 
change to the way futexes are hashed nor should there be any scalability 
impact - besides the micro-impact that was measured in a number of ways 
- AFAICS.


Yes, that's completely right!

--
Pierre


Re: [PATCH 2.6.20-rc4 0/4] futexes functionalities and improvements

2007-01-16 Thread Pierre Peiffer

Hi,

Ingo Molnar wrote:
yeah. As an alternative, it might be a good idea to pthread-ify 
hackbench.c - that should replicate the Volano workload pretty 
accurately. I've attached hackbench.c. (it's process based right now, so 
it wont trigger contended futex ops)


Ok, thanks. I've adapted your test, Ingo, and done some measurements. (I've only
replaced fork with pthread_create; I didn't use a condvar or barrier for the
initial synchronization. A rough sketch of the change follows the URL below.)

The modified hackbench is available here:

http://www.bullopensource.org/posix/pi-futex/hackbench_pth.c
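
Roughly, the substitution looks like this (a sketch with invented names; the
real modified bench is at the URL above):

#include <pthread.h>

/* stand-in for hackbench's per-task work loop */
static void *worker(void *arg)
{
	(void)arg;
	return NULL;
}

static pthread_t spawn(void *arg)
{
	pthread_t tid;

	/* was, in the process version:
	 *   if (!(pid = fork())) { worker(arg); exit(0); }  */
	pthread_create(&tid, NULL, worker, arg);
	return tid;
}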

I've run this bench 1000 times with pipe and 800 groups.
Here are the results:

Test1 - with simple list (i.e. without any futex patches)
=
Iterations=1000
Latency (s)   min      max      avg      stddev
              26.67    27.89    27.14    0.19

Test2 - with plist (i.e. with only patch 1/4 as is)
===
Iterations=1000
Latency (s)   min      max      avg      stddev
              26.87    28.18    27.30    0.18

Test3 - with plist but all SCHED_OTHER registered
with the same priority (MAX_RT_PRIO)
(i.e. with modified patch 1/4, patch not yet posted here)
=
Iterations=1000
Latency (s)   min      max      avg      stddev
              26.74    27.84    27.16    0.18


--
Pierre


Re: [PATCH 2.6.20-rc4 0/4] futexes functionalities and improvements

2007-01-11 Thread Pierre Peiffer

Andrew Morton wrote:
> OK.  Unfortunately patches 2-4 don't apply without #1 present and the fix
> is not immediately obvious, so we'll need a respin+retest, please.

Ok, I'll provide updated patches for -mm ASAP.


On Thu, 11 Jan 2007 09:47:28 -0800
Ulrich Drepper <[EMAIL PROTECTED]> wrote:



if the patches allow this, I'd like to see parts 2, 3, and 4 to be in
-mm ASAP.  Especially the 64-bit variants are urgently needed.  Just
hold off adding the plist use, I am still not convinced that
unconditional use is a good thing, especially with one single global list.


Just to avoid any misunderstanding (I really do understand your point about the
performance issue):


* the problem I mentioned, of several futexes hashed onto the same key and thus
all potential waiters listed on the same list, is _not_ a new problem introduced
by this patch: it already exists today, with the simple list.


* the performance measurements done with pthread_cond_broadcast (and thus with
futex_requeue) are a good choice (well, maybe not realistic when considering
real applications (*)) to highlight the performance impact, rather than threads
making FUTEX_WAIT/FUTEX_WAKE: what is expensive with plist is the plist_add
operation (which occurs in FUTEX_WAIT), not plist_del (which occurs during
FUTEX_WAKE, so no big impact should be noticed there). Any measurement will be
difficult to do with only FUTEX_WAIT/WAKE.


=> futex_requeue does as many plist_del/plist_add operations as the number of 
threads waiting (minus 1), and thus has a direct impact on the time needed to 
wake everybody (or to wake the first thread to be more precise).


(*) I'll try the volano bench, if I have time.


--
Pierre


Re: [PATCH 2.6.20-rc4 1/4] futex priority based wakeup

2007-01-10 Thread Pierre Peiffer

Daniel Walker wrote:

On Tue, 2007-01-09 at 17:16 +0100, Pierre Peiffer wrote:

@@ -1358,7 +1366,7 @@ static int futex_unlock_pi(u32 __user *u
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
u32 uval;
-   struct list_head *head;
+   struct plist_head *head;
union futex_key key;
int ret, attempt = 0;

@@ -1409,7 +1417,7 @@ retry_locked:
 */
head = &hb->chain;

-   list_for_each_entry_safe(this, next, head, list) {
+   plist_for_each_entry_safe(this, next, head, list) {
if (!match_futex (&this->key, &key))
continue;
ret = wake_futex_pi(uaddr, uval, this);



Is this really necessary? The rtmutex will priority sort the waiters
when you enable priority inheritance. Inside the wake_futex_pi() it
actually just pulls the new owner off another plist inside the
rtmutex structure.


Yes: it is necessary for non-PI-futexes (i.e. "normal" futexes).

As the hash bucket list is common to both futexes and PI-futexes, yes, in the
case of a PI-futex the task is queued twice, on two plists.


--
Pierre

