Re: [PATCH v4 9/9] test: IPC message queue migration test

2012-09-11 Thread Manfred Spraul

Hi Stanislav,

sorry for the slow reply, I finally got the time to perform tests with 
your patches.



On 08/13/2012 02:32 PM, Stanislav Kinsbursky wrote:

This test is a part of the CRIU development test suite.
---
  tools/testing/selftests/ipc/msgque.c |  151 ++
  1 files changed, 151 insertions(+), 0 deletions(-)
  create mode 100644 tools/testing/selftests/ipc/msgque.c

diff --git a/tools/testing/selftests/ipc/msgque.c 
b/tools/testing/selftests/ipc/msgque.c
new file mode 100644
index 000..c101e25
--- /dev/null
+++ b/tools/testing/selftests/ipc/msgque.c
@@ -0,0 +1,151 @@
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "zdtmtst.h"

- This file "zdtmtst.h" seems to be missing
- There is no "Makefile" in this folder
- The folder "ipc" is not added to tools/testing/selftests/Makefile.

---


Re: [PATCH v4 8/9] IPC: message queue copy feature introduced

2012-09-11 Thread Manfred Spraul

On 08/13/2012 02:32 PM, Stanislav Kinsbursky wrote:

This patch is required for checkpoint/restore in userspace.
IOW, c/r requires some way to get all pending IPC messages without deleting
them from the queue (a checkpoint can fail, in which case the tasks will be
resumed, so the queue has to remain valid).
To achieve this, a new operation flag MSG_COPY for the sys_msgrcv() system call
is introduced. If this flag is specified, then mtype is interpreted as the
number of the message to copy.
If MSG_COPY is set, then the kernel allocates a dummy message with the passed
size, and then uses the new copy_msg() helper function to copy the desired
message (instead of unlinking it from the queue).

Notes:
1) Return -ENOSYS if MSG_COPY is specified, but CONFIG_CHECKPOINT_RESTORE is
not set.

How is CONFIG_CHECKPOINT_RESTORE set?
I'm not sure, but I think it should be added to Kconfig.
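
As an aside, a userspace checkpointer would use the proposed flag roughly as
below. This is only a sketch derived from the description above, not code from
the patch set: the MSG_COPY value is an assumption (it matches what eventually
went into the mainline uapi header), and combining it with IPC_NOWAIT is an
assumption as well.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#ifndef MSG_COPY
#define MSG_COPY	040000	/* assumed value - not defined by glibc */
#endif

/* Copy out the first 'count' pending messages without dequeueing them. */
static int dump_queue(int msqid, int count)
{
	struct { long mtype; char mtext[8192]; } m;
	int i;

	for (i = 0; i < count; i++) {
		/* with MSG_COPY, the msgtyp argument selects the i-th message */
		ssize_t n = msgrcv(msqid, &m, sizeof(m.mtext), i,
				   MSG_COPY | IPC_NOWAIT);
		if (n < 0) {
			perror("msgrcv(MSG_COPY)");
			return -1;
		}
		printf("msg %d: type=%ld len=%zd\n", i, m.mtype, n);
	}
	return 0;
}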

--
Manfred



[PATCH v2] ipc/sem.c: synchronize semop and semctl with IPC_RMID

2013-10-03 Thread Manfred Spraul
After acquiring the semlock spinlock, operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but should
  be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
  access memory after free.

The patch also:
- standardizes the tests for .deleted, so that all tests in one
  function leave the function with the same approach.
- unconditionally tests for .deleted immediately after every call to
  sem_lock - even if it means that for semctl(GETALL), .deleted will be
  tested twice.

Both changes make the review simpler: After every sem_lock, there must
be a test of .deleted, followed by a goto to the cleanup code (if the
function uses "goto cleanup").
The only exception is semctl_down(): If sem_ids().rwsem is locked, then
the presence in ids->ipcs_idr is equivalent to !.deleted, thus no additional
test is required.

Davidlohr: What do you think?

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 8c4f59b..db9d241 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1282,6 +1282,12 @@ static int semctl_setval(struct ipc_namespace *ns, int 
semid, int semnum,
 
sem_lock(sma, NULL, -1);
 
+   if (sma->sem_perm.deleted) {
+   sem_unlock(sma, -1);
+   rcu_read_unlock();
+   return -EIDRM;
+   }
+
curr = &sma->sem_base[semnum];
 
ipc_assert_locked_object(&sma->sem_perm);
@@ -1336,12 +1342,14 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
int i;
 
sem_lock(sma, NULL, -1);
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
+   }
if(nsems > SEMMSL_FAST) {
if (!ipc_rcu_getref(sma)) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
sem_unlock(sma, -1);
rcu_read_unlock();
@@ -1354,10 +1362,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
}
for (i = 0; i < sma->sem_nsems; i++)
@@ -1375,8 +1381,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
struct sem_undo *un;
 
if (!ipc_rcu_getref(sma)) {
-   rcu_read_unlock();
-   return -EIDRM;
+   err = -EIDRM;
+   goto out_rcu_wakeup;
}
rcu_read_unlock();
 
@@ -1404,10 +1410,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
 
for (i = 0; i < nsems; i++)
@@ -1431,6 +1435,10 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
goto out_rcu_wakeup;
 
sem_lock(sma, NULL, -1);
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
+   }
curr = &sma->sem_base[semnum];
 
switch (cmd) {
@@ -1836,6 +1844,10 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
if (error)
goto out_rcu_wakeup;
 
+   error = -EIDRM;
+   locknum = sem_lock(sma, sops, nsops);
+   if (sma->sem_perm.deleted)
+   goto out_unlock_free;
/*
 * semid identifiers are not unique - find_alloc_undo may have
 * allocated an undo structure, it was invalidated by an RMID
@@ -1843,8 +1855,6 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
 * This case can be detected checking un->semid. The existence of
 * "un" itself is guaranteed by rcu.
 */
-   error = -EIDRM;
-   locknum = sem_lock(sma, sops, nsops);
if (un &&

[PATCH 1/2] ipc/sem.c: Race in sem_lock()

2013-09-14 Thread Manfred Spraul
The exclusion of complex operations in sem_lock() is insufficient:
After acquiring the per-semaphore lock, a simple op must first check that
sem_perm.lock is not locked and only after that test check complex_count.
The current code does it the other way around - and that creates a race.
Details are below.

The patch is a complete rewrite of sem_lock(), based in part on the code from
Mike Galbraith. It removes all gotos and all loops and thus the risk of
livelocks.

I have tested the patch (together with the next one) on my i3 laptop and it
didn't cause any problems.

What do you think?
I think the patch should be merged because the current sem_lock function is
much more complex than necessary.

The bug is probably also present in 3.10 and 3.11, but for these kernels
it is probably simpler just to move the test of sma->complex_count after
the spin_is_locked() test.
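
Roughly, that stable-only change would just swap the order of the two checks
inside sem_lock() - a sketch against the code quoted in the trace below, not a
tested patch:

	spin_lock(&sem->lock);

	/* check the global lock first ... */
	if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
		spin_unlock(&sem->lock);
		spin_unlock_wait(&sma->sem_perm.lock);
		goto again;
	}
	/* ... and only then re-check complex_count under sem->lock */
	if (unlikely(sma->complex_count)) {
		spin_unlock(&sem->lock);
		goto lock_array;
	}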

Details of the bug:

Assume:
- sma->complex_count = 0.
- Thread 1: semtimedop(complex op that must sleep)
- Thread 2: semtimedop(simple op).

Pseudo-Trace:

Thread 1: sem_lock(): acquire sem_perm.lock
Thread 1: sem_lock(): check for ongoing simple ops
Nothing ongoing, thread 2 is still before sem_lock().
Thread 1: try_atomic_semop()
<<< preempted.

Thread 2: sem_lock():
static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
  int nsops)
{
int locknum;
 again:
if (nsops == 1 && !sma->complex_count) {
struct sem *sem = sma->sem_base + sops->sem_num;

/* Lock just the semaphore we are interested in. */
spin_lock(&sem->lock);

/*
 * If sma->complex_count was set while we were spinning,
 * we may need to look at things we did not lock here.
 */
if (unlikely(sma->complex_count)) {
spin_unlock(&sem->lock);
goto lock_array;
}
<<<<<<<<<
<<< complex_count is still 0.
<<<
<<< Here it is preempted
<<<<<<<<<

Thread 1: try_atomic_semop() returns, notices that it must sleep.
Thread 1: increases sma->complex_count.
Thread 1: drops sem_perm.lock
Thread 2:
/*
 * Another process is holding the global lock on the
 * sem_array; we cannot enter our critical section,
 * but have to wait for the global lock to be released.
 */
if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
spin_unlock(&sem->lock);
spin_unlock_wait(&sma->sem_perm.lock);
        goto again;
}
<<< sem_perm.lock already dropped, thus no "goto again;"

locknum = sops->sem_num;

Signed-off-by: Manfred Spraul 
Cc: Mike Galbraith 
Cc: Rik van Riel 
Cc: Davidlohr Bueso 
Cc: Andrew Morton 
---
 ipc/sem.c | 122 +++---
 1 file changed, 78 insertions(+), 44 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 69b6a21..4836ea7 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -244,70 +244,104 @@ static void merge_queues(struct sem_array *sma)
 }
 
 /*
+ * Wait until all currently ongoing simple ops have completed.
+ * Caller must own sem_perm.lock.
+ * New simple ops can start, because simple ops first check
+ * that sem_perm.lock is free.
+ */
+static void sem_wait_array(struct sem_array *sma)
+{
+   int i;
+   struct sem *sem;
+
+   for (i = 0; i < sma->sem_nsems; i++) {
+   sem = sma->sem_base + i;
+   spin_unlock_wait(&sem->lock);
+   }
+}
+
+/*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
  * Otherwise, lock the entire semaphore array, since we either have
  * multiple semaphores in our own semops, or we need to look at
  * semaphores from other pending complex operations.
- *
- * Carefully guard against sma->complex_count changing between zero
- * and non-zero while we are spinning for the lock. The value of
- * sma->complex_count cannot change while we are holding the lock,
- * so sem_unlock should be fine.
- *
- * The global lock path checks that all the local locks have been released,
- * checking each local lock once. This means that the local lock paths
- * cannot start their critical sections while the global lock is held.
  */
 static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
  int nsops)
 {
-   int lo

[PATCH 2/2] ipc/sem.c: optimize sem_lock().

2013-09-14 Thread Manfred Spraul
Operations that need access to the whole array must guarantee that there are
no simple operations ongoing. Right now this is achieved by
spin_unlock_wait(sem->lock) on all semaphores.

If complex_count is nonzero, then this spin_unlock_wait() is not necessary:
it was already performed in the past by the thread that increased
complex_count, and even though sem_perm.lock was dropped in between, no simple
operation could have started, because simple operations cannot start while
complex_count is non-zero.

What do you think?
The patch survived some testing.

It's not a bugfix - thus I don't know if it should go into linux-next first.

Signed-off-by: Manfred Spraul 
Cc: Mike Galbraith 
Cc: Rik van Riel 
Cc: Davidlohr Bueso 
Cc: Andrew Morton 
---
 ipc/sem.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 4836ea7..5274ed1 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -248,12 +248,20 @@ static void merge_queues(struct sem_array *sma)
  * Caller must own sem_perm.lock.
  * New simple ops can start, because simple ops first check
  * that sem_perm.lock is free.
+ * that a) sem_perm.lock is free and b) complex_count is 0.
  */
 static void sem_wait_array(struct sem_array *sma)
 {
int i;
struct sem *sem;
 
+   if (sma->complex_count)  {
+   /* The thread that increased sma->complex_count waited on
+* all sem->lock locks. Thus we don't need to wait again.
+*/
+   return;
+   }
+
for (i = 0; i < sma->sem_nsems; i++) {
sem = sma->sem_base + i;
spin_unlock_wait(&sem->lock);
-- 
1.8.3.1



Re: [PATCH 1/2] ipc/sem.c: Race in sem_lock()

2013-09-15 Thread Manfred Spraul

Hi all,

On 09/15/2013 08:09 AM, Mike Galbraith wrote:

On Sat, 2013-09-14 at 23:34 +0200, Manfred Spraul wrote:


The bug is probably also present in 3.10 and 3.11, but for these kernels
it is probably simpler just to move the test of sma->complex_count after
the spin_is_locked() test.

IMHO, your 6 patch series should go to stable as well.  Scalability is
still BAD without them.  Now, you've shown the lock split to be buggy.

Logically, the whole thing should be reverted entirely in stable, or
fixed up properly.

Davidlohr: Are you working on fixing the open issues?

IMHO Mike is right, especially for the 3.10 long-term kernel:
Either everything in ipc/*.c must be reverted or it should be fixed 
properly (i.e.: cherry-pick ipc/*)


I have created bugzilla entries for all issues I'm aware of:

https://bugzilla.kernel.org/show_bug.cgi?id=61351
I sent a patch yesterday.

https://bugzilla.kernel.org/show_bug.cgi?id=61321
https://bugzilla.kernel.org/show_bug.cgi?id=61331
https://bugzilla.kernel.org/show_bug.cgi?id=61341
https://bugzilla.kernel.org/show_bug.cgi?id=61361
https://bugzilla.kernel.org/show_bug.cgi?id=61371
No patches for these 5 bugs.

And: Given these numbers from Mike, I would hate to revert anything:
On 09/15/2013 10:06 AM, Mike Galbraith wrote:

On Sun, 2013-09-15 at 08:09 +0200, Mike Galbraith wrote:


Humongous improvements...

(a couple sem-waitzero numbers)

master:  Cpus 64, interleave 1 delay 0: 10039494796 in 30 secs
3.10.10: Cpus 64, interleave 1 delay 0:   129236313 in 30 secs

(rapidly scrolling micro-font bench vs reality disclaimer)

One semop() completed every 3 ns, around 600 cpu ticks per operation.
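
For reference, the arithmetic behind that sentence (the ~3 GHz clock rate is an
assumption):

  10039494796 ops / 30 s ~= 3.3e8 ops/s ~= one completed semop every 3 ns (wall clock)
  3 ns * 64 cpus         ~= 190 ns of aggregate cpu time per semop
  190 ns * ~3 GHz        ~= 600 cpu ticks per operation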

--
Manfred


Re: [PATCH 1/4] ipc,shm: fix race with selinux

2013-09-16 Thread Manfred Spraul

On 09/16/2013 05:04 AM, Davidlohr Bueso wrote:

Currently, we check shm security only under RCU. Since selinux
can free the security structure, through selinux_sem_free_security(),
we can run into a use-after-free condition. This bug affects both
shmctl and shmat syscalls.

The fix is obvious, make sure we hold the kern_ipc_perm.lock while
performing these security checks.

Actually: either kern_ipc_perm or down_xx(&shm_ids(ns).rwsem) is sufficient.



Reported-by: Manfred Spraul 
Signed-off-by: Davidlohr Bueso 
---
  ipc/shm.c | 23 ++-
  1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 2821cdf..bc3e897 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -781,18 +781,17 @@ static int shmctl_down(struct ipc_namespace *ns, int 
shmid, int cmd,
  
  	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
  
+	ipc_lock_object(&shp->shm_perm);

err = security_shm_shmctl(shp, cmd);
if (err)
-   goto out_unlock1;
+   goto out_unlock0;
This change is not necessary: down_write(shm_ids(ns).rwsem) already 
synchronizes against another IPC_RMID.

But it doesn't hurt.


@@ -960,11 +962,12 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct 
shmid_ds __user *, buf)
}
  
  		audit_ipc_obj(&(shp->shm_perm));

+
+   ipc_lock_object(&shp->shm_perm);
err = security_shm_shmctl(shp, cmd);

What about audit_ipc_obj()?
audit_ipc_obj()
  -> __audit_ipc_obj()
  -> security_ipc_getsecid()
  -> security_ops->ipc_getsecid, which can be selinux_ipc_getsecid()
selinux_ipc_getsecid() dereferences ipcp->security.

Please: Restart from 3.0.9 (i.e. before the scalability improvement 
project was started)
Every function that is moved from "synchronization with ipc_lock()" to 
"only rcu_read_lock() held" must be checked.


--
Manfred


Re: soft lockup in sysvipc code.

2013-09-07 Thread Manfred Spraul

Hi Dave,

On 09/04/2013 11:50 PM, Dave Jones wrote:

Haven't seen this before.
Tree based on v3.11-3104-gf357a82

BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:25479]
Modules linked in: sctp snd_seq_dummy fuse dlci rfcomm tun bnep hidp ipt_ULOG 
nfnetlink can_raw can_bcm scsi_transport_iscsi nfc caif_socket caif af_802154 
phonet af_rxrpc bluetooth rfkill can llc2 pppoe pppox ppp_generic slhc irda 
crc_ccitt rds af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc 
ax25 xfs snd_hda_codec_realtek libcrc32c snd_hda_intel snd_hda_codec snd_hwdep 
snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd soundcore pcspkr 
usb_debug e1000e ptp pps_core
irq event stamp: 1143030
hardirqs last  enabled at (1143029): [] restore_args+0x0/0x30
hardirqs last disabled at (1143030): [] 
apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (1143028): [] 
__do_softirq+0x198/0x460
softirqs last disabled at (1143023): [] irq_exit+0x135/0x150
CPU: 0 PID: 25479 Comm: trinity-child0 Not tainted 3.11.0+ #44
task: 88022c013f90 ti: 88022bd8c000 task.ti: 88022bd8c000
RIP: 0010:[]  [] 
idr_find_slowpath+0x9b/0x150
RSP: 0018:88022bd8dc88  EFLAGS: 0206
RAX: 0006 RBX: 000a6c0a RCX: 0008
RDX: 0008 RSI: 81c41040 RDI: 88022c014668
RBP: 88022bd8dca0 R08:  R09: 
R10: 0001 R11: 0001 R12: 88023831a290
R13: 0001 R14: 88022bd8dbe8 R15: 8802449d
FS:  7fcfcad2c740() GS:88024480() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fcfc84cb968 CR3: 0001de93f000 CR4: 001407f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Stack:
  0260 2dba 81c7e258 88022bd8dcf8
  812b1131 88022c013f90 8801d37174c0 88022bd8dd38
  81c7e2f0 88022bd8dd38 8801e065cec8 880241d86ca8
Call Trace:
  [] sysvipc_find_ipc+0x61/0x300
  [] sysvipc_proc_next+0x46/0xd0
  [] traverse.isra.7+0xc9/0x260
  [] ? lock_release_non_nested+0x308/0x350
  [] seq_read+0x3e1/0x450
  [] ? proc_reg_write+0x80/0x80
  [] proc_reg_read+0x3d/0x80

Do you have any details about load?

I haven't seen it either - but I don't have a stress test that does
#while true;cat /proc/sysvipc/*>/dev/null;done
in parallel with create/remove/whatever operations.
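
A minimal sketch of such a stress test - not an existing tool, just an
illustration (the proc file and loop structure are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
	if (fork() == 0) {
		/* reader: re-read the /proc interface as fast as possible */
		char buf[4096];
		for (;;) {
			FILE *f = fopen("/proc/sysvipc/msg", "r");
			if (!f)
				exit(1);
			while (fread(buf, 1, sizeof(buf), f) > 0)
				;
			fclose(f);
		}
	}
	/* writer: create and immediately remove message queues */
	for (;;) {
		int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
		if (id >= 0)
			msgctl(id, IPC_RMID, NULL);
	}
	return 0;
}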

Davidlohr: Have you done any stress tests for the /proc interface?

--
Manfred


Re: [PATCH] ipc,msg: shorten critical region in msgsnd

2013-09-12 Thread Manfred Spraul

Hi Davidlohr,

I think the patch (3dd1f784ed6603d7ab1043e51e6371235edf2313) is still 
unsafe, i.e. my correction (bebcb928c820d0ee83aca4b192adc195e43e66a2) 
doesn't fix everything:


AFAICS, ipc_obtain_object_check:
- look up the id in the idr tree
- check if it is deleted
- return without taking any locks.

This means that the "is not deleted" information can be stale immediately.

Thus do_msgsnd() in ipc/msg.c contains a memory leak:

   rcu_read_lock();
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
err = PTR_ERR(msq);
goto out_unlock1;
}

 what if the code is preempted here and RMID is processed?
The code below would queue the message into an already removed queue.
The queue is freed by the rcu callback, but the message memory is leaked.


ipc_lock_object(&msq->q_perm);



Is this analysis correct?

And: What about the other users of obtain_object_check?
exit_sem() is also quite long, but I didn't spot any obvious problems.

--
Manfred


Re: [PATCH] ipc,msg: shorten critical region in msgsnd

2013-09-12 Thread Manfred Spraul

Hi all,

On 09/12/2013 02:20 PM, Manfred Spraul wrote:


And: What about the other users of obtain_object_check?
exit_sem() is also quite long, but I didn't spot any obvious problems.


a) I think semtimedop(), msgsnd() and msgrcv() must be fixed:
They either leak memory or tasks can sleep forever.
I haven't checked the shm code, I would expect that there are similar 
problems.


b) There are additional races at least with selinux:
security/selinux/hooks.c
- selinux_sem_semop() accesses sma->sem_perm.security->sid.
- selinux_sem_free_security() does kfree() on q_perm.security.

Right now, both operations can happen in parallel -> use after free.

I think the security_xx_yy() calls within ipc/*.c must only be called:
- after checking _perm.deleted
- with ipc_perm.lock acquired (to prevent parallel RMID calls).
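
A sketch of that calling convention, reusing the msg.c names quoted earlier in
this thread (illustrative only, not a patch):

	rcu_read_lock();
	msq = msq_obtain_object_check(ns, msqid);	/* lookup only, no lock */
	if (IS_ERR(msq)) {
		err = PTR_ERR(msq);
		goto out_unlock1;
	}

	ipc_lock_object(&msq->q_perm);			/* take the object lock */
	if (msq->q_perm.deleted) {			/* a parallel RMID won */
		err = -EIDRM;
		goto out_unlock0;
	}
	/* only now is it safe to look at the security blob */
	err = security_msg_queue_msgsnd(msq, msg, msgflg);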

Davidlohr:
What would be your proposal?

--
Manfred


Re: ipc-msg broken again on 3.11-rc7?

2013-09-02 Thread Manfred Spraul

Hi,

[forgot to cc everyone, thus I'll summarize some mails...]
On 09/02/2013 06:58 AM, Vineet Gupta wrote:

On 08/31/2013 11:20 PM, Linus Torvalds wrote:

Vineet, actual patch for what Davidlohr suggests attached. Can you try it?

  Linus

Apologies for late in getting back to this - I was away from my computer for a 
bit.

Unfortunately, with a quick test, this patch doesn't help.
FWIW, this is latest mainline (.config attached).

Let me know what diagnostics I can add to help with this.


msgctl08 is a bulk message send/receive test. I had to look at it once
before - back then it turned out to be broken hardware:

https://lkml.org/lkml/2008/6/12/365
This can be ruled out, because it works with 3.10.

msgctl08 uses pairs of threads: one thread does msgsnd(), the other one
msgrcv().
There is no synchronization, i.e. the msgsnd() side can race ahead until the
kernel buffer is full and then block until msgrcv() catches up, or it could
be pairs of alternating msgsnd()/msgrcv() operations.
No special features are used: each pair of threads has its own message
queues, all messages have type=1.
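
For reference, a minimal sketch of such an unsynchronized pair (not the LTP
source; message size, count and the pthread setup are arbitrary - build with
-pthread):

#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define NMSG 100000

struct mbuf { long mtype; char mtext[64]; };
static int qid;

static void *sender(void *arg)
{
	struct mbuf m = { .mtype = 1 };
	strcpy(m.mtext, "hello");
	for (int i = 0; i < NMSG; i++)
		if (msgsnd(qid, &m, sizeof(m.mtext), 0) < 0)	/* blocks when the queue is full */
			perror("msgsnd");
	return NULL;
}

static void *receiver(void *arg)
{
	struct mbuf m;
	for (int i = 0; i < NMSG; i++)
		if (msgrcv(qid, &m, sizeof(m.mtext), 1, 0) < 0)	/* blocks when the queue is empty */
			perror("msgrcv");
	return NULL;
}

int main(void)
{
	pthread_t s, r;

	qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	pthread_create(&s, NULL, sender, NULL);
	pthread_create(&r, NULL, receiver, NULL);
	pthread_join(s, NULL);
	pthread_join(r, NULL);
	msgctl(qid, IPC_RMID, NULL);
	return 0;
}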


Vineet ran strace - and just before the signal from killing msgctl08, 
there are only msgsnd()/msgrcv() calls.

Vineet:
a) could you run strace tomorrow again, with '-ttt' as an additional 
option? I don't see where exactly it hangs.

b) Could you check that it is not just a performance regression?
Does ./msgctl08 1000 16 hang, too?

In ipc/msg.c, I haven't seen any obvious reason why it should hang.
The only race I spotted so far is this one:

  for (;;) {
struct msg_sender s;

err = -EACCES;
if (ipcperms(ns, &msq->q_perm, S_IWUGO))
goto out_unlock1;

err = security_msg_queue_msgsnd(msq, msg, msgflg);
if (err)
goto out_unlock1;

if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
1 + msq->q_qnum <= msq->q_qbytes) {
break;
}


[snip]

if (!pipelined_send(msq, msg)) {
/* no one is waiting for this message, enqueue it */
list_add_tail(&msg->m_list, &msq->q_messages);
msq->q_cbytes += msgsz;
msq->q_qnum++;
atomic_add(msgsz, &ns->msg_bytes);


The access to msq->q_cbytes is not protected. Thus two parallel msgsnd() 
calls could succeed, even if both together bring the queue length above 
the limit.
But it can't explain why 3.11-rc7 hangs: As explained above, msgctl08 
uses one queue for each thread pair.


--
Manfred



Re: ipc-msg broken again on 3.11-rc7?

2013-09-03 Thread Manfred Spraul

On 09/03/2013 10:44 AM, Vineet Gupta wrote:

b) Could you check that it is not just a performance regression?
  Does ./msgctl08 1000 16 hang, too?

Nope, that doesn't hang. The minimal configuration that hangs reliably is
msgctl08 5 2

With this config there are 3 processes.
...
   555   554 root S 1208  0.4   0  0.0 ./msgctl08 5 2
   554   551 root S 1208  0.4   0  0.0 ./msgctl08 5 2
   551   496 root S 1208  0.4   0  0.0 ./msgctl08 5 2
...

[ARCLinux]$ cat /proc/551/stack
[<80aec3c6>] do_wait+0xa02/0xc94
[<80aecad2>] SyS_wait4+0x52/0xa4
[<80ae24fc>] ret_from_system_call+0x0/0x4

[ARCLinux]$ cat /proc/555/stack
[<80c2950e>] SyS_msgrcv+0x252/0x420
[<80ae24fc>] ret_from_system_call+0x0/0x4

[ARCLinux]$ cat /proc/554/stack
[<80c28c82>] do_msgsnd+0x116/0x35c
[<80ae24fc>] ret_from_system_call+0x0/0x4

Is this a case of lost wakeup or some such. I'm running with some more 
diagnostics
and will report soon ...

What is the output of ipcs -q? Is the queue full or empty when it hangs?
I.e. do we forget to wake up a receiver or forget to wake up a sender?

--
Manfred


Re: ipc-msg broken again on 3.11-rc7?

2013-09-03 Thread Manfred Spraul

On 09/03/2013 11:16 AM, Vineet Gupta wrote:

On 09/03/2013 02:27 PM, Manfred Spraul wrote:

On 09/03/2013 10:44 AM, Vineet Gupta wrote:

b) Could you check that it is not just a performance regression?
   Does ./msgctl08 1000 16 hang, too?

Nope, that doesn't hang. The minimal configuration that hangs reliably is
msgctl08 5 2

With this config there are 3 processes.
...
555   554 root S 1208  0.4   0  0.0 ./msgctl08 5 2
554   551 root S 1208  0.4   0  0.0 ./msgctl08 5 2
551   496 root S 1208  0.4   0  0.0 ./msgctl08 5 2
...

[ARCLinux]$ cat /proc/551/stack
[<80aec3c6>] do_wait+0xa02/0xc94
[<80aecad2>] SyS_wait4+0x52/0xa4
[<80ae24fc>] ret_from_system_call+0x0/0x4

[ARCLinux]$ cat /proc/555/stack
[<80c2950e>] SyS_msgrcv+0x252/0x420
[<80ae24fc>] ret_from_system_call+0x0/0x4

[ARCLinux]$ cat /proc/554/stack
[<80c28c82>] do_msgsnd+0x116/0x35c
[<80ae24fc>] ret_from_system_call+0x0/0x4

Is this a case of lost wakeup or some such. I'm running with some more 
diagnostics
and will report soon ...

What is the output of ipcs -q? Is the queue full or empty when it hangs?
I.e. do we forget to wake up a receiver or forget to wake up a sender?

/ # ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x72d83160 163841     root       600        0            0



Ok, a sender is sleeping - even though there are no messages in the queue.
Perhaps it is the race that I mentioned in a previous mail:

  for (;;) {
struct msg_sender s;

err = -EACCES;
if (ipcperms(ns, &msq->q_perm, S_IWUGO))
goto out_unlock1;

err = security_msg_queue_msgsnd(msq, msg, msgflg);
if (err)
goto out_unlock1;

if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
1 + msq->q_qnum <= msq->q_qbytes) {
break;
}


[snip]

if (!pipelined_send(msq, msg)) {
/* no one is waiting for this message, enqueue it */
list_add_tail(&msg->m_list, &msq->q_messages);
msq->q_cbytes += msgsz;
msq->q_qnum++;
atomic_add(msgsz, &ns->msg_bytes);


The access to msq->q_cbytes is not protected.

Vineet, could you try to move the test for free space after ipc_lock?
I.e. the lock must not get dropped between testing for free space and 
enqueueing the messages.


--
Manfred


Re: ipc-msg broken again on 3.11-rc7?

2013-09-03 Thread Manfred Spraul

Hi Vineet,

On 09/03/2013 11:51 AM, Vineet Gupta wrote:

On 09/03/2013 02:53 PM, Manfred Spraul wrote:


The access to msq->q_cbytes is not protected.

Vineet, could you try to move the test for free space after ipc_lock?
I.e. the lock must not get dropped between testing for free space and
enqueueing the messages.

Hmm, the code movement is not trivial. I broke even the simplest of cases (patch
attached). This includes the additional change which Linus/Davidlohr had asked 
for.

The attached patch should work. Could you try it?

--
Manfred
diff --git a/ipc/msg.c b/ipc/msg.c
index 9f29d9e..b65fdf1 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -680,16 +680,18 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
goto out_unlock1;
}
 
+   ipc_lock_object(&msq->q_perm);
+
for (;;) {
struct msg_sender s;
 
err = -EACCES;
if (ipcperms(ns, &msq->q_perm, S_IWUGO))
-   goto out_unlock1;
+   goto out_unlock0;
 
err = security_msg_queue_msgsnd(msq, msg, msgflg);
if (err)
-   goto out_unlock1;
+   goto out_unlock0;
 
if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
1 + msq->q_qnum <= msq->q_qbytes) {
@@ -699,10 +701,9 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
/* queue full, wait: */
if (msgflg & IPC_NOWAIT) {
err = -EAGAIN;
-   goto out_unlock1;
+   goto out_unlock0;
}
 
-   ipc_lock_object(&msq->q_perm);
ss_add(msq, &s);
 
if (!ipc_rcu_getref(msq)) {
@@ -730,10 +731,7 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
goto out_unlock0;
}
 
-   ipc_unlock_object(&msq->q_perm);
}
-
-   ipc_lock_object(&msq->q_perm);
msq->q_lspid = task_tgid_vnr(current);
msq->q_stime = get_seconds();
 


[PATCH] ipc/msg.c: Fix lost wakeup in msgsnd().

2013-09-03 Thread Manfred Spraul
The check if the queue is full and adding current to the wait queue of pending
msgsnd() operations (ss_add()) must be atomic.

Otherwise:
- the thread that performs msgsnd() finds a full queue and decides to sleep.
- the thread that performs msgrcv() calls first reads all messages from the
  queue and then sleep, because the queue is empty.
- the msgrcv() calls do not perform any wakeups, because the msgsnd() task
  has not yet called ss_add().
- then the msgsnd()-thread first calls ss_add() and then sleeps.
Net result: msgsnd() and msgrcv() both sleep forever.

Observed with msgctl08 from ltp with a preemptible kernel.

Fix: Call ipc_lock_object() before performing the check.

The patch also moves security_msg_queue_msgsnd() under ipc_lock_object:
- msgctl(IPC_SET) explicitly mentions that it tries to expunge any pending
  operations that are not allowed anymore with the new permissions.
  If security_msg_queue_msgsnd() is called without locks, then there might be
  races.
- it makes the patch much simpler.

Reported-by: Vineet Gupta 
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 9f29d9e..b65fdf1 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -680,16 +680,18 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
goto out_unlock1;
}
 
+   ipc_lock_object(&msq->q_perm);
+
for (;;) {
struct msg_sender s;
 
err = -EACCES;
if (ipcperms(ns, &msq->q_perm, S_IWUGO))
-   goto out_unlock1;
+   goto out_unlock0;
 
err = security_msg_queue_msgsnd(msq, msg, msgflg);
if (err)
-   goto out_unlock1;
+   goto out_unlock0;
 
if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
1 + msq->q_qnum <= msq->q_qbytes) {
@@ -699,10 +701,9 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
/* queue full, wait: */
if (msgflg & IPC_NOWAIT) {
err = -EAGAIN;
-   goto out_unlock1;
+   goto out_unlock0;
}
 
-   ipc_lock_object(&msq->q_perm);
ss_add(msq, &s);
 
if (!ipc_rcu_getref(msq)) {
@@ -730,10 +731,7 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
goto out_unlock0;
}
 
-   ipc_unlock_object(&msq->q_perm);
}
-
-   ipc_lock_object(&msq->q_perm);
msq->q_lspid = task_tgid_vnr(current);
msq->q_stime = get_seconds();
 
-- 
1.8.3.1



Re: [PATCH] ipc/msg.c: Fix lost wakeup in msgsnd().

2013-09-03 Thread Manfred Spraul

Hi Sedat,

On 09/03/2013 06:13 PM, Sedat Dilek wrote:

On Tue, Sep 3, 2013 at 4:00 PM, Manfred Spraul  wrote:

The check if the queue is full and adding current to the wait queue of pending
msgsnd() operations (ss_add()) must be atomic.

Otherwise:
- the thread that performs msgsnd() finds a full queue and decides to sleep.
- the thread that performs msgrcv() calls first reads all messages from the
   queue and then sleep, because the queue is empty.

reads -> sleeps

Correct.

- the msgrcv() calls do not perform any wakeups, because the msgsnd() task
   has not yet called ss_add().
- then the msgsnd()-thread first calls ss_add() and then sleeps.
Net result: msgsnd() and msgrcv() both sleep forever.


I don't know what and why "net result" - net in sense of networking?

http://en.wiktionary.org/wiki/net#Adjective
I.e.: Ignore/remove the "Net".


Observed with msgctl08 from ltp with a preemptible kernel.


...on ARC arch (that sounds funny somehow).


Fix: Call ipc_lock_object() before performing the check.

The patch also moves security_msg_queue_msgsnd() under ipc_lock_object:
- msgctl(IPC_SET) explicitly mentions that it tries to expunge any pending
   operations that are not allowed anymore with the new permissions.
   If security_msg_queue_msgsnd() is called without locks, then there might be
   races.
- it makes the patch much simpler.

Reported-by: Vineet Gupta 
Signed-off-by: Manfred Spraul 

I guess this is missing a "CC: stable" as Vineet reported against
Linux v3.11-rc7 (and should enter v3.11.1)?

Yes. I didn't notice that Linus already released 3.11.

--
Manfred

- Sedat -


---
  ipc/msg.c | 12 +---
  1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 9f29d9e..b65fdf1 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -680,16 +680,18 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
 goto out_unlock1;
 }

+   ipc_lock_object(&msq->q_perm);
+
 for (;;) {
 struct msg_sender s;

 err = -EACCES;
 if (ipcperms(ns, &msq->q_perm, S_IWUGO))
-   goto out_unlock1;
+   goto out_unlock0;

 err = security_msg_queue_msgsnd(msq, msg, msgflg);
 if (err)
-   goto out_unlock1;
+   goto out_unlock0;

 if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
 1 + msq->q_qnum <= msq->q_qbytes) {
@@ -699,10 +701,9 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
 /* queue full, wait: */
 if (msgflg & IPC_NOWAIT) {
 err = -EAGAIN;
-   goto out_unlock1;
+   goto out_unlock0;
 }

-   ipc_lock_object(&msq->q_perm);
 ss_add(msq, &s);

 if (!ipc_rcu_getref(msq)) {
@@ -730,10 +731,7 @@ long do_msgsnd(int msqid, long mtype, void __user *mtext,
 goto out_unlock0;
 }

-   ipc_unlock_object(&msq->q_perm);
 }
-
-   ipc_lock_object(&msq->q_perm);
 msq->q_lspid = task_tgid_vnr(current);
 msq->q_stime = get_seconds();

--
1.8.3.1





[PATCH 2/2] Update sem_otime for all operations

2013-09-25 Thread Manfred Spraul
Hi Jia,

Could you check if the patch below resolves the bug you have reported?
Just this patch, i.e. without your proposal.

I want to leave the optimization for the get_seconds() call:
We must update sem_otime in two places anyway
(either perform_atomic_semop() and exit_sem() or
do_smart_update() and semtimedop())

--
Manfred

In commit 0a2b9d4c, the update of the semaphore's sem_otime (last semop time)
was moved to one central position (do_smart_update).

But: Since do_smart_update() is only called for operations that modify
the array, this means that wait-for-zero semops do not update sem_otime
anymore.

The fix is simple:
Non-alter operations must update sem_otime.

Signed-off-by: Manfred Spraul 
Reported-by: Jia He 
---
 ipc/sem.c | 41 +
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index e5d9bb8..ec83f79 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -910,6 +910,24 @@ again:
 }
 
 /**
+ * set_semotime(sma, sops) - set sem_otime
+ * @sma: semaphore array
+ * @sops: operations that modified the array, may be NULL
+ *
+ * sem_otime is replicated to avoid cache line trashing.
+ * This function sets one instance to the current time.
+ */
+static void set_semotime(struct sem_array *sma, struct sembuf *sops)
+{
+   if (sops == NULL) {
+   sma->sem_base[0].sem_otime = get_seconds();
+   } else {
+   sma->sem_base[sops[0].sem_num].sem_otime =
+   get_seconds();
+   }
+}
+
+/**
  * do_smart_update(sma, sops, nsops, otime, pt) - optimized update_queue
  * @sma: semaphore array
  * @sops: operations that were performed
@@ -959,17 +977,10 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
}
}
}
-   if (otime) {
-   if (sops == NULL) {
-   sma->sem_base[0].sem_otime = get_seconds();
-   } else {
-   sma->sem_base[sops[0].sem_num].sem_otime =
-   get_seconds();
-   }
-   }
+   if (otime)
+   set_semotime(sma, sops);
 }
 
-
 /* The following counts are associated to each semaphore:
  *   semncntnumber of tasks waiting on semval being nonzero
  *   semzcntnumber of tasks waiting on semval being zero
@@ -1831,10 +1842,16 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
 
error = perform_atomic_semop(sma, sops, nsops, un,
task_tgid_vnr(current));
-   if (error <= 0) {
-   if (alter && error == 0)
+   if (error == 0) {
+   /* If the operation was successful, then do
+* the required updates.
+*/
+   if (alter)
do_smart_update(sma, sops, nsops, 1, &tasks);
-
+   else
+   set_semotime(sma, sops);
+   }
+   if (error <= 0) {
goto out_unlock_free;
}
 
-- 
1.8.3.1



Re: [PATCH 2/4] ipc,shm: prevent race with rmid in shmat(2)

2013-09-26 Thread Manfred Spraul

Hi Davidlohr,

On 09/16/2013 05:04 AM, Davidlohr Bueso wrote:

This fixes a race in shmat() between finding the msq and
actually attaching the segment, as another thread can delete shmid
underneath us if we are preempted before acquiring the kern_ipc_perm.lock.
According to the man page, Linux supports attaching to deleted shm 
segments:


http://linux.die.net/man/2/shmat


On Linux, it is possible to attach a shared memory segment even if it 
is already marked to be deleted. However, POSIX.1-2001 does not 
specify this behavior and many other implementations do not support it.



Does your patch change that?
Unfortunately, I have neither any experience with ipc/shm nor any test 
cases.


And:
As far as I can see it's not a problem if we are attaching to a deleted 
segment: shm_close cleans up everything.


--
Manfred


Re: [PATCH 4/4] ipc,msg: prevent race with rmid in msgsnd,msgrcv

2013-09-26 Thread Manfred Spraul

Hi Andrew,

Could you include the patch in -mm and push it towards Linus?

On 09/16/2013 05:04 AM, Davidlohr Bueso wrote:

This fixes a race in both msgrcv() and msgsnd() between finding the msg and
actually dealing with the queue, as another thread can delete shmid
underneath us if we are preempted before acquiring the kern_ipc_perm.lock.

Manfred illustrates this nicely:

Assume a preemptible kernel that is preempted just after

msq = msq_obtain_object_check(ns, msqid)

in do_msgrcv().
The only lock that is held is rcu_read_lock().

Now the other thread processes IPC_RMID.
When the first task is resumed, then it will happily wait for messages on a
deleted queue.

Fix this by checking if the queue has been deleted after taking the lock.

Reported-by: Manfred Spraul 
Cc: sta...@vger.kernel.org # for 3.11
Signed-off-by: Davidlohr Bueso 

Signed-off-by: Manfred Spraul 



[PATCH] ipc/sem.c: synchronize semop and semctl with IPC_RMID

2013-09-30 Thread Manfred Spraul
After acquiring the semlock spinlock, the operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but should
  be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
  access memory after free.

The patch standardizes the tests for .deleted, so that all tests in one
function leave the function with the same approach.

Right now, it's a mixture of "goto cleanup", some cleanup and then
"goto further_cleanup" and all cleanup+"return -EIDRM" - that makes the
review much harder.

Davidlohr: Could you please review the patch?
I did some stress test, but probably I didn't hit exactly the modified
lines.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 43 ++-
 1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 19c8b98..a2fa795 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1229,6 +1229,12 @@ static int semctl_setval(struct ipc_namespace *ns, int 
semid, int semnum,
 
sem_lock(sma, NULL, -1);
 
+   if (sma->sem_perm.deleted) {
+   sem_unlock(sma, -1);
+   rcu_read_unlock();
+   return -EIDRM;
+   }
+
curr = &sma->sem_base[semnum];
 
ipc_assert_locked_object(&sma->sem_perm);
@@ -1285,10 +1291,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
sem_lock(sma, NULL, -1);
if(nsems > SEMMSL_FAST) {
if (!ipc_rcu_getref(sma)) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
sem_unlock(sma, -1);
rcu_read_unlock();
@@ -1301,10 +1305,13 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
+   }
+   } else {
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
}
}
for (i = 0; i < sma->sem_nsems; i++)
@@ -1322,8 +1329,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
struct sem_undo *un;
 
if (!ipc_rcu_getref(sma)) {
-   rcu_read_unlock();
-   return -EIDRM;
+   err = -EIDRM;
+   goto out_rcu_wakeup;
}
rcu_read_unlock();
 
@@ -1351,10 +1358,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
 
for (i = 0; i < nsems; i++)
@@ -1378,6 +1383,10 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
goto out_rcu_wakeup;
 
sem_lock(sma, NULL, -1);
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
+   }
curr = &sma->sem_base[semnum];
 
switch (cmd) {
@@ -1783,6 +1792,10 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
if (error)
goto out_rcu_wakeup;
 
+   error = -EIDRM;
+   locknum = sem_lock(sma, sops, nsops);
+   if (sma->sem_perm.deleted)
+   goto out_unlock_free;
/*
 * semid identifiers are not unique - find_alloc_undo may have
 * allocated an undo structure, it was invalidated by an RMID
@@ -1790,8 +1803,6 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
 * This case can be detected checking un->semid. The existence of
 * "un" itself is guaranteed by rcu.
 */
-   error = -EIDRM;
-   locknum = sem_lock(sma, sops, nsops);
if (un && un->semid == -1)
goto out_unlock_free;
 
@@ -1999,6 +2010,12 @@ void exit_sem(struct task_struct *tsk)
}
 
sem_lock(sma, NUL

Re: [PATCH] ipc/sem.c: synchronize semop and semctl with IPC_RMID

2013-09-30 Thread Manfred Spraul

Hi Davidlohr,

On 09/30/2013 07:54 PM, Davidlohr Bueso wrote:

Hi Manfred,

On Mon, 2013-09-30 at 11:13 +0200, Manfred Spraul wrote:

After acquiring the semlock spinlock, the operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but should
   be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
   access memory after free.

Yep, that was next on my list - I had a patch for semtimedop() but was
waiting to rebase it on top of your previous changes. Anyway thanks for
sending this.


The patch standardizes the tests for .deleted, so that all tests in one
function leave the function with the same approach.

Right now, it's a mixture of "goto cleanup", some cleanup and then
"goto further_cleanup" and all cleanup+"return -EIDRM" - that makes the
review much harder.

Davidlohr: Could you please review the patch?
I did some stress test, but probably I didn't hit exactly the modified
lines.

This shouldn't affect performance, if that's what you mean.
All goto's must go to the correct target, free everything, unlock 
everything, do not unlock twice, ...



  One more
read in the critical region won't make any difference. The patch looks
good, just one doubt below.



Signed-off-by: Manfred Spraul 
---
  ipc/sem.c | 43 ++-
  1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 19c8b98..a2fa795 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1229,6 +1229,12 @@ static int semctl_setval(struct ipc_namespace *ns, int 
semid, int semnum,
  
  	sem_lock(sma, NULL, -1);
  
+	if (sma->sem_perm.deleted) {

+   sem_unlock(sma, -1);
+   rcu_read_unlock();
+   return -EIDRM;
+   }
+
curr = &sma->sem_base[semnum];
  
  	ipc_assert_locked_object(&sma->sem_perm);

@@ -1285,10 +1291,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
sem_lock(sma, NULL, -1);
if(nsems > SEMMSL_FAST) {
if (!ipc_rcu_getref(sma)) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
sem_unlock(sma, -1);
rcu_read_unlock();
@@ -1301,10 +1305,13 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;

^^ check if nsems > SEMMSL_FAST

-   goto out_free;
+   goto out_unlock;
+   }
+   } else {
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
}

I'm a bit lost here. Why should we only check the existence of the sem
if nsems <= SEMMSL_FAST? Shouldn't the same apply either way?

It is checked in both branches:
- in the "nsems > SEMMSL_FAST" branch the check was always there, because
the lock is dropped for the kmalloc.


--
Manfred


Re: [PATCH] ipc/sem.c: fix update sem_otime when calling sem_op in semaphore initialization

2013-09-22 Thread Manfred Spraul

Hi all,

On 09/22/2013 10:26 AM, Mike Galbraith wrote:

On Sun, 2013-09-22 at 10:17 +0200, Mike Galbraith wrote:

On Sun, 2013-09-22 at 10:11 +0800, Jia He wrote:

In commit 0a2b9d4c, the update of the semaphore's sem_otime (last semop time)
was removed because he wanted to move setting sem->sem_otime to one
place. But after that, the initial semop() will not set the otime
because its sem_op value is 0 (semtimedop() only updates otime
if alter == 1).

the error case:
process_a(server)   process_b(client)
semget()
semctl(SETVAL)
semop()
 semget()
 setctl(IP_STAT)
 for(;;) {   <--not successful here
   check until sem_otime > 0
 }

Good catch:
Since commit 0a2b9d4c, wait-for-zero semops do not update sem_otime anymore.

Let's reverse that part of my commit and move the update of sem_otime 
back into perform_atomic_semop().


Jia: If perform_atomic_semop() updates sem_otime, then the update in 
do_smart_update() is not necessary anymore.
Thus the whole logic with passing around "semop_completed" can be 
removed, too.

Are you interested in writing that patch?



Why not..

(pokes evolution's don't-munge-me button)

ipc,sem: Create semaphores with plausible sem_otime.

Mike: no, your patch makes it worse:
- wait-for-zero semops still don't update sem_otime
- sem_otime is initialized to sem_ctime. That's not mentioned in the 
sysv standard.


--
Manfred


Re: [PATCH 0/4] ipc: shm and msg fixes

2013-09-24 Thread Manfred Spraul

Hi Linus,

On 09/24/2013 03:22 AM, Linus Torvalds wrote:

On Mon, Sep 23, 2013 at 5:04 PM, Davidlohr Bueso  wrote:

Ok, so here's the code - again I've tested it with LTP on the resources
I have.

This looks good to me.

Manfred, mind giving this a look-over and see if this resolves your
race concerns too?

All race concerns with regards to code outside ipc are resolved.

My current list of open issues:

https://bugzilla.kernel.org/show_bug.cgi?id=61351
Fix is in mm tree (ipc-semc-fix-race-in-sem_lock.patch)

https://bugzilla.kernel.org/show_bug.cgi?id=61321
https://bugzilla.kernel.org/show_bug.cgi?id=61331
https://bugzilla.kernel.org/show_bug.cgi?id=61341
All 3 are fixed by Davidlohr's patch

https://bugzilla.kernel.org/show_bug.cgi?id=61361
https://bugzilla.kernel.org/show_bug.cgi?id=61371
Both still open. The fix is trivial:
Sprinkle a fair amount of "if (perm.deleted) return -EIDRM;" after 
ipc_lock.


And now new:
1) ipc/namespace.c:
free_ipcs() still assumes the "old style" free calls:
rcu_lock and ipc_lock dropped within the callback.

freeary() was converted - but free_ipcs was not updated.

Thus:
Closing a namespace with sem arrays and threads that are waiting on 
the array with semtimedop() and bad timing can deadlock the semtimedop 
thread.

(i.e.: spin_lock() waiting forever).

2) ipc/sem.c:
The proc interface calls ipc_lock() directly - thus the exclusion 
of simple semop's is missing with sysvipc_sem_proc_show().
A "sem_wait_array()" might be added as the first line into 
sysvipc_sem_proc_show().


It's more a correctness thing: Nothing breaks if get_semotime() is 
called in parallel with simple ops.


3) The missing update of sem_otime for simple ops that Jia He found
http://marc.info/?l=linux-kernel&m=137981594522009&w=2

--
Manfred


Re: [PATCH 0/4] ipc: shm and msg fixes

2013-09-24 Thread Manfred Spraul

Hi all,

On 09/24/2013 02:04 AM, Davidlohr Bueso wrote:

(In reality, I suspect the reference count is never elevated in
practice, so there is only one case that calls the security freeing
thing, so this may all be pretty much theoretical, but at least from a
logic standpoint the code clearly makes a big deal about the whole
refcount and "last user turns off the lights").

Right, this would/should have come up years ago if it were actually
occurring in practice.


As far as I see, the only requirement is "last user does kfree()":
spin_lock must be possible and perm.deleted must be valid.

e.g. from msg.c:

rcu_read_lock();
ipc_lock_object(&msq->q_perm);

ipc_rcu_putref(msq);
if (msq->q_perm.deleted) {
err = -EIDRM;
goto out_unlock0;
}





8<-
From: Davidlohr Bueso 
Subject: [PATCH] ipc: fix race with LSMs

Currently, IPC mechanisms do security and auditing related checks under
RCU. However, since security modules can free the security structure, for
example, through selinux_[sem,msg_queue,shm]_free_security(), we can race
if the structure is freed before other tasks are done with it, creating a
use-after-free condition. Manfred illustrates this nicely, for instance with
shared mem and selinux:

--> do_shmat calls rcu_read_lock()
--> do_shmat calls shm_object_check().
  Checks that the object is still valid - but doesn't acquire any locks.
  Then it returns.
--> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
--> selinux_shm_shmat calls ipc_has_perm()
--> ipc_has_perm accesses ipc_perms->security

shm_close()
--> shm_close acquires rw_mutex & shm_lock
--> shm_close calls shm_destroy
--> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
--> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
--> ipc_free_security calls kfree(ipc_perms->security)

This patch delays the freeing of the security structures after all RCU readers
are done. Furthermore it aligns the security life cycle with that of the rest of
IPC - freeing them based on the reference counter. For situations where we need
not free security, the current behavior is kept. Linus states:

"... the old behavior was suspect for another reason too: having
the security blob go away from under a user sounds like it could cause
various other problems anyway, so I think the old code was at least
_prone_ to bugs even if it didn't have catastrophic behavior."

I have tested this patch with IPC testcases from LTP on both my quad-core
laptop and on a 64 core NUMA server. In both cases selinux is enabled, and
tests pass for both voluntary and forced preemption models. While the mentioned
races are theoretical (at least no one as reported them), I wanted to make
sure that this new logic doesn't break anything we weren't aware of.

Suggested-by: Linus Torvalds 
Signed-off-by: Davidlohr Bueso 

Signed-off-by: Manfred Spraul 



Re: [PATCH] ipc/sem.c: fix update sem_otime when calling sem_op in semaphore initialization

2013-09-24 Thread Manfred Spraul

On 09/22/2013 05:14 PM, Jia He wrote:

  Hi Manfred

On Sun, 22 Sep 2013 12:42:05 +0200 from manf...@colorfullife.com wrote:

Hi all,

On 09/22/2013 10:26 AM, Mike Galbraith wrote:

On Sun, 2013-09-22 at 10:17 +0200, Mike Galbraith wrote:

On Sun, 2013-09-22 at 10:11 +0800, Jia He wrote:

In commit 0a2b9d4c, the update of the semaphore's sem_otime (last semop time)
was removed because he wanted to move setting sem->sem_otime to one
place. But after that, the initial semop() will not set the otime
because its sem_op value is 0 (semtimedop() only updates otime
if alter == 1).

the error case:
process_a(server)   process_b(client)
semget()
semctl(SETVAL)
semop()
  semget()
  setctl(IP_STAT)
  for(;;) {   <--not successful here
check until sem_otime > 0
  }

Good catch:
Since commit 0a2b9d4c, wait-for-zero semops do not update sem_otime anymore.

Let's reverse that part of my commit and move the update of sem_otime back
into perform_atomic_semop().

Jia: If perform_atomic_semop() updates sem_otime, then the update in
do_smart_update() is not necessary anymore.
Thus the whole logic with passing arround "semop_completed" can be removed, too.
Are you interested in writing that patch?


perform_atomic_semop() cannot cover all the places where do_smart_update()
updates otime once I move that part over.
E.g. in semctl_setval()/exit_sem()/etc. it seems I need to write some
other code to update sem_otime outside perform_atomic_semop(). That
would still violate your original goal of letting one function do all the
otime updates.

No. The original goal was an optimization:
The common case is one semop that increases a semaphore value - and that 
allows another semop that is sleeping to proceed.

Before, this caused two get_seconds() calls.
The current (buggy) code avoids the 2nd call.


IMO, what if we just remove the alter condition in sys_semtimedop:
-if (alter && error == 0)
+   if (error == 0)
 do_smart_update(sma, sops, nsops, 1, &tasks);
In the old code, this would set the otime even when sem_op == 0.
do_smart_update() can be expensive - thus it shouldn't be called if we 
didn't change anything.


Attached is a proposed patch - fully untested. It is intended to be 
applied on top of Jia's patch.


_If_ the performance degradation is too large, then the alternative 
would be to set sem_otime directly in semtimedop() for successful 
non-alter operations.


--
Manfred
From 07da483abc88748a666f655f3e1d81a94df88e38 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Tue, 24 Sep 2013 22:53:27 +0200
Subject: [PATCH] ipc/sem.c: Simplify sem_otime handling.

The sem_otime handling tries to minimize the number of calls to
get_seconds().
This generates some complexity (i.e.: pass around if any operation
completed) and introduced one bug (See previous commit to ipc/sem.c).

Therefore: Remove the whole logic - this removes 25 lines.

Disadvantages:
- One line of code duplication in exit_sem():
  The function must now update sem_otime, it can't rely on
  do_smart_update_queue() to perform that task.
- Two get_seconds() calls instead of one call for the common
  case of increasing a semaphore and waking up a sleeping task.

TODO:
1) How fast is get_seconds()?
  Is the optimization worth the effort?
2) Test it.

---
 ipc/sem.c | 61 ++---
 1 file changed, 18 insertions(+), 43 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 8e01e76..5e5d7b1 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -713,15 +713,13 @@ static int check_restart(struct sem_array *sma, struct sem_queue *q)
  * semaphore.
  * The tasks that must be woken up are added to @pt. The return code
  * is stored in q->pid.
- * The function returns 1 if at least one operation was completed successfully.
  */
-static int wake_const_ops(struct sem_array *sma, int semnum,
+static void wake_const_ops(struct sem_array *sma, int semnum,
 struct list_head *pt)
 {
 	struct sem_queue *q;
 	struct list_head *walk;
 	struct list_head *pending_list;
-	int semop_completed = 0;
 
 	if (semnum == -1)
 		pending_list = &sma->pending_const;
@@ -742,13 +740,9 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
 			/* operation completed, remove from queue & wakeup */
 
 			unlink_queue(sma, q);
-
 			wake_up_sem_queue_prepare(pt, q, error);
-			if (error == 0)
-semop_completed = 1;
 		}
 	}
-	return semop_completed;
 }
 
 /**
@@ -761,13 +755,11 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
  * do_smart_wakeup_zero() checks all required queue for wait-for-zero
  * operations, based on the actual changes that were performed on the
  * semaphore array.
- * The function returns 1 if at least one operation was completed successfully.
  */
-static int

Re: [PATCH] ipc/sem.c: fix update sem_otime when calling sem_op in semaphore initialization

2013-09-24 Thread Manfred Spraul

Hi Jia,

On 09/25/2013 05:05 AM, Jia He wrote:

  Hi Manfred
IIUC, after reviewing your patch and the source code, it seems
sem_otime loses the chance to be updated when calling
semctl_main/semctl_setval?
In the old code, even with do_smart_update(sma, NULL, 0, 0, &tasks),
the otime could be updated after several condition checks.

The update is now performed inside perform_atomic_semop():

Old code:
perform_atomic_semop() does not update sem_otime. It just returns 0 for 
successful operations.
This "0 returned" is passed upwards ("semop_completed") into 
do_smart_update()

do_smart_update() updates sem_otime.

New code:
perform_atomic_semop() updates sem_otime immediately (your change).
No need to keep track if a waiting operation was completed (my change).
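
As a rough sketch of the two flows (simplified, not literal kernel code):

/* old flow (sketch): perform_atomic_semop() only reports success */
error = perform_atomic_semop(sma, sops, nsops, un, pid);
/* ... "error == 0" is propagated as semop_completed ... */
do_smart_update(sma, sops, nsops, semop_completed, &tasks);
/* do_smart_update() finally sets sem_otime */

/* new flow (sketch): the otime update happens right where the op succeeds */
error = perform_atomic_semop(sma, sops, nsops, un, pid);
/* inside perform_atomic_semop(), on success:
 *     curr->sem_otime = get_seconds();
 * so do_smart_update() no longer needs the semop_completed bookkeeping */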

I don't see a problem - perhaps I overlook something.
Which problem do you see?

--
Manfred



[PATCH] ipc/sem.c: Synchronize the proc interface

2013-09-25 Thread Manfred Spraul
The proc interface is not aware of sem_lock(); it calls
ipc_lock_object() directly.
This means that simple semop() operations can run in parallel with the
proc interface.
Right now this is not critical, because the implementation doesn't
do anything that requires proper synchronization.

But it is dangerous and therefore should be fixed.

Andrew:
- Could you include the patch in -mm and push it towards Linus?
- The patch depends on ipc-semc-fix-race-in-sem_lock.patch

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 4a92c04..e5d9bb8 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2095,6 +2095,14 @@ static int sysvipc_sem_proc_show(struct seq_file *s, 
void *it)
struct sem_array *sma = it;
time_t sem_otime;
 
+   /*
+* The proc interface isn't aware of sem_lock(), it calls
+* ipc_lock_object() directly (in sysvipc_find_ipc).
+* In order to stay compatible with sem_lock(), we must wait until
+* all simple semop() calls have left their critical regions.
+*/
+   sem_wait_array(sma);
+
sem_otime = get_semotime(sma);
 
return seq_printf(s,
-- 
1.8.3.1



Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO

2013-06-14 Thread Manfred Spraul

Hi all,

On 06/10/2013 07:16 PM, Manfred Spraul wrote:

Hi Andrew,

I have cleaned up/improved my updates to sysv sem.
Could you replace my patches in -akpm with this series?

- 1: cacheline align output from ipc_rcu_alloc
- 2: cacheline align semaphore structures
- 3: seperate-wait-for-zero-and-alter-tasks
- 4: Always-use-only-one-queue-for-alter-operations
- 5: Replace the global sem_otime with a distributed otime
- 6: Rename-try_atomic_semop-to-perform_atomic

Just to keep everyone updated:
I have updated my testapp:
https://github.com/manfred-colorfu/ipcscale/blob/master/sem-waitzero.cpp

Something like this gives a nice output:

# sem-waitzero -t 5 -m 0 | grep 'Cpus' | gawk '{printf("%f - %s\n",$7/$2,$0);}' | sort -n -r


The first number is the number of operations per cpu during 5 seconds.

Mike was kind enough to run it on a 32-core (4-socket) Intel system:
- master doesn't scale at all when multiple sockets are used:
interleave 4: (i.e.: use cpu 0, then 4, then 8 (2nd socket), then 12):
34,717586.00 - Cpus 1, interleave 4 delay 0: 34717586 in 5 secs
24,507337.50 - Cpus 2, interleave 4 delay 0: 49014675 in 5 secs
 3,487540.00 - Cpus 3, interleave 4 delay 0: 10462620 in 5 secs
 2,708145.00 - Cpus 4, interleave 4 delay 0: 10832580 in 5 secs
interleave 8: (i.e.: use cpu 0, then 8 (2nd socket)):
34,587329.00 - Cpus 1, interleave 8 delay 0: 34587329 in 5 secs
 7,746981.50 - Cpus 2, interleave 8 delay 0: 15493963 in 5 secs

- with my patches applied, it scales linearly - but only sometimes.
Example for good scaling (18 threads in parallel - linear scaling):
33,928616.11 - Cpus 18, interleave 8 delay 0: 610715090 in 
5 secs

Example for bad scaling:
5,829109.60 - Cpus 5, interleave 8 delay 0: 29145548 in 5 secs

For me, it looks like a livelock somewhere:
Good example: all threads contribute the same amount to the final result:

Result matrix:
  Thread   0: 33476433
  Thread   1: 33697100
  Thread   2: 33514249
  Thread   3: 33657413
  Thread   4: 33727959
  Thread   5: 33580684
  Thread   6: 33530294
  Thread   7: 33666761
  Thread   8: 33749836
  Thread   9: 32636493
  Thread  10: 33550620
  Thread  11: 33403314
  Thread  12: 33594457
  Thread  13: 1920
  Thread  14: 33503588
  Thread  15: 33585348
Cpus 16, interleave 8 delay 0: 536206469 in 5 secs

Bad example: one thread is as fast as it should be, others are slow:

Result matrix:
  Thread   0: 31629540
  Thread   1:  5336968
  Thread   2:  6404314
  Thread   3:  9190595
  Thread   4:  9681006
  Thread   5:  9935421
  Thread   6:  9424324
Cpus 7, interleave 8 delay 0: 81602168 in 5 secs


The results are not stable: the same test is sometimes fast, sometimes slow.
I have no idea where the livelock could be and I wasn't able to notice 
anything on my i3 laptop.


Thus: Who has an idea?
What I can say is that the livelock can't be in do_smart_update(): The 
function is never called.


--
Manfred



Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO

2013-06-14 Thread Manfred Spraul

On 06/14/2013 09:05 PM, Mike Galbraith wrote:

32 of 64 cores DL980 without the -rt killing goto again loop removal I
showed you.  Unstable, not wonderful throughput.

Unfortunately the -rt approach is definitely unstable:

@@ -285,9 +274,29 @@ static inline int sem_lock(struct sem_ar
 * but have to wait for the global lock to be released.
 */
if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
-   spin_unlock(&sem->lock);
- spin_unlock_wait(&sma->sem_perm.lock);
-   goto again;
+   spin_lock(&sma->sem_perm.lock);
+   if (sma->complex_count)
+   goto wait_array;
+
+   /*
+* Acquiring our sem->lock under the global lock
+* forces new complex operations to wait for us
+* to exit our critical section.
+*/
+   spin_lock(&sem->lock);
+   spin_unlock(&sma->sem_perm.lock);


Assume there is one op (semctl(), whatever) that acquires the global 
lock - and a continuous stream of simple ops.

- spin_is_locked() returns true due to the semctl().
- then simple ops will switch to spin_lock(&sma->sem_perm.lock).
- since the spinlock is acquired, the next operation will get true from 
spin_is_locked().


It will keep going around like that - as long as there is at least one op
waiting for sma->sem_perm.lock.

With enough cpus, it will stay like this forever.
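
One possible interleaving, sketched:

/*
 * cpu 0 (complex op / semctl)      cpus 1..n (stream of simple ops)
 * -------------------------------  ------------------------------------
 * spin_lock(&sma->sem_perm.lock)
 *                                  spin_is_locked(&sma->sem_perm.lock) -> true
 *                                  spin_lock(&sma->sem_perm.lock)   [queues up]
 * spin_unlock(&sma->sem_perm.lock)
 *                                  a waiting simple op now owns sem_perm.lock;
 *                                  the next simple op again sees
 *                                  spin_is_locked() == true and also takes the
 *                                  slow path - with enough cpus there is always
 *                                  at least one waiter, so nobody ever returns
 *                                  to the per-semaphore fast path
 */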

--
Manfred


Re: [PATCH 0/6] ipc/sem.c: performance improvements, FIFO

2013-06-15 Thread Manfred Spraul

On 06/14/2013 09:05 PM, Mike Galbraith wrote:

# Events: 802K cycles
#
# Overhead  Symbol
#   ..
#
 18.42%  [k] SYSC_semtimedop
 15.39%  [k] sem_lock
 10.26%  [k] _raw_spin_lock
  9.00%  [k] perform_atomic_semop
  7.89%  [k] system_call
  7.70%  [k] ipc_obtain_object_check
  6.95%  [k] ipcperms
  6.62%  [k] copy_user_generic_string
  4.16%  [.] __semop
  2.57%  [.] worker_thread(void*)
  2.30%  [k] copy_from_user
  1.75%  [k] sem_unlock
  1.25%  [k] ipc_obtain_object

~ 280 million ops.
2.3% copy_from_user,
9% perform_atomic_semop.


# Events: 802K cycles
#
# Overhead   Symbol
#   ...
#
 17.38%  [k] SYSC_semtimedop
 13.26%  [k] system_call
 11.31%  [k] copy_user_generic_string
  7.62%  [.] __semop
  7.18%  [k] _raw_spin_lock
  5.66%  [k] ipcperms
  5.40%  [k] sem_lock
  4.65%  [k] perform_atomic_semop
  4.22%  [k] ipc_obtain_object_check
  4.08%  [.] worker_thread(void*)
  4.06%  [k] copy_from_user
  2.40%  [k] ipc_obtain_object
  1.98%  [k] pid_vnr
  1.45%  [k] wake_up_sem_queue_do
  1.39%  [k] sys_semop
  1.35%  [k] sys_semtimedop
  1.30%  [k] sem_unlock
  1.14%  [k] security_ipc_permission

~ 700 million ops.
4% copy_from_user -> as expected, a bit more.
4.6% perform_atomic_semop -> less.

Thus: Could you send the oprofile output from perform_atomic_semop()?

Perhaps that gives us a hint.

My current guess:
sem_lock() somehow ends up in lock_array.
Lock_array scans all struct sem -> transfer of that cacheline from all
cpus to the cpu that does the lock_array.
Then the next write by the "correct" cpu causes a transfer back when 
setting sem->pid.


--
Manfred


[PATCH] ipc/sem.c: scan complex wait-for-zero after undefined updates

2013-06-15 Thread Manfred Spraul
After an update of a semaphore array that does not use struct sembuf,
complex wait-for-zero operations were not checked to see whether they
could proceed.

Andrew: Could you add it to -mm?
I've introduced the bug with
   ipc-sem-separate-wait-for-zero-and-alter-tasks-into-seperate-queues.patch

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f9d1c06..ad9daca 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -785,8 +785,10 @@ static int do_smart_wakeup_zero(struct sem_array *sma, 
struct sembuf *sops,
 * Assume all were changed.
 */
for (i = 0; i < sma->sem_nsems; i++) {
-   if (sma->sem_base[i].semval == 0)
+   if (sma->sem_base[i].semval == 0) {
+   got_zero = 1;
semop_completed |= wake_const_ops(sma, i, pt);
+   }
}
}
/*
-- 
1.8.1.4



Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues

2013-06-01 Thread Manfred Spraul

Hi Rik,

On 05/27/2013 07:57 PM, Rik van Riel wrote:

On 05/26/2013 05:08 AM, Manfred Spraul wrote:

Introduce separate queues for operations that do not modify the
semaphore values.
Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations
   are always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9


Whether this complexity is wanted is not for
me to decide, as I am not the ipc/sem.c
maintainer. I'll leave that up to Andrew and Linus.


We can have only one: Either more logic or unoptimized loops.
But I don't think that the complexity increases that much, e.g. some 
parts (e.g. check_restart()) get much simpler.


But:
Mike Galbraith ran 3.10-rc3 with and without my changes on a 4-socket
64-core system, and for me the results appear to be quite slow:
- semop-multi 256 64: around 600.000 ops/sec, both with and without my 
additional patches [difference around 1%]
That is slower than my 1.4 GHz i3 with 3.9 - I get around 1.000.000 
ops/sec

Is that expected?
My only idea would be thrashing from writing sma->sem_otime.

- osim [i.e.: with reschedules] is much slower: around 21 us per schedule.
Perhaps the scheduler didn't pair the threads optimally: intra-cpu 
reschedules

take around 2 us on my i3, inter-cpu reschedules around 16 us.

Thus I have attached my test apps.
- psem: psem tests sleeping semaphore operations.
Pairs of two threads perform ping-pong operations, starting with 1 
semaphore and increasing up to the given max.
Either bound to the same cpu ("intra-cpu") or bound to different 
cpus ("inter-cpu").
Inter-cpu is hardcoded, probably always a different socket 
(distance max_cpus/2).


- semscale performs operations that never block, i.e. like your
semop-multi.c.

It does:
- add delays in user space, to figure out the maximum number of
operations possible when user space also does something.

- use interleaving, to force the threads onto different cores/sockets.

Perhaps something in 3.0.10-rc3 breaks the scalability?

--
Manfred
/*
 * psem.cpp, parallel sysv sem pingpong
 *
 * Copyright (C) 1999, 2001, 2005, 2008 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header$
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#ifdef __sun
	 #include <sys/procset.h>	/* P_PID */
	 #include <sys/processor.h>	/* processor_bind() */
#endif

#undef VERBOSE

//

static enum {
	WAITING,
	RUNNING,
	STOPPED,
} volatile g_state = WAITING;

unsigned long long *g_results;
int *g_svsem_ids;
int *g_svsem_nrs;
pthread_t *g_threads;

struct taskinfo {
	int svsem_id;
	int svsem_nr;
	int threadid;
	int cpubind;
	int sender;
};

void bind_cpu(int cpunr)
{
#if __sun
	int ret;
	ret = processor_bind(P_PID, getpid(), cpunr, NULL);
	if (ret == -1) {
		perror("bind_thread:processor_bind");
		printf(" Binding to cpu %d failed.\n", cpunr);
	}
#else
	int ret;
	cpu_set_t cpus;
	cpu_set_t v;
	CPU_ZERO(&cpus);
	CPU_SET(cpunr, &cpus);
	pthread_t self;

	self = pthread_self();

	ret = pthread_setaffinity_np(self, sizeof(cpus), &cpus);
	if (ret < 0) {
		printf("pthread_setaffinity_np failed for thread %p with errno %d.\n",
(void*)self, errno);
	}

	ret = pthread_getaffinity_np(self, sizeof(v), &v);
	if (ret < 0) {
		printf("pthread_getaffinity_np() failed for thread %p with errno %d.\n",
(void*)self, errno);
		fflush(stdout);
	}
	if (memcmp(&v, &cpus, sizeof(cpus)) != 0) {
		printf("Note: Actual affinity does not match intention: got 0x%08lx, expected 0x%08lx.\n",
			(unsigned long)v.__bits[0], (unsigned long)cpus.__bits[0]);
	}
	fflush(stdout);
#endif
}
#define DATASIZE	8

void* worker_thread(void *arg)
{
	struct taskinfo *ti = (struct taskinfo*)arg;
	unsigned long long rounds;
	int ret;

	bind_cpu(ti->cpubind);
#ifdef VERBOSE
	printf("thread %d: sysvsem %8d, off %8d type %d bound to cpu %d\n",ti->threadid,
			ti->svsem_id, ti->svsem_nr,
			ti->sender, ti->cpubind);
#endif
	
	rounds = 0;
	while(g_state == WAITING) {
#ifdef __GNUC__
#if defined(__i386__) || defined (__x86_64__)
		__asm__ __volatile__("pause": : :"memory");
#else
		__asm__ __volatile__("": : :"memory");
#endif
#endif
	}

	if (ti->sender) {
		struct sembuf sop[1];

		/* 1) insert token */
		sop[0].sem_num=ti->svsem_nr+0;
		sop[0].sem_op=1;
		sop[0].sem_flg=0;
		ret = semop(ti->svsem_id,sop,1);
	
		if (ret != 0) {
			printf("Initial semop failed, ret %d, errno %d.\n", ret, errno);
			exit(1);
		}
	}
	while(g_state == RUNNING) {
		struct 

Re: [PATCH 2/4] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues

2013-06-01 Thread Manfred Spraul

Hi all,

On 06/01/2013 11:20 AM, Manfred Spraul wrote:


- osim [i.e.: with reschedules] is much slower: around 21 us per 
schedule.
Perhaps the scheduler didn't pair the threads optimally: intra-cpu 
reschedules

take around 2 us on my i3, inter-cpu reschedules around 16 us.


I mixed up numbers.
osim reports around 1.6 us for the 64-thread system.
It is still odd that it is only a factor of 1.5 faster than my 4-thread i3,
but at least it is not slower.


Anyway: I have attached the table of all results that I have so far.

--
Manfred


Eval-ipc.ods
Description: application/vnd.oasis.opendocument.spreadsheet


sem_otime trashing

2013-06-01 Thread Manfred Spraul

Hi Rik,

I finally managed to get EFI boot, i.e. I'm now able to test on my i3 
(2core+HT).


With semscale (i.e.: just overhead, perform semop=0 operations), the 
scalability from 1 to 2 cores is good, but not linear:

# semscale 10 | grep "interleave 2"

Cpus 1, interleave 2 delay 0: 35502103 in 10 secs
Cpus 2, interleave 2 delay 0: 53990954 in 10 secs

---
 +53% when adding the 2nd core
(interleave 2 to force the use of different cores)

Did you consider moving sem_otime into the individual semaphores?
I did that (gross patch attached), and the performance is significantly 
better:


# semscale 10 | grep "interleave 2"
Cpus 1, interleave 2 delay 0: 35585634 in 10 secs
Cpus 2, interleave 2 delay 0: 70410230 in 10 secs
 ---
+99% scalability when adding the 2nd core

Unfortunately I won't be able to read my mails next week, but the effect 
was too significant not to share it immediately.


--
Manfred
diff --git a/Makefile b/Makefile
index 73e20db..42137ab 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 10
 SUBLEVEL = 0
-EXTRAVERSION = -rc3
+EXTRAVERSION = -rc3-otime
 NAME = Unicycling Gorilla
 
 # *DOCUMENTATION*
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 55e17f6..976ce3a 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -12,7 +12,6 @@ struct task_struct;
 struct sem_array {
struct kern_ipc_permcacheline_aligned_in_smp
sem_perm;   /* permissions .. see ipc.h */
-   time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
struct list_headpending_alter;  /* pending operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 1dbb2fa..e5f000f 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,6 +92,7 @@
 
 /* One semaphore structure for each semaphore in the system. */
 struct sem {
+   charfiller[64];
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
@@ -99,7 +100,8 @@ struct sem {
/* that alter the semaphore */
struct list_head pending_const; /* pending single-sop operations */
/* that do not alter the semaphore*/
-};
+   time_t  sem_otime;  /* candidate for sem_otime */
+} cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
@@ -919,8 +921,14 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
}
}
}
-   if (otime)
-   sma->sem_otime = get_seconds();
+   if (otime) {
+   if (sops == NULL) {
+   sma->sem_base[0].sem_otime = get_seconds();
+   } else {
+   sma->sem_base[sops[0].sem_num].sem_otime =
+   get_seconds();
+   }
+   }
 }
 
 
@@ -1066,6 +1074,21 @@ static unsigned long copy_semid_to_user(void __user 
*buf, struct semid64_ds *in,
}
 }
 
+static time_t get_semotime(struct sem_array *sma)
+{
+   int i;
+   time_t res;
+
+   res = sma->sem_base[0].sem_otime;
+   for (i = 1; i < sma->sem_nsems; i++) {
+   time_t to = sma->sem_base[i].sem_otime;
+
+   if (to > res)
+   res = to;
+   }
+   return res;
+}
+
 static int semctl_nolock(struct ipc_namespace *ns, int semid,
 int cmd, int version, void __user *p)
 {
@@ -1139,9 +1162,9 @@ static int semctl_nolock(struct ipc_namespace *ns, int 
semid,
goto out_unlock;
 
kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-   tbuf.sem_otime  = sma->sem_otime;
-   tbuf.sem_ctime  = sma->sem_ctime;
-   tbuf.sem_nsems  = sma->sem_nsems;
+   tbuf.sem_otime = get_semotime(sma);
+   tbuf.sem_ctime = sma->sem_ctime;
+   tbuf.sem_nsems = sma->sem_nsems;
rcu_read_unlock();
if (copy_semid_to_user(p, &tbuf, version))
return -EFAULT;
@@ -2029,6 +2052,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it)
 {
struct user_namespace *user_ns = seq_user_ns(s);
struct sem_array *sma = it;
+   time_t sem_otime;
+
+   sem_otime = get_semotime(sma);
 
return seq_printf(s,
  "%10d %10d  %4o %10u %5u %5u %5u %5u %10lu %10lu\n",
@@ -2040,7 +2066,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it)
  from_kgid_munged(user_ns, sma->sem_perm.gid),
  from_kuid_munged(user_ns, sma

[PATCH 5/6] ipc/sem.c: Replace shared sem_otime with per-semaphore value

2013-06-10 Thread Manfred Spraul
sem_otime contains the time of the last semaphore operation
that completed successfully. Every operation updates this
value, thus access from multiple cpus can cause thrashing.

Therefore the patch replaces the variable with a per-semaphore
variable. The per-array sem_otime is only calculated when required.

No performance improvement on a single-socket i3 - only important
for larger systems.

Signed-off-by: Manfred Spraul 
---
 include/linux/sem.h |  1 -
 ipc/sem.c   | 37 +++--
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 55e17f6..976ce3a 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -12,7 +12,6 @@ struct task_struct;
 struct sem_array {
struct kern_ipc_permcacheline_aligned_in_smp
sem_perm;   /* permissions .. see ipc.h */
-   time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
struct list_headpending_alter;  /* pending operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index dcf99ef..e6d21f6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -99,6 +99,7 @@ struct sem {
/* that alter the semaphore */
struct list_head pending_const; /* pending single-sop operations */
/* that do not alter the semaphore*/
+   time_t  sem_otime;  /* candidate for sem_otime */
 } cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
@@ -909,8 +910,14 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
}
}
}
-   if (otime)
-   sma->sem_otime = get_seconds();
+   if (otime) {
+   if (sops == NULL) {
+   sma->sem_base[0].sem_otime = get_seconds();
+   } else {
+   sma->sem_base[sops[0].sem_num].sem_otime =
+   get_seconds();
+   }
+   }
 }
 
 
@@ -1056,6 +1063,21 @@ static unsigned long copy_semid_to_user(void __user 
*buf, struct semid64_ds *in,
}
 }
 
+static time_t get_semotime(struct sem_array *sma)
+{
+   int i;
+   time_t res;
+
+   res = sma->sem_base[0].sem_otime;
+   for (i = 1; i < sma->sem_nsems; i++) {
+   time_t to = sma->sem_base[i].sem_otime;
+
+   if (to > res)
+   res = to;
+   }
+   return res;
+}
+
 static int semctl_nolock(struct ipc_namespace *ns, int semid,
 int cmd, int version, void __user *p)
 {
@@ -1129,9 +1151,9 @@ static int semctl_nolock(struct ipc_namespace *ns, int 
semid,
goto out_unlock;
 
kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-   tbuf.sem_otime  = sma->sem_otime;
-   tbuf.sem_ctime  = sma->sem_ctime;
-   tbuf.sem_nsems  = sma->sem_nsems;
+   tbuf.sem_otime = get_semotime(sma);
+   tbuf.sem_ctime = sma->sem_ctime;
+   tbuf.sem_nsems = sma->sem_nsems;
rcu_read_unlock();
if (copy_semid_to_user(p, &tbuf, version))
return -EFAULT;
@@ -2019,6 +2041,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it)
 {
struct user_namespace *user_ns = seq_user_ns(s);
struct sem_array *sma = it;
+   time_t sem_otime;
+
+   sem_otime = get_semotime(sma);
 
return seq_printf(s,
  "%10d %10d  %4o %10u %5u %5u %5u %5u %10lu %10lu\n",
@@ -2030,7 +2055,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it)
  from_kgid_munged(user_ns, sma->sem_perm.gid),
  from_kuid_munged(user_ns, sma->sem_perm.cuid),
  from_kgid_munged(user_ns, sma->sem_perm.cgid),
- sma->sem_otime,
+ sem_otime,
  sma->sem_ctime);
 }
 #endif
-- 
1.8.1.4



[PATCH 6/6] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update

2013-06-10 Thread Manfred Spraul
Cleanup: Some minor points that I noticed while writing the
previous patches

1) The name try_atomic_semop() is misleading: The function performs the
   operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index e6d21f6..f9d1c06 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -154,12 +154,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define SEMOPM_FAST64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  * sem_undo.id_next,
+ * sem_array.complex_count,
  * sem_array.pending{_alter,_cont},
- * sem_array.sem_undo: sem_lock() for read/write
+ * sem_array.sem_undo: global sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
+ * sem_array.sem_base[i].pending_{const,alter}:
+ * global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -536,12 +539,19 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, 
semflg)
return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
 }
 
-/*
- * Determine whether a sequence of semaphore operations would succeed
- * all at once. Return 0 if yes, 1 if need to sleep, else return error code.
+/** perform_atomic_semop - Perform (if possible) a semaphore operation
+ * @sma: semaphore array
+ * @sops: array with operations that should be checked
+ * @nsems: number of sops
+ * @un: undo array
+ * @pid: pid that did the change
+ *
+ * Returns 0 if the operation was possible.
+ * Returns 1 if the operation is impossible, the caller must sleep.
+ * Negative values are error codes.
  */
 
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
 int nsops, struct sem_undo *un, int pid)
 {
int result, sem_op;
@@ -724,8 +734,8 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
q = container_of(walk, struct sem_queue, list);
walk = walk->next;
 
-   error = try_atomic_semop(sma, q->sops, q->nsops,
-   q->undo, q->pid);
+   error = perform_atomic_semop(sma, q->sops, q->nsops,
+q->undo, q->pid);
 
if (error <= 0) {
/* operation completed, remove from queue & wakeup */
@@ -836,7 +846,7 @@ again:
if (semnum != -1 && sma->sem_base[semnum].semval == 0)
break;
 
-   error = try_atomic_semop(sma, q->sops, q->nsops,
+   error = perform_atomic_semop(sma, q->sops, q->nsops,
 q->undo, q->pid);
 
/* Does q->sleeper still need to sleep? */
@@ -1680,7 +1690,6 @@ static int get_queue_result(struct sem_queue *q)
return error;
 }
 
-
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1778,7 +1787,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
if (un && un->semid == -1)
goto out_unlock_free;
 
-   error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+   error = perform_atomic_semop(sma, sops, nsops, un,
+   task_tgid_vnr(current));
if (error <= 0) {
if (alter && error == 0)
do_smart_update(sma, sops, nsops, 1, &tasks);
-- 
1.8.1.4



[PATCH 4/6] ipc/sem.c: Always use only one queue for alter operations.

2013-06-10 Thread Manfred Spraul
There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
prioritization of the wakeups:
Right now, complex operations have priority over simple ops.

The patch restores the behavior of linux <=3.0.9: The longest
waiting operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.

As a side effect, do_smart_update_queue() becomes much simpler:
No more goto logic.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 128 ++
 1 file changed, 88 insertions(+), 40 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index e7f3d64..dcf99ef 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -192,6 +192,53 @@ void __init sem_init (void)
IPC_SEM_IDS, sysvipc_sem_proc_show);
 }
 
+/**
+ * unmerge_queues - unmerge queues, if possible.
+ * @sma: semaphore array
+ *
+ * The function unmerges the wait queues if complex_count is 0.
+ * It must be called prior to dropping the global semaphore array lock.
+ */
+static void unmerge_queues(struct sem_array *sma)
+{
+   struct sem_queue *q, *tq;
+
+   /* complex operations still around? */
+   if (sma->complex_count)
+   return;
+   /*
+* We will switch back to simple mode.
+* Move all pending operation back into the per-semaphore
+* queues.
+*/
+   list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
+   struct sem *curr;
+   curr = &sma->sem_base[q->sops[0].sem_num];
+
+   list_add_tail(&q->list, &curr->pending_alter);
+   }
+   INIT_LIST_HEAD(&sma->pending_alter);
+}
+
+/**
+ * merge_queues - Merge single semop queues into global queue
+ * @sma: semaphore array
+ *
+ * This function merges all per-semaphore queues into the global queue.
+ * It is necessary to achieve FIFO ordering for the pending single-sop
+ * operations when a multi-semop operation must sleep.
+ * Only the alter operations must be moved, the const operations can stay.
+ */
+static void merge_queues(struct sem_array *sma)
+{
+   int i;
+   for (i = 0; i < sma->sem_nsems; i++) {
+   struct sem *sem = sma->sem_base + i;
+
+   list_splice_init(&sem->pending_alter, &sma->pending_alter);
+   }
+}
+
 /*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
@@ -262,6 +309,7 @@ static inline int sem_lock(struct sem_array *sma, struct 
sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
if (locknum == -1) {
+   unmerge_queues(sma);
spin_unlock(&sma->sem_perm.lock);
} else {
struct sem *sem = sma->sem_base + locknum;
@@ -829,49 +877,38 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
int otime, struct list_head *pt)
 {
int i;
-   int progress;
 
otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
 
-   progress = 1;
-retry_global:
-   if (sma->complex_count) {
-   if (update_queue(sma, -1, pt)) {
-   progress = 1;
-   otime = 1;
-   sops = NULL;
-   }
-   }
-   if (!progress)
-   goto done;
-
-   if (!sops) {
-   /* No semops; something special is going on. */
-   for (i = 0; i < sma->sem_nsems; i++) {
-   if (update_queue(sma, i, pt)) {
-   otime = 1;
-   progress = 1;
+   if (!list_empty(&sma->pending_alter)) {
+   /* semaphore array uses the global queue - just process it. */
+   otime |= update_queue(sma, -1, pt);
+   } else {
+   if (!sops) {
+   /*
+* No sops, thus the modified semaphores are not
+* known. Check all.
+*/
+   for (i = 0; i < sma->sem_nsems; i++)
+   otime |= update_queue(sma, i, pt);
+   } else {
+   /*
+* Check the semaphores that were increased:
+* - No complex ops, thus all sleeping ops are
+*   decrease.
+* - if we decreased the value, then any sleeping
+*   semaphore ops wont be able to run: If the
+*   previous value was too small, then the new
+

[PATCH 3/6] ipc/sem: seperate wait-for-zero and alter tasks into seperate queues

2013-06-10 Thread Manfred Spraul
Introduce separate queues for operations that do not modify the
semaphore values.
Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations
  are always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9

Signed-off-by: Manfred Spraul 
---
 include/linux/sem.h |   5 +-
 ipc/sem.c   | 209 +---
 2 files changed, 153 insertions(+), 61 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..55e17f6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,7 +15,10 @@ struct sem_array {
time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
-   struct list_headsem_pending;/* pending operations to be 
processed */
+   struct list_headpending_alter;  /* pending operations */
+   /* that alter the array */
+   struct list_headpending_const;  /* pending complex operations */
+   /* that do not alter semvals */
struct list_headlist_id;/* undo requests on this array 
*/
int sem_nsems;  /* no. of semaphores in array */
int complex_count;  /* pending complex operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 1afbc57..e7f3d64 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
-   struct list_head sem_pending; /* pending single-sop operations */
+   struct list_head pending_alter; /* pending single-sop operations */
+   /* that alter the semaphore */
+   struct list_head pending_const; /* pending single-sop operations */
+   /* that do not alter the semaphore*/
 } cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
@@ -152,7 +155,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 /*
  * linked list protection:
  * sem_undo.id_next,
- * sem_array.sem_pending{,last},
+ * sem_array.pending{_alter,_cont},
  * sem_array.sem_undo: sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
@@ -337,7 +340,7 @@ static inline void sem_rmid(struct ipc_namespace *ns, 
struct sem_array *s)
  * Without the check/retry algorithm a lockless wakeup is possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- * * unlinking the queue entry from sma->sem_pending
+ * * unlinking the queue entry from the pending list
  * * setting queue.status to IN_WAKEUP
  *   This is the notification for the blocked thread that a
  *   result value is imminent.
@@ -418,12 +421,14 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
sma->sem_base = (struct sem *) &sma[1];
 
for (i = 0; i < nsems; i++) {
-   INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+   INIT_LIST_HEAD(&sma->sem_base[i].pending_alter);
+   INIT_LIST_HEAD(&sma->sem_base[i].pending_const);
spin_lock_init(&sma->sem_base[i].lock);
}
 
sma->complex_count = 0;
-   INIT_LIST_HEAD(&sma->sem_pending);
+   INIT_LIST_HEAD(&sma->pending_alter);
+   INIT_LIST_HEAD(&sma->pending_const);
INIT_LIST_HEAD(&sma->list_id);
sma->sem_nsems = nsems;
sma->sem_ctime = get_seconds();
@@ -609,60 +614,130 @@ static void unlink_queue(struct sem_array *sma, struct 
sem_queue *q)
  * update_queue is O(N^2) when it restarts scanning the whole queue of
  * waiting operations. Therefore this function checks if the restart is
  * really necessary. It is called after a previously waiting operation
- * was completed.
+ * modified the array.
+ * Note that wait-for-zero operations are handled without restart.
  */
 static int check_restart(struct sem_array *sma, struct sem_queue *q)
 {
-   struct sem *curr;
-   struct sem_queue *h;
-
-   /* if the operation didn't modify the array, then no restart */
-   if (q->alter == 0)
-   return 0;
-
-   /* pending complex operations are too difficult to analyse */
-   if (sma->complex_count)
+   /* pending complex alter operations are too difficult to analyse */
+   if (!list_empty(&sma->pending_alter))
return 1;
 
   

[PATCH 1/6] ipc/util.c, ipc_rcu_alloc: cacheline align allocation

2013-06-10 Thread Manfred Spraul
Enforce that ipc_rcu_alloc returns a cacheline aligned pointer on SMP.

Rationale:
The SysV sem code tries to move the main spinlock into a separate cacheline
(cacheline_aligned_in_smp). This works only if ipc_rcu_alloc returns
cacheline aligned pointers.
vmalloc and kmalloc return cacheline aligned pointers; the implementation
of ipc_rcu_alloc breaks that.

Andrew: Could you merge it into -akpm and then forward it towards Linus' tree?

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 809ec5e..9623c8e 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -469,9 +469,7 @@ void ipc_free(void* ptr, int size)
 struct ipc_rcu {
struct rcu_head rcu;
atomic_t refcount;
-   /* "void *" makes sure alignment of following data is sane. */
-   void *data[0];
-};
+} cacheline_aligned_in_smp;
 
 /**
  * ipc_rcu_alloc   -   allocate ipc and rcu space 
@@ -489,12 +487,14 @@ void *ipc_rcu_alloc(int size)
if (unlikely(!out))
return NULL;
atomic_set(&out->refcount, 1);
-   return out->data;
+   return out+1;
 }
 
 int ipc_rcu_getref(void *ptr)
 {
-   return atomic_inc_not_zero(&container_of(ptr, struct ipc_rcu, 
data)->refcount);
+   struct ipc_rcu *p = ((struct ipc_rcu*)ptr)-1;
+
+   return atomic_inc_not_zero(&p->refcount);
 }
 
 /**
@@ -508,7 +508,7 @@ static void ipc_schedule_free(struct rcu_head *head)
 
 void ipc_rcu_putref(void *ptr)
 {
-   struct ipc_rcu *p = container_of(ptr, struct ipc_rcu, data);
+   struct ipc_rcu *p = ((struct ipc_rcu*)ptr)-1;
 
if (!atomic_dec_and_test(&p->refcount))
return;
-- 
1.8.1.4



[PATCH 0/6] ipc/sem.c: performance improvements, FIFO

2013-06-10 Thread Manfred Spraul
Hi Andrew,

I have cleaned up/improved my updates to sysv sem.
Could you replace my patches in -akpm with this series?

- 1: cacheline align output from ipc_rcu_alloc
- 2: cacheline align semaphore structures
- 3: seperate-wait-for-zero-and-alter-tasks
- 4: Always-use-only-one-queue-for-alter-operations
- 5: Replace the global sem_otime with a distributed otime
- 6: Rename-try_atomic_semop-to-perform_atomic

The first 2 patches result up to a ~30% performance improvement on a 2-core i3.

3,4 remove unnecessary loops in do_smart_update() and restore FIFO behavior.

I would expect that the 5th one will cause an improvement for multi-socket
systems, but I don't have a test setup.

6 is just a cleanup/function rename.

--
Manfred


[PATCH 2/6] ipc/sem.c: cacheline align the semaphore structures

2013-06-10 Thread Manfred Spraul
As each semaphore now has its own spinlock and parallel operations
are possible, give each semaphore its own cacheline.

On a i3 laptop, this gives up to 28% better performance:

#semscale 10 | grep "interleave 2"
- before:
Cpus 1, interleave 2 delay 0: 36109234 in 10 secs
Cpus 2, interleave 2 delay 0: 55276317 in 10 secs
Cpus 3, interleave 2 delay 0: 62411025 in 10 secs
Cpus 4, interleave 2 delay 0: 81963928 in 10 secs

-after:
Cpus 1, interleave 2 delay 0: 35527306 in 10 secs
Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28%
Cpus 3, interleave 2 delay 0: 80518538 in 10 secs
Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7%

i3, with 2 cores and with hyperthreading enabled.
Interleave 2 in order to use the full cores first.
HT partially hides the delay from cacheline thrashing, thus
the improvement is "only" 8.7% if 4 threads are running.

Andrew: Could you merge it into -akpm and then forward it towards Linus' tree?

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 70480a3..1afbc57 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -96,7 +96,7 @@ struct sem {
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
struct list_head sem_pending; /* pending single-sop operations */
-};
+} cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
-- 
1.8.1.4




Re: linux-next: manual merge of the akpm tree with the tree

2013-05-06 Thread Manfred Spraul

Hi,

On 05/06/2013 05:51 AM, Stephen Rothwell wrote:

Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in ipc/sem.c
between commit 321310ced2d6 ("ipc: move sem_obtain_lock() rcu locking
into the only caller") from Linus' tree and commit "ipc/sem.c:
alternatives to preempt_disable()" from the akpm tree.
I would propose that 'ipc/sem.c: alternatives to preempt_disable()' is 
dropped.


As explained in the commit:
That patch is for CONFIG_PREEMPT_RT, it should go into their tree.

--
Manfred


ipc/sem.c: Lockup with complex ops, comments not updated

2013-05-18 Thread Manfred Spraul

Hi Rik,

I like your change to the ipc/sem locking:
A scheme with a per-semaphore lock and without the overhead of always 
acquiring both the global and the per-semaphore lock.


But:
1) I found one bug with your sem locking changes:
If
- a complex operation is sleeping [would be woken up by update_queue(,-1)]
- a simple op is sleeping
- the success of the simple op would allow the complex op to complete
[i.e.: update_queue(,sem_num) changes the semaphore value to the 
value that the complex op waits on]

- an operation wakes up the simple op.

then the complex op is not woken up.
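
For illustration, a sketch of a reproducer for this sequence (hypothetical, not
the test case attached later; error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
	struct sembuf complex_op[2] = { {0, 0, 0}, {1, 1, 0} }; /* wait-for-zero on sem 0, then +1 on sem 1 */
	struct sembuf simple_dec = { 0, -2, 0 };
	struct sembuf simple_inc = { 0, 1, 0 };
	union semun arg;
	int id = semget(IPC_PRIVATE, 2, 0600);

	arg.val = 1;
	semctl(id, 0, SETVAL, arg);		/* sem 0 = 1 */

	if (fork() == 0) {			/* the complex op: sleeps in the global queue */
		semop(id, complex_op, 2);
		exit(0);
	}
	sleep(1);
	if (fork() == 0) {			/* the simple op: sleeps in sem 0's queue */
		semop(id, &simple_dec, 1);
		exit(0);
	}
	sleep(1);
	semop(id, &simple_inc, 1);		/* wakes the simple op: sem 0 goes 1 -> 2 -> 0 */
	sleep(1);

	/* the simple op left sem 0 at 0, so the complex op could now complete;
	 * if it did, sem 1 is 1 - if it is still 0, the wakeup was missed */
	printf("sem 1 = %d (expected 1)\n", semctl(id, 1, GETVAL));
	semctl(id, 0, IPC_RMID, arg);		/* also unblocks any stuck child with EIDRM */
	return 0;
}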

One fix would be a loop in do_smart_update():
- first check the global queue
- then the per-semaphore queues
- if one of the per-semaphore queues made progress: check the global 
queue again

- if the global queue made progress: check the per semaphore queues again
...
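
A rough sketch of such a loop (function names as in the current code; otime
handling left out, untested):

static void do_smart_update(struct sem_array *sma, struct sembuf *sops,
			    int nsops, int otime, struct list_head *pt)
{
	int i, progress;

	do {
		progress = 0;
		/* first the global queue (complex operations) ... */
		if (sma->complex_count)
			progress |= update_queue(sma, -1, pt);
		/* ... then the per-semaphore queues ... */
		for (i = 0; i < sma->sem_nsems; i++)
			progress |= update_queue(sma, i, pt);
		/* ... and start over as long as either side made progress */
	} while (progress);
}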

2) Your patches remove FIFO ordering of the wakeups:
As far as I can see complex ops are now preferred over simple ops.
It's not a bug, no one except linux implements FIFO.
But the comment at line 28 should be updated.

Should I write a patch, or do you want to fix it yourself?

--
Manfred


[PATCH] ipc/sem.c: fix lockup, restore FIFO behavior

2013-05-25 Thread Manfred Spraul
Hi Rik,

I came up with a completely different approach:

The patch
a) fixes a lockup due to a missing restart.
b) makes the code again FIFO.

Changes:
- the wait-for-zero operations are moved into separate lists. Thus they can
  be checked separately, without rescanning the whole queue.
- If a complex operation arrives, then all pending change operations are
  moved into the global queue. This keeps everything FIFO.

Advantage:
- Fewer restarts in update_queue(), because pending wait-for-zero do not
  force a restart anymore.
- Efficient handling of wait-for-zero semops, both simple and complex.
- FIFO. Dropping FIFO is a user visible change, and I'm a coward.
- simpler check_restart logic.

Disadvantage:
When one complex operation arrives, then the semaphore array goes into a
complex_present mode that always acquires the global lock. Even when the
complex operations have completed, pending simple decrease operations
prevent the array from switching back. The switch happens when
there are only simple wait-for-zero semops (or no semops at all).

But: Let's wait and see whether this really exists: An application that rarely
does complex operations (and that doesn't prefer FIFO semantics).

Other changes:
- try_atomic_semop() also performs the semop. Thus rename the function.

It passes tests with qemu, but not boot-tested due to EFI problems.

Signed-off-by: Manfred Spraul 

---
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..3f2c6c8 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,10 +15,14 @@ struct sem_array {
time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
-   struct list_headsem_pending;/* pending operations to be 
processed */
+   struct list_headpending_alter;  /* pending operations */
+   /* that alter the array */
+   struct list_headpending_const;  /* pending complex operations */
+   /* that do not alter semvals */
struct list_headlist_id;/* undo requests on this array 
*/
int sem_nsems;  /* no. of semaphores in array */
-   int complex_count;  /* pending complex operations */
+   int complex_present;
+   /* pending complex semops? */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index a7e40ed..bf220de 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
-   struct list_head sem_pending; /* pending single-sop operations */
+   struct list_head pending_alter; /* pending single-sop operations */
+   /* that alter the semaphore */
+   struct list_head pending_const; /* pending single-sop operations */
+   /* that do not alter the semaphore*/
 };
 
 /* One queue for each sleeping process in the system. */
@@ -150,12 +153,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define SEMOPM_FAST64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  * sem_undo.id_next,
- * sem_array.sem_pending{,last},
- * sem_array.sem_undo: sem_lock() for read/write
+ * sem_array.complex_present,
+ * sem_array.pending{_alter,_cont},
+ * sem_array.sem_undo: global sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
+ * sem_array.sem_base[i].pending_{const,alter}:
+ * global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -196,9 +202,9 @@ void __init sem_init (void)
  * multiple semaphores in our own semops, or we need to look at
  * semaphores from other pending complex operations.
  *
- * Carefully guard against sma->complex_count changing between zero
+ * Carefully guard against sma->complex_present changing between zero
  * and non-zero while we are spinning for the lock. The value of
- * sma->complex_count cannot change while we are holding the lock,
+ * sma->complex_present cannot change while we are holding the lock,
  * so sem_unlock should be fine.
  *
  * The global lock path checks that all the local locks have been released,
@@ -210,17 +216,17 @@ static inline int sem_lock(struct sem_array *sma, struct 
sembuf *sops,
 {
int locknum;
  again:
-   if (nsops == 1 && !sma->complex_count) {
+   if (nsops == 1 && !sma->complex_present) {
struct sem *sem = sm

Re: [PATCH 06/10] ipc: Don't allocate a copy larger than max

2013-05-25 Thread Manfred Spraul

Hi Peter,

You wrote:

When MSG_COPY is set, a duplicate message must be allocated for
the copy before locking the queue. However, the copy could
not be larger than was sent which is limited to msg_ctlmax.

Signed-off-by: Peter Hurley 
---
  ipc/msg.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 950572f..31cd1bf 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -820,15 +820,17 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
struct msg_msg *copy = NULL;
unsigned long copy_number = 0;
  
+	ns = current->nsproxy->ipc_ns;

+
if (msqid < 0 || (long) bufsz < 0)
return -EINVAL;
if (msgflg & MSG_COPY) {
-   copy = prepare_copy(buf, bufsz, msgflg, &msgtyp, ©_number);
+   copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax),
+   msgflg, &msgtyp, ©_number);


What about:
- increase msg_ctlmax
- send message
- reduce msg_ctlmax
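
Sketched as a small test program (root only, because it writes the sysctl; the
sizes are picked so the message still fits the default msgmnb of 16384):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

static void set_msgmax(long val)	/* msg_ctlmax is /proc/sys/kernel/msgmax */
{
	FILE *f = fopen("/proc/sys/kernel/msgmax", "w");

	if (!f) {
		perror("msgmax");
		exit(1);
	}
	fprintf(f, "%ld\n", val);
	fclose(f);
}

int main(void)
{
	static struct { long mtype; char mtext[16384]; } msg = { 1, { 0 } };
	int id = msgget(IPC_PRIVATE, 0600);

	set_msgmax(16384);			/* 1) increase msg_ctlmax (default is 8192) */
	msgsnd(id, &msg, sizeof(msg.mtext), 0);	/* 2) queue a 16k message */
	set_msgmax(8192);			/* 3) reduce msg_ctlmax again */

	/* a later msgrcv(..., MSG_COPY) would size the duplicate by the new,
	 * smaller msg_ctlmax although the queued message is 16k */
	return 0;
}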

The side effects of the patch are odd:
- without MSG_COPY, a message can be read regardless of the size.
  The user could check for E2BIG and increase the buffer size until
  msgrcv succeeds.
- with MSG_COPY, something else would happen.
  As far as I can see, it would oops: msg_ctlmax bytes are allocated,
  then the E2BIG test is against bufsz, and copy_msg() doesn't check
  the size of the target buffer.

I.e.: I would propose to revert the patch.

--
Manfred
 





Re: [PATCH] ipc/sem.c: fix lockup, restore FIFO behavior

2013-05-25 Thread Manfred Spraul

Hi Rik,

On 05/25/2013 03:49 PM, Rik van Riel wrote:

On 05/25/2013 04:54 AM, Manfred Spraul wrote:


But: Let's wait and see whether this really exists: An application that rarely
does complex operations (and that doesn't prefer FIFO semantics).


I do not like that downside at all.

The danger of staying in "too slow to be useful" mode forever
is really not a risk I want to take.



I've modified the patch so that it leaves the mode as soon as possible.
Are you aware of any real-life apps that use multiple semops?

--
Manfred


[PATCH v2] ipc/sem.c: fix lockup, restore FIFO behavior

2013-05-25 Thread Manfred Spraul
The double coward solution:
- wakeup stays FIFO
- fast switch back to per-semaphore spinlock mode

The patch
a) fixes a lockup due to a missing restart.
b) makes the wakeups again FIFO (as linux <= 3.0.9)
c) tries to limit the time while in global lock mode as much
   as possible. (same as linux-3.0.10-rc1)

Changes:
- the wait-for-zero operations are moved into separate lists. Thus they can
  be checked separately, without rescanning the whole queue.
- If a complex operation must sleep, then all pending change operations are
  moved into the global queue. This keeps everything FIFO.
- When all complex operations have completed, the simple ops are moved
  back into the per-semaphore queues.

Advantage:
- FIFO. Dropping FIFO is a user visible change, and I'm a coward.
- simpler check_restart logic.
- Efficient handling of wait-for-zero semops, both simple and complex.
- Fewer restarts in update_queue(), because pending wait-for-zero do not
  force a restart anymore.

Other changes:
- try_atomic_semop() also performs the semop. Thus rename the function.

It passes tests with qemu, but not boot-tested due to EFI problems.

Signed-off-by: Manfred Spraul 

---
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..55e17f6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,7 +15,10 @@ struct sem_array {
time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
-   struct list_headsem_pending;/* pending operations to be 
processed */
+   struct list_headpending_alter;  /* pending operations */
+   /* that alter the array */
+   struct list_headpending_const;  /* pending complex operations */
+   /* that do not alter semvals */
struct list_headlist_id;/* undo requests on this array 
*/
int sem_nsems;  /* no. of semaphores in array */
int complex_count;  /* pending complex operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index a7e40ed..81d32a7 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
-   struct list_head sem_pending; /* pending single-sop operations */
+   struct list_head pending_alter; /* pending single-sop operations */
+   /* that alter the semaphore */
+   struct list_head pending_const; /* pending single-sop operations */
+   /* that do not alter the semaphore*/
 };
 
 /* One queue for each sleeping process in the system. */
@@ -150,12 +153,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define SEMOPM_FAST64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  * sem_undo.id_next,
- * sem_array.sem_pending{,last},
- * sem_array.sem_undo: sem_lock() for read/write
+ * sem_array.complex_count,
+ * sem_array.pending{_alter,_cont},
+ * sem_array.sem_undo: global sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
+ * sem_array.sem_base[i].pending_{const,alter}:
+ * global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -189,6 +195,53 @@ void __init sem_init (void)
IPC_SEM_IDS, sysvipc_sem_proc_show);
 }
 
+/**
+ * unmerge_queues - unmerge queues, if possible.
+ * @sma: semaphore array
+ *
+ * The function unmerges the wait queues if complex_count is 0.
+ * It must be called prior to dropping the global semaphore array lock.
+ */
+static void unmerge_queues(struct sem_array *sma)
+{
+   struct sem_queue *q, *tq;
+
+   /* complex operations still around? */
+   if (sma->complex_count)
+   return;
+   /*
+* We will switch back to simple mode.
+* Move all pending operation back into the per-semaphore
+* queues.
+*/
+   list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
+   struct sem *curr;
+   curr = &sma->sem_base[q->sops[0].sem_num];
+
+   list_add_tail(&q->list, &curr->pending_alter);
+   }
+   INIT_LIST_HEAD(&sma->pending_alter);
+}
+
+/**
+ * merge_queues - Merge single semop queues into global queue
+ * @sma: semaphore array
+ *
+ * This function merges all per-semaphore queues into the global queue.
+ * It is necessary to achieve FIFO ordering for the pending single-sop
+ * operations when a multi-semop 

Re: [PATCH v2] ipc/sem.c: fix lockup, restore FIFO behavior

2013-05-25 Thread Manfred Spraul

On 05/25/2013 08:32 PM, Davidlohr Bueso wrote:

Yep, could you please explain what benefits you see in keeping FIFO order?

a) It's user space visible.

b) It's a well-defined behavior that might even make sense for some 
applications.
Right now, a 2 semop operation with "+1, then -2" is prioritized over 
a semop with "-1".
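For illustration, the two competing operations could be exercised with a
small test program like the one below (hypothetical sketch, not part of the
patch series; which waiter completes first is exactly the priority question):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

int main(void)
{
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        struct sembuf complex_op[2] = {
                { 0, 1, 0 },    /* +1 on sem 0 ... */
                { 0, -2, 0 },   /* ... then -2 on sem 0, checked atomically */
        };
        struct sembuf simple_op = { 0, -1, 0 };
        struct sembuf wake = { 0, 1, 0 };

        if (semid == -1) {
                perror("semget");
                return 1;
        }
        if (fork() == 0) {      /* waiter A: complex "+1, then -2" */
                int r = semop(semid, complex_op, 2);
                printf("complex op: %s\n", r == 0 ? "done" : "interrupted");
                exit(0);
        }
        if (fork() == 0) {      /* waiter B: simple "-1" */
                int r = semop(semid, &simple_op, 1);
                printf("simple -1: %s\n", r == 0 ? "done" : "interrupted");
                exit(0);
        }
        sleep(1);               /* let both waiters sleep on semval 0 */
        semop(semid, &wake, 1); /* semval becomes 1: which waiter runs? */
        sleep(1);
        semctl(semid, 0, IPC_RMID);     /* unblocks the loser with EIDRM */
        wait(NULL);
        wait(NULL);
        return 0;
}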


And: It doesn't cost much:
- no impact for users that use only single-op operations.
- no impact for users that use only multi-op operations
- for users that use both types: In the worst case some linked list 
splicing.


Actually, the code is probably faster because wait-for-zero ops are only 
scanned when the semaphore values are 0.



Acked-by: Rik van Riel 


- simpler check_restart logic.
- Efficient handling of wait-for-zero semops, both simple and complex.
- Fewer restarts in update_queue(), because pending wait-for-zero do not
force a restart anymore.

Other changes:
- try_atomic_semop() also performs the semop. Thus rename the function.

It passes tests with qemu, but not boot-tested due to EFI problems.

I think this still needs a *lot* of testing - I don't have my Oracle
workload available right now, but I will definitely see how this patch
behaves on it. That said, I believe Oracle is already quite happy
with the sem improvements.

Ah, ok.
The change is good for one application and the risk of breaking other 
apps is considered as negligible.




Furthermore, this patch is way too invasive for considering it for 3.10
- I like Rik's patch better because it simply addresses the issue and
nothing more.

I would disagree:
My patch is testable - with it applied, linux-3.0.10 should behave 
exactly as linux-3.0.9.

Except the scalability - the new sem_lock from Rik is great.

My problem with Rik's patch is that it is untestable:
It changes the behavior and we must hope that nothing breaks.

Actually, the latest patch makes it a bit worse:

@@ -720,16 +718,11 @@ static int update_queue(struct sem_array *sma, int 
semnum, struct list_head *pt)
  
  		unlink_queue(sma, q);
  
-		if (error) {

-   restart = 0;
-   } else {
-   semop_completed = 1;
-   restart = check_restart(sma, q);
-   }
+   semop_completed = 1;
+   if (check_restart(sma, q))
+   *restart = 1;
  
  		wake_up_sem_queue_prepare(pt, q, error);

-   if (restart)
-   goto again;
If check_restart returns "1", then the current (3.0.10-rc1) code 
restarts immediately ("goto again").
Now the rest of the queue is processed completely and only afterwards it 
is scanned again.


This means that wait-for-zero now succeeds only if a semaphore value 
stays zero.

For 3.0.9, it was sufficient if the value was temporarily zero.
Before the change, complex wait-for-zero would work, only simple 
wait-for-zero would be starved.

Now all operations are starved.

I've attached a test case:
./test5.sh
linux-3.0.9 completes all operations
With Rik's patch, the wait-for-zero remains running.

--
Manfred

P.S.:
Btw, I found some code that uses a semop with 2 ops:
http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=%2Fapis%2Fapiexusmem.htm
/*
 * Copyright (C) 1999 by Manfred Spraul.
 * 
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL)
 * $Header: /home/manfred/cvs-tree/manfred/ipcsem/dec.c,v 1.5 2003/06/17 
16:16:55 manfred Exp $
 */

#include <stdio.h>      /* note: include list reconstructed, lost in archive */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define TRUE    1
#define FALSE   0

union semun {
int val;
struct semid_ds *buf;
unsigned short int *array;
struct seminfo* __buf;
};


int main(int argc,char** argv)
{
int id;
int key;
int res;
int *nr;
int *val;
int i;
int sems;

printf("change[...]\n");
if(argc < 4 || ((argc % 2) == 1)) {
printf("Invalid parameters.\n");
return 1;
}

key = atoi(argv[1]);
if(key == 0) {
printf("Invalid parameters: Key invalid.\n");
return 1;
}
if (key > 0) {
id = semget(key,1,0);
if(id == -1) {
printf(" findkey() failed.\n");
return 1;
}
} else {
id = -key;
printf(" findkey() bypassed, using id %d.\n", id);
}
sems = (argc-2)/2;
nr=(int*)malloc(sizeof(int)*sems);
val=(int*)malloc(sizeof(int)*sems);
if (!nr || !val) {
printf("Malloc failed.\n");
    return 1;
}
printf("pid %d: changing %d semaphores:\n",getpid(),

Re: [PATCH] ipc,sem: move restart loop to do_smart_update

2013-05-25 Thread Manfred Spraul

Hi Rik,

On 05/20/2013 12:32 AM, Rik van Riel wrote:

@@ -751,17 +744,19 @@ static int update_queue(struct sem_array *sma, int 
semnum, struct list_head *pt)
  static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int 
nsops,
int otime, struct list_head *pt)
  {
-   int i;
+   int i, restart;
  
+ again:

+   restart = 0;
if (sma->complex_count || sops == NULL) {
-   if (update_queue(sma, -1, pt))
+   if (update_queue(sma, -1, pt, &restart))
otime = 1;
}

I didn't notice it immediately, this is insufficient:
Complex operations may affect all semaphores in the array.
Thus it is not sufficient to scan the per-semaphore queues of the 
semaphores that were touched by *sops, all must be scanned.


Perhaps something like:
> +if (update_queue(sma, -1, pt, &restart))
> +sops = NULL;
>  otime = 1;
>  }
(untested!)

Test case:
1: semop(, {{1, -1, 0}, {2, 1, 0}}, 2 
2: semop(, {{2, -1, 0}}, 1 
3: semop(, {{1, 1, 0}}, 1) = 0

[3] is able to run, calls do_smart_update().
The global queue is scanned and [1] is released
Without the "sops = NULL", [2] sleeps forever.

--
Manfred

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3

2013-05-26 Thread Manfred Spraul
I've split my patch into 4 parts:
- 1: Fix-missing-wakeups-in-do_smart_update
- 2: seperate-wait-for-zero-and-alter-tasks
- 3: Always-use-only-one-queue-for-alter-operations
- 4: Rename-try_atomic_semop-to-perform_atomic

Linus:
- Patch 1 should be merged immediately: It fixes bugs,
  the current code misses wakeups.

- Patch 2 and 3 restore the behavior of linux <=3.0.9.
  I would propose that they are merged, too: I can't rule out that
  changing the priority of the wakeups breaks user space apps.

- Patch 4 is trivial, no code changes at all.
  If 2+3 are merged, then 4 should be merged, too.

I have tested patch 1 seperately and 1+2+3+4:
With patch 1 applied, there are no more missed wakeups.

With all 4 applied, linux-3.0.10-rc1 behaves as linux <=3.0.9.

With regards to the scalability, I do not expect any degradation:
Operations on separate semaphores in an array remain parallelized.

Apps that use lots of wait-for-zero semop are probably even faster,
because the wait-for-zero ops are now only scanned if a semval is 0.
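
(For reference, a wait-for-zero operation is simply a semop with sem_op == 0;
minimal sketch, semid assumed to be a valid semaphore set:)

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* blocks until semaphore 0 of the set has the value 0 */
static int wait_for_zero(int semid)
{
        struct sembuf op = { 0, 0, 0 };  /* sem_num 0, sem_op 0, sem_flg 0 */
        return semop(semid, &op, 1);
}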

--
Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue()

2013-05-26 Thread Manfred Spraul
do_smart_update_queue() is called when an operation
(semop, semctl(SETVAL), semctl(SETALL), ...) modified the array.
It must check which of the sleeping tasks can proceed.

do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
  must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue
  must be scanned again

And:
- the test for "(sma->complex_count || sops == NULL)" before scanning the global
  queue is not required: If the global queue is empty, then
  it doesn't need to be scanned - regardless of the reason
  for calling do_smart_update_queue()

The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.
Avoiding unnecessary scans is implemented in the following patches.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index a7e40ed..70480a3 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -752,19 +752,29 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
int otime, struct list_head *pt)
 {
int i;
+   int progress;
 
-   if (sma->complex_count || sops == NULL) {
-   if (update_queue(sma, -1, pt))
+   progress = 1;
+retry_global:
+   if (sma->complex_count) {
+   if (update_queue(sma, -1, pt)) {
+   progress = 1;
otime = 1;
+   sops = NULL;
+   }
}
+   if (!progress)
+   goto done;
 
if (!sops) {
/* No semops; something special is going on. */
for (i = 0; i < sma->sem_nsems; i++) {
-   if (update_queue(sma, i, pt))
+   if (update_queue(sma, i, pt)) {
otime = 1;
+   progress = 1;
+   }
}
-   goto done;
+   goto done_checkretry;
}
 
/* Check the semaphores that were modified. */
@@ -772,8 +782,15 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
if (sops[i].sem_op > 0 ||
(sops[i].sem_op < 0 &&
sma->sem_base[sops[i].sem_num].semval == 0))
-   if (update_queue(sma, sops[i].sem_num, pt))
+   if (update_queue(sma, sops[i].sem_num, pt)) {
otime = 1;
+   progress = 1;
+   }
+   }
+done_checkretry:
+   if (progress) {
+   progress = 0;
+   goto retry_global;
}
 done:
if (otime)
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/4] ipc/sem: separate wait-for-zero and alter tasks into separate queues

2013-05-26 Thread Manfred Spraul
Introduce separate queues for operations that do not modify the
semaphore values.
Advantages:
- Simpler logic in check_restart().
- Faster update_queue(): Right now, all wait-for-zero operations
  are always tested, even if the semaphore value is not 0.
- wait-for-zero gets again priority, as in linux <=3.0.9

Signed-off-by: Manfred Spraul 
---
 include/linux/sem.h |   5 +-
 ipc/sem.c   | 209 +---
 2 files changed, 153 insertions(+), 61 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 53d4265..55e17f6 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -15,7 +15,10 @@ struct sem_array {
time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
struct sem  *sem_base;  /* ptr to first semaphore in 
array */
-   struct list_headsem_pending;/* pending operations to be 
processed */
+   struct list_headpending_alter;  /* pending operations */
+   /* that alter the array */
+   struct list_headpending_const;  /* pending complex operations */
+   /* that do not alter semvals */
struct list_headlist_id;/* undo requests on this array 
*/
int sem_nsems;  /* no. of semaphores in array */
int complex_count;  /* pending complex operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 70480a3..ce25da3 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -95,7 +95,10 @@ struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
-   struct list_head sem_pending; /* pending single-sop operations */
+   struct list_head pending_alter; /* pending single-sop operations */
+   /* that alter the semaphore */
+   struct list_head pending_const; /* pending single-sop operations */
+   /* that do not alter the semaphore*/
 };
 
 /* One queue for each sleeping process in the system. */
@@ -152,7 +155,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 /*
  * linked list protection:
  * sem_undo.id_next,
- * sem_array.sem_pending{,last},
+ * sem_array.pending{_alter,_cont},
  * sem_array.sem_undo: sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
@@ -337,7 +340,7 @@ static inline void sem_rmid(struct ipc_namespace *ns, 
struct sem_array *s)
  * Without the check/retry algorithm a lockless wakeup is possible:
  * - queue.status is initialized to -EINTR before blocking.
  * - wakeup is performed by
- * * unlinking the queue entry from sma->sem_pending
+ * * unlinking the queue entry from the pending list
  * * setting queue.status to IN_WAKEUP
  *   This is the notification for the blocked thread that a
  *   result value is imminent.
@@ -418,12 +421,14 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
sma->sem_base = (struct sem *) &sma[1];
 
for (i = 0; i < nsems; i++) {
-   INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+   INIT_LIST_HEAD(&sma->sem_base[i].pending_alter);
+   INIT_LIST_HEAD(&sma->sem_base[i].pending_const);
spin_lock_init(&sma->sem_base[i].lock);
}
 
sma->complex_count = 0;
-   INIT_LIST_HEAD(&sma->sem_pending);
+   INIT_LIST_HEAD(&sma->pending_alter);
+   INIT_LIST_HEAD(&sma->pending_const);
INIT_LIST_HEAD(&sma->list_id);
sma->sem_nsems = nsems;
sma->sem_ctime = get_seconds();
@@ -609,60 +614,130 @@ static void unlink_queue(struct sem_array *sma, struct 
sem_queue *q)
  * update_queue is O(N^2) when it restarts scanning the whole queue of
  * waiting operations. Therefore this function checks if the restart is
  * really necessary. It is called after a previously waiting operation
- * was completed.
+ * modified the array.
+ * Note that wait-for-zero operations are handled without restart.
  */
 static int check_restart(struct sem_array *sma, struct sem_queue *q)
 {
-   struct sem *curr;
-   struct sem_queue *h;
-
-   /* if the operation didn't modify the array, then no restart */
-   if (q->alter == 0)
-   return 0;
-
-   /* pending complex operations are too difficult to analyse */
-   if (sma->complex_count)
+   /* pending complex alter operations are too difficult to analyse */
+   if (!list_empty(&sma->pending_alter))
return 1;
 
/* we were a sleeping complex o

[PATCH 3/4] ipc/sem.c: Always use only one queue for alter operations.

2013-05-26 Thread Manfred Spraul
There are two places that can contain alter operations:
- the global queue: sma->pending_alter
- the per-semaphore queues: sma->sem_base[].pending_alter.

Since one of the queues must be processed first, this causes an odd
prioritization of the wakeups:
Right now, complex operations have priority over simple ops.

The patch restores the behavior of linux <=3.0.9: The longest
waiting operation has the highest priority.

This is done by using only one queue:
- if there are complex ops, then sma->pending_alter is used.
- otherwise, the per-semaphore queues are used.

As a side effect, do_smart_update_queue() becomes much simpler:
No more goto logic.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 128 ++
 1 file changed, 88 insertions(+), 40 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ce25da3..9ed3853 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -192,6 +192,53 @@ void __init sem_init (void)
IPC_SEM_IDS, sysvipc_sem_proc_show);
 }
 
+/**
+ * unmerge_queues - unmerge queues, if possible.
+ * @sma: semaphore array
+ *
+ * The function unmerges the wait queues if complex_count is 0.
+ * It must be called prior to dropping the global semaphore array lock.
+ */
+static void unmerge_queues(struct sem_array *sma)
+{
+   struct sem_queue *q, *tq;
+
+   /* complex operations still around? */
+   if (sma->complex_count)
+   return;
+   /*
+* We will switch back to simple mode.
+* Move all pending operation back into the per-semaphore
+* queues.
+*/
+   list_for_each_entry_safe(q, tq, &sma->pending_alter, list) {
+   struct sem *curr;
+   curr = &sma->sem_base[q->sops[0].sem_num];
+
+   list_add_tail(&q->list, &curr->pending_alter);
+   }
+   INIT_LIST_HEAD(&sma->pending_alter);
+}
+
+/**
+ * merge_queues - Merge single semop queues into global queue
+ * @sma: semaphore array
+ *
+ * This function merges all per-semaphore queues into the global queue.
+ * It is necessary to achieve FIFO ordering for the pending single-sop
+ * operations when a multi-semop operation must sleep.
+ * Only the alter operations must be moved, the const operations can stay.
+ */
+static void merge_queues(struct sem_array *sma)
+{
+   int i;
+   for (i = 0; i < sma->sem_nsems; i++) {
+   struct sem *sem = sma->sem_base + i;
+
+   list_splice_init(&sem->pending_alter, &sma->pending_alter);
+   }
+}
+
 /*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
@@ -262,6 +309,7 @@ static inline int sem_lock(struct sem_array *sma, struct 
sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
if (locknum == -1) {
+   unmerge_queues(sma);
spin_unlock(&sma->sem_perm.lock);
} else {
struct sem *sem = sma->sem_base + locknum;
@@ -829,49 +877,38 @@ static void do_smart_update(struct sem_array *sma, struct 
sembuf *sops, int nsop
int otime, struct list_head *pt)
 {
int i;
-   int progress;
 
otime |= do_smart_wakeup_zero(sma, sops, nsops, pt);
 
-   progress = 1;
-retry_global:
-   if (sma->complex_count) {
-   if (update_queue(sma, -1, pt)) {
-   progress = 1;
-   otime = 1;
-   sops = NULL;
-   }
-   }
-   if (!progress)
-   goto done;
-
-   if (!sops) {
-   /* No semops; something special is going on. */
-   for (i = 0; i < sma->sem_nsems; i++) {
-   if (update_queue(sma, i, pt)) {
-   otime = 1;
-   progress = 1;
+   if (!list_empty(&sma->pending_alter)) {
+   /* semaphore array uses the global queue - just process it. */
+   otime |= update_queue(sma, -1, pt);
+   } else {
+   if (!sops) {
+   /*
+* No sops, thus the modified semaphores are not
+* known. Check all.
+*/
+   for (i = 0; i < sma->sem_nsems; i++)
+   otime |= update_queue(sma, i, pt);
+   } else {
+   /*
+* Check the semaphores that were increased:
+* - No complex ops, thus all sleeping ops are
+*   decrease.
+* - if we decreased the value, then any sleeping
+*   semaphore ops wont be able to run: If the
+*   previous value was too small, then the new
+

[PATCH 4/4] ipc/sem.c: Rename try_atomic_semop() to perform_atomic_semop(), docu update

2013-05-26 Thread Manfred Spraul
Final cleanup: Some minor points that I noticed while writing the
previous 3 patches

1) The name try_atomic_semop() is misleading: The function performs the
   operation (if it is possible).

2) Some documentation updates.

No real code change, a rename and documentation changes.

Signed-off-by: Manfred Spraul 
---
 ipc/sem.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 9ed3853..1dbb2fa 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -153,12 +153,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */
 
 /*
- * linked list protection:
+ * Locking:
  * sem_undo.id_next,
+ * sem_array.complex_count,
  * sem_array.pending{_alter,_cont},
- * sem_array.sem_undo: sem_lock() for read/write
+ * sem_array.sem_undo: global sem_lock() for read/write
  * sem_undo.proc_next: only "current" is allowed to read/write that field.
  * 
+ * sem_array.sem_base[i].pending_{const,alter}:
+ * global or semaphore sem_lock() for read/write
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -535,12 +538,19 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, 
semflg)
return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
 }
 
-/*
- * Determine whether a sequence of semaphore operations would succeed
- * all at once. Return 0 if yes, 1 if need to sleep, else return error code.
+/** perform_atomic_semop - Perform (if possible) a semaphore operation
+ * @sma: semaphore array
+ * @sops: array with operations that should be checked
+ * @nsems: number of sops
+ * @un: undo array
+ * @pid: pid that did the change
+ *
+ * Returns 0 if the operation was possible.
+ * Returns 1 if the operation is impossible, the caller must sleep.
+ * Negative values are error codes.
  */
 
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+static int perform_atomic_semop(struct sem_array *sma, struct sembuf *sops,
 int nsops, struct sem_undo *un, int pid)
 {
int result, sem_op;
@@ -723,8 +733,8 @@ static int wake_const_ops(struct sem_array *sma, int semnum,
q = container_of(walk, struct sem_queue, list);
walk = walk->next;
 
-   error = try_atomic_semop(sma, q->sops, q->nsops,
-   q->undo, q->pid);
+   error = perform_atomic_semop(sma, q->sops, q->nsops,
+q->undo, q->pid);
 
if (error <= 0) {
/* operation completed, remove from queue & wakeup */
@@ -835,7 +845,7 @@ again:
if (semnum != -1 && sma->sem_base[semnum].semval == 0)
break;
 
-   error = try_atomic_semop(sma, q->sops, q->nsops,
+   error = perform_atomic_semop(sma, q->sops, q->nsops,
 q->undo, q->pid);
 
/* Does q->sleeper still need to sleep? */
@@ -1658,7 +1668,6 @@ static int get_queue_result(struct sem_queue *q)
return error;
 }
 
-
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1756,7 +1765,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
if (un && un->semid == -1)
goto out_unlock_free;
 
-   error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+   error = perform_atomic_semop(sma, sops, nsops, un,
+   task_tgid_vnr(current));
if (error <= 0) {
if (alter && error == 0)
do_smart_update(sma, sops, nsops, 1, &tasks);
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] ipc/sem.c: Bug fixes, regression fixes, v3

2013-05-27 Thread Manfred Spraul

Hi Davidlohr,

On 05/26/2013 10:50 PM, Davidlohr Bueso wrote:


In lack of getting my swingbench DSS environment back, I ran these
changes against the semop-multi program on my laptop. For 256 threads,
with Manfred's patchset the ops/sec suffers around -7.3%.


Could you also check the performance of only patch#1?
I fear that it might be slower than all 4 together.

With regards to semop-multi:
Is this the tool?
http://marc.info/?l=linux-kernel&m=136208613626892&q=p3

I think the logic is wrong:
Locking a semaphore is subtraction, unlocking is addition.
Thus multiple tasks can run in parallel - and the task switch code is 
never triggered.
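For comparison, the conventional mutex-style pattern would be roughly the
following (hypothetical snippet; the semaphore is assumed to be initialized
to 1, so the decrements really serialize):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* "lock" (P): decrement; blocks while the semaphore is already 0 */
static int sem_p(int semid)
{
        struct sembuf op = { 0, -1, SEM_UNDO };
        return semop(semid, &op, 1);
}

/* "unlock" (V): increment; wakes up one blocked locker */
static int sem_v(int semid)
{
        struct sembuf op = { 0, 1, SEM_UNDO };
        return semop(semid, &op, 1);
}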
Could you double check that the number of context switches matches the 
output?


I usually use this tool:
http://marc.info/?l=linux-kernel&m=125038376609750


--
Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] ipc/sem.c: Fix missing wakeups in do_smart_update_queue()

2013-05-27 Thread Manfred Spraul

Hi Rik,

On 05/27/2013 07:51 PM, Rik van Riel wrote:


Very much not optimized, but we need to fix the regression.

Acked-by: Rik van Riel 


Thanks for the review.

I was afraid that the differences would be noticeable but they aren't:
At least on my laptop, the differences between the patches are within 
the noise:


# ./cat /proc/stat | grep ctxt;./osim 2 2 100 0 0;cat /proc/stat | 
grep ctxt

runtime:   stock      patch 1    patch 1-4
           2,0088875  2,041084   1,997367

# ./cat /proc/stat | grep ctxt;./osim 2 20 100 0 0;cat /proc/stat | 
grep ctxt

           4,32541    4,295956   4,277817

(assuming 2 cpus, otherwise increase "2" and "20" accordingly)

Here is the link to the tool:
http://marc.info/?l=linux-kernel&m=125038376609750

--
Manfred

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipc/sem.c: prevent ENOMEM in semop() w/ SEM_UNDO flag

2012-08-03 Thread Manfred Spraul
Hi Seiichi,

On 08/03/2012 02:49 PM, Seiichi Ikarashi wrote:
> semop() with SEM_UNDO sem_flg can result in ENOMEM even after
> succeeding semget() with large nsems.
How large is nsems, what is the use case?
Which kind of operations are performed?
Only simple semop(,,1) calls?

<linux/sem.h> still documents ~8000 as the upper limit, I'm not sure if
there are other codepaths that might fail as well.
If all are fixed, then the documentation should be updated as well.

>  This is because
> semop() uses kzalloc() via find_alloc_undo() though
> semget() uses vmalloc() via ipc_rcu_alloc().
> This patch makes semop() be able to use vmalloc() via ipc_alloc().
>
> Signed-off-by: Seiichi Ikarashi 

--
Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipc/sem.c: prevent ENOMEM in semop() w/ SEM_UNDO flag

2012-08-07 Thread Manfred Spraul
Hi Seiichi,

2012/8/6 Seiichi Ikarashi 
>
>
>  A real case was as follows.
>  semget(IPC_PRIVATE, 7, IPC_CREAT | IPC_EXCL);
>  sops[0].sem_num = 0;
>  sops[0].sem_op  = 1;
>  sops[0].sem_flg = SEM_UNDO;
>  semop(semid, sops, 1);
>

I think this can't work: sops[].sem_num is defined as "unsigned short".
Thus more than 65500 semaphores in one semaphore set do not make
any sense.
"unsigned short" is also specified in the opengroup standard:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/syssem.h.html

Thus: The hard limit is 65535. Perhaps slightly less, I haven't checked
if (-1) is used somewhere to indicate an error.

Is it possible to split the semaphores into multiple semaphore ids?
e.g. 70 ids, each with 1000 semaphores?

The atomicity would be lost (e.g. all SEM_UNDO operations within
one id are performed at once. With 70 ids, the SEM_UNDOs are not
atomic anymore)
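
A rough sketch of such a split (hypothetical helper, untested; note that
SEM_UNDO then only covers one id, so the undo is no longer atomic across the
whole logical set):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define SEMS_PER_ID     1000

/* 70 smaller arrays instead of one array with 70000 semaphores */
struct sem_split {
        int ids[70];
};

static int split_semop(struct sem_split *s, int logical_nr, short delta)
{
        struct sembuf op;

        op.sem_num = logical_nr % SEMS_PER_ID;
        op.sem_op  = delta;
        op.sem_flg = SEM_UNDO;
        return semop(s->ids[logical_nr / SEMS_PER_ID], &op, 1);
}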

>
>#define SEMMSL  250 /* <= 8 000 max num of semaphores per id */
>

As far as I can see your patch removes the last part of the code that
caused the restriction to 8.000 semaphores per id.

Thus I'd propose that your patch changes this line to

+ #define SEMMSL  250 /* <= 65 500 max num of semaphores per id */

And:
I would add a comment into the patch description all semaphores
from one id share a single kernel spinlock. This could be changed, but
it would
a) add complexity for all users and
b) change user space visible behavior
Thus I would prefer to avoid to implement it unless there are real
applications that need this implementation.

--
  Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 08/10] IPC: message queue copy feature introduced

2012-08-11 Thread Manfred Spraul
Hi Stanislav,

2012/8/10 Stanislav Kinsbursky :
> This patch is required for checkpoint/restore in userspace.
> IOW, c/r requires some way to get all pending IPC messages without deleting
> them from the queue (checkpoint can fail and in this case tasks will be 
> resumed,
> so queue have to be valid).
> To achive this, new operation flag MSG_COPY for sys_msgrcv() system call was
> introduced. Also, copy counter was added to msg_queue structure. It's set to
> zero by default and increases by one on each copy operation and decreased by
> one on each receive operation until reaches zero.

Is msq->q_copy_cnt really necessary?
As far as I can see user space needs the ability to read the n-th message.

The implementation adds a state variable to the kernel, adds two
automatic updates of the state into msgrcv() (an increase during
MSG_COPY, a decrease during normal receives) and adds a msgctl() to
set the state to a certain value.

a) What about the simpler approach:
- if MSG_COPY is set, then @mtype is interpreted as the number of the
message that should be copied.
  If there are less than @mtype messages, then -ENOMSG is returned.

b) I do not understand the purpose of the decrease of msq->q_copy_cnt:
Do you want to handle normal msgrcv() calls in parallel with
msgrcv(|MSG_COPY) calls?
I don't think that this will work:
What if msq->q_copy_cnt is 1 and a msgrcv() call receives the 20th
message in the queue?

--
  Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 08/10] IPC: message queue copy feature introduced

2012-08-12 Thread Manfred Spraul
Hi Stanislav,

2012/8/11 Stanislav Kinsbursky :
> 11.08.2012 15:20, Manfred Spraul пишет:
>> a) What about the simpler approach:
>> - if MSG_COPY is set, then @mtype is interpreted as the number of the
>> message that should be copied.
>>If there are less than @mtype messages, then -ENOMSG is returned.
>
>
> Hi, Manfred.
> Your approach is simplier, but makes the call less generic and adds
> limitations.
> I.e. sys_msgrcv() allows you to receive message by type. And from my pow
> this logic have to be preserved - you can specify type and then copy all the
> messages of specified type.
>

Your implementation adds the ability to select a message for MSG_COPY by id.
But I think the price is way too high:
a) added complexity
b) I didn't notice it immediately:
The implementation means that MSG_COPY cannot be used at all by
multiple processes:

task 1:  msgrcv(id,buf,len,0,MSG_COPY).
task 2:  msgrcv(id,buf,len,0,MSG_COPY).

It is unpredictable if a task receives the first or the 2nd message
from the queue.

task 1:  int msgnr=0;
msgctl(id,MSG_SET_COPY,&msgnr)
msgrcv(id,buf,len,0,MSG_COPY).
task 2:  int msgnr=0;
msgctl(id,MSG_SET_COPY,&msgnr)
msgrcv(id,buf,len,0,MSG_COPY).

Doesn't work either, it's a race condition.

>
>> b) I do not understand the purpose of the decrease of msq->q_copy_cnt:
>> Do you want to handle normal msgrcv() calls in parallel with
>> msgrcv(|MSG_COPY) calls?
>
>
> Actually, I'm not going to copy a message from a queue, when somebody is
> reading from it. But better to handle this case by decreasing
> msq->q_copy_cnt, because otherwise this counter becomes invalid in case of
> somebody is reading from queue. And this logic is similar to new "peek"
> logic for sockets (introduced in 3.4 or 3.5).
> But I understand, that in case of queue with messages with different types
> this approach works only if mtype is not specified for copy operation.
> Otherwise result is unpredictable.
>
a) If the result is unpredictable when mtype is used, does it make
   sense to implement msgrcv(id,buf,len,id=,MSG_COPY)?

b) The result is also unpredictable if mtype is used for the "normal"
   msgrcv() (i.e. without MSG_COPY) call.
>
>> I don't think that this will work:
>> What if msq->q_copy_cnt is 1 and and msgrcv() call receives the 20th
>> message in the queue?
>
>
> By "receives" you mean "copied"? If so, then it can happen only if mtype was
> specified. And this logic is a part of current implementation.
>

I was thinking about the following case:

task 1: /* copy all entries */
  errno=0;
  for (i=0;errno==0;i++)
 msgrcv(msgid,buf[i],buflen,0,MSG_COPY);

task 2: /* receive a message with a given ID */
  msgrcv(msqid,buf,buflen,123,0);

Now suppose that task 1 has read 5 messages from the queue and then
task 2 tries to receive the message with the id=123. Suppose this is
the 20th message in the queue.

Result: task 1 will copy the message 5 twice.

I would keep it simple - unless there is a clear use case where "peek
by id" is useful.

Or - since MSG_COPY is linux specific anyway:
What about storing the number of the message that should be returned in *msgp?
Store it as "int64", just to avoid any 32/64 bit issues.

--
  Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: uninteruptable sleep

2001-04-03 Thread Manfred Spraul

> ps xl:
>   F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
> 040 1000 1230 1 9 0 24320 4 down_w D ? 0:00
>   /home/data/mozilla/obj/dist/bin/mozi
>
down_w

Perhaps down_write_failed()? 2.4.3 converted the mmap semaphore to a
rw-sem.
Did you compile sysrq into your kernel? Then enable it with

#echo 1 > /proc/sys/kernel/sysrq
and press <alt>+<sysrq>+'t'

It prints the complete back trace, not just one function name

--
Manfred



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: softirq buggy [Re: Serial port latency]

2001-04-04 Thread Manfred Spraul

From: "Pavel Machek" <[EMAIL PROTECTED]>
>
> > Ok, there are 2 bugs that are (afaics) impossible to fix without
> > checking for pending softirq's in cpu_idle():
> >
> > a)
> > queue_task(my_task1, tq_immediate);
> > mark_bh();
> > schedule();
> > ;within schedule: do_softirq()
> > ;within my_task1:
> > mark_bh();
> > ; bh returns, but do_softirq won't loop
> > ; do_softirq returns.
> > ; schedule() clears current->need_resched
> > ; idle thread scheduled.
> > --> idle can run although softirq's are pending
>
> Or anything else can run altrough softirqs are pending. If it is
> computation job, softinterrupts are delayed quiet a bit, right?
>
> So right fix seems to be "loop in do_softirq".
>
No, it's the wrong fix.
A network server under high load would loop forever within the softirq,
never returning to process level.

do_softirq cannot loop, the right fix is "check often for pending
softirq's".
It's checked before a process returns to user space, it's checked when a
process schedules. What's missing is that the idle functions must check
for pending softirqs, too.

--
Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: linux 2.2.19 smp interrupt problems?

2001-04-07 Thread Manfred Spraul

> 
> everything seems to work fine. are these interrupt problems "bad" or merely 
> indicators of a high load. can/should one get rid of them? 
>
Could you try the 8139too driver?

It's a bug in the rtl8139 driver, and under really high load it can
cause crashes.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: softirq buggy [Re: Serial port latency]

2001-04-08 Thread Manfred Spraul

Pavel Machek wrote:
> 
> Ok. I was missing the fact it is checked on going userspace.
>

What about the attached patch?

--
Manfred

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 3
//  EXTRAVERSION = -ac3
--- 2.4/include/linux/interrupt.h   Thu Feb 22 22:29:53 2001
+++ build-2.4/include/linux/interrupt.h Sat Apr  7 23:51:31 2001
@@ -269,4 +269,25 @@
 extern int probe_irq_off(unsigned long);   /* returns 0 or negative on failure */
 extern unsigned int probe_irq_mask(unsigned long); /* returns mask of ISA 
interrupts */
 
+/**
+ * cpu_is_idle - helper function for idle functions
+ * 
+ * pm_idle functions must call this function to verify that
+ * the cpu is really idle. It must be called with disabled
+ * local interrupts.
+ * Return values:
+ * 0: cpu was not idle.
+ * 1: go into power saving mode.
+ */
+static inline int cpu_is_idle(void)
+{
+   if (current->need_resched) {
+   return 0;
+   }
+   if (softirq_active(smp_processor_id()) & softirq_mask(smp_processor_id())) {
+   do_softirq();
+   return 0;
+   }
+   return 1;
+}
 #endif
--- 2.4/arch/i386/kernel/process.c  Thu Feb 22 22:28:52 2001
+++ build-2.4/arch/i386/kernel/process.cSat Apr  7 23:51:46 2001
@@ -73,6 +73,7 @@
hlt_counter--;
 }
 
+
 /*
  * We use this if we don't have any better
  * idle routine..
@@ -81,7 +82,7 @@
 {
if (current_cpu_data.hlt_works_ok && !hlt_counter) {
__cli();
-   if (!current->need_resched)
+   if (cpu_is_idle())
safe_halt();
else
__sti();
@@ -92,26 +93,21 @@
  * On SMP it's slightly faster (but much more power-consuming!)
  * to poll the ->need_resched flag instead of waiting for the
  * cross-CPU IPI to arrive. Use this option with caution.
+ *
+ * Isn't this identical to default_idle with the 'no-hlt' boot
+ * option? <[EMAIL PROTECTED]>
  */
 static void poll_idle (void)
 {
int oldval;
 
+   for(;;) {
+   if (!cpu_is_idle())
+   break;
+   __sti();
+   rep_nop();
+   }
__sti();
-
-   /*
-* Deal with another CPU just having chosen a thread to
-* run here:
-*/
-   oldval = xchg(&current->need_resched, -1);
-
-   if (!oldval)
-   asm volatile(
-   "2:"
-   "cmpl $-1, %0;"
-   "rep; nop;"
-   "je 2b;"
-   : :"m" (current->need_resched));
 }
 
 /*
--- 2.4/drivers/acpi/cpu.c  Sat Apr  7 22:02:01 2001
+++ build-2.4/drivers/acpi/cpu.cSat Apr  7 23:55:17 2001
@@ -148,7 +148,7 @@
unsigned long diff;

__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
if (acpi_bm_activity())
goto sleep2;
@@ -171,7 +171,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
if (acpi_bm_activity())
goto sleep2;
@@ -205,7 +205,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
 
time = acpi_read_pm_timer();
@@ -235,7 +235,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
time = acpi_read_pm_timer();
acpi_c1_count++;



rep nop question

2001-04-08 Thread Manfred Spraul

The Intel documentation recommends that spinlocks should use

loop:
rep;nop;
cmp $0,lock_var
jne loop

ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_spinlock.pdf

but the linux spinlock implementation uses

loop:
cmp $0, lock_var
rep; nop;
jne loop.

Why?
According to the Intel documentation, rep nop is a predefined delay that
slows down the cpu to the speed of the memory bus. The linux
implementation will delay again _after_ the lock was released.
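
In C, the ordering recommended by the Intel paper (pause first, then test)
would look roughly like this (illustrative ia32-only sketch, not from the
patch):

/* spin while the lock variable is non-zero, pausing before each test */
static inline void spin_wait(volatile int *lock_var)
{
        do {
                __asm__ __volatile__("rep; nop" : : : "memory");
        } while (*lock_var != 0);
}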

What about the attached patch? It also adds 'rep;nop' into the rw
spinlock implementation.

It boots with my Pentium III, but the cpu doesn't support 'rep nop'
(i.e. returns immediately)

--
Manfred

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 3
//  EXTRAVERSION = -ac3
--- 2.4/arch/i386/kernel/semaphore.cSun Nov 19 02:31:25 2000
+++ build-2.4/arch/i386/kernel/semaphore.c  Sat Apr  7 21:26:35 2001
@@ -430,7 +430,8 @@
 .globl __write_lock_failed
 __write_lock_failed:
" LOCK "addl$" RW_LOCK_BIAS_STR ",(%eax)
-1: cmpl$" RW_LOCK_BIAS_STR ",(%eax)
+1: rep; nop
+   cmpl$" RW_LOCK_BIAS_STR ",(%eax)
jne 1b
 
" LOCK "subl$" RW_LOCK_BIAS_STR ",(%eax)
@@ -442,7 +443,8 @@
 .globl __read_lock_failed
 __read_lock_failed:
lock ; incl (%eax)
-1: cmpl$1,(%eax)
+1: rep; nop
+   cmpl$1,(%eax)
js  1b
 
lock ; decl (%eax)
--- 2.4/include/asm-i386/spinlock.h Sat Apr  7 22:02:23 2001
+++ build-2.4/include/asm-i386/spinlock.h   Sat Apr  7 22:01:43 2001
@@ -58,10 +58,10 @@
"js 2f\n" \
".section .text.lock,\"ax\"\n" \
"2:\t" \
-   "cmpb $0,%0\n\t" \
"rep;nop\n\t" \
-   "jle 2b\n\t" \
-   "jmp 1b\n" \
+   "cmpb $0,%0\n\t" \
+   "jg 1b\n\t" \
+   "jmp 2b\n" \
".previous"
 
 /*



Re: P-III Oddity.

2001-04-08 Thread Manfred Spraul

Just a guess:

Perhaps one bios is older and contains an older microcode patch?

Have you tried the microcode driver?

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: softirq buggy [Re: Serial port latency]

2001-04-08 Thread Manfred Spraul

From: <[EMAIL PROTECTED]>
> Hello!
>
> > + if (softirq_active(smp_processor_id()) &
softirq_mask(smp_processor_id())) {
> > + do_softirq();
> > + return 0;
>
> BTW you may delete do_softirq()... schedule() will call this.
>

But with a huge overhead. I'd prefer to call it directly from within the
idle functions, the overhead of schedule is IMHO too high.

>
> > + *
> > + * Isn't this identical to default_idle with the 'no-hlt' boot
> > + * option? <[EMAIL PROTECTED]>
>
> Seeems, it is not. need_resched=-1 avoids useless IPIs.
>
I already wondered why need_resched is set to -1 ;-)

I'll remove that comment and repost the patch.

--
Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: softirq buggy [Re: Serial port latency]

2001-04-08 Thread Manfred Spraul

From: <[EMAIL PROTECTED]>
To: "Manfred Spraul" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Sunday, April 08, 2001 7:58 PM
Subject: Re: softirq buggy [Re: Serial port latency]


> Hello!
>
> > But with a huge overhead. I'd prefer to call it directly from within
the
> > idle functions, the overhead of schedule is IMHO too high.
>
>
> + if (current->need_resched) {
> + return 0;
> 
> + }
> + if (softirq_active(smp_processor_id()) &
softirq_mask(smp_processor_id())) {
> + do_softirq();
> + return 0;
> ^
> You return one value in both cases and I decided it means "schedule".
8)
> Apparently you meaned return 1 in the first case. 8)
>
No, the code is correct. 0 means "don't stop the cpu".
The pm_idle function pointer will return to cpu_idle()
(arch/i386/kernel/process.c), and that function contains another

while(!current->need_resched)
idle();

loop ;-)

> But in this case it becomes wrong. do_softirq() can raise need_reshed
> and moreover irqs arrive during it. Order of check should be
different.
>
Yes, I'll correct that.

>
> BTW what's about overhead... I suspect it is _lower_ in the case
> of schedule(). In the case of networking at least, when softirq
> most likely wakes some socket.
>
I'm not sure - what if the computer is just a router?
But OTOH: the cpu is idle, so it doesn't matter at all if the idle cpu
spends its time within schedule() or within safe_halt(), I'll change my
patch.

I have another question:
I added cpu_is_idle() into <linux/interrupt.h>. Is that acceptable, or
is there a better header file for such a function?

--
Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] Re: softirq buggy

2001-04-08 Thread Manfred Spraul

I've attached a new patch:

* cpu_is_idle() moved to 
* function uninlined due to header dependencies
* cpu_is_idle() doesn't call do_softirq directly, instead the caller
returns to schedule()
* cpu_is_idle() exported for modules.
* docu updated.

I'd prefer to inline cpu_is_idle(), but optimizing the idle code path is
probably not that important ;-)

--
Manfred

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 3
//  EXTRAVERSION = -ac3
--- 2.4/include/linux/pm.h  Thu Jan  4 23:50:47 2001
+++ build-2.4/include/linux/pm.hSun Apr  8 21:02:02 2001
@@ -186,6 +186,8 @@
 extern void (*pm_idle)(void);
 extern void (*pm_power_off)(void);
 
+int cpu_is_idle(void);
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_PM_H */
--- 2.4/kernel/sched.c  Sat Apr  7 22:02:27 2001
+++ build-2.4/kernel/sched.cSun Apr  8 21:02:37 2001
@@ -1242,6 +1242,28 @@
sched_data->last_schedule = get_cycles();
 }
 
+/**
+ * cpu_is_idle - helper function for idle functions
+ * 
+ * pm_idle functions must call this function to verify that
+ * the cpu is really idle.
+ * The return value is valid until local interrupts are
+ * reenabled.
+ * Return values:
+ * 1: go into power saving mode.
+ * 0: cpu is not idle, return to schedule()
+ */
+int cpu_is_idle(void)
+{
+   if (current->need_resched)
+   return 0;
+
+   if (softirq_active(smp_processor_id()) & softirq_mask(smp_processor_id()))
+   return 0;
+
+   return 1;
+}
+
 extern void init_timervecs (void);
 
 void __init sched_init(void)
--- 2.4/kernel/ksyms.c  Sat Apr  7 22:02:27 2001
+++ build-2.4/kernel/ksyms.cSun Apr  8 21:42:48 2001
@@ -440,6 +440,7 @@
 #endif
 EXPORT_SYMBOL(kstat);
 EXPORT_SYMBOL(nr_running);
+EXPORT_SYMBOL(cpu_is_idle);
 
 /* misc */
 EXPORT_SYMBOL(panic);
--- 2.4/arch/i386/kernel/process.c  Thu Feb 22 22:28:52 2001
+++ build-2.4/arch/i386/kernel/process.cSun Apr  8 21:25:23 2001
@@ -81,7 +81,7 @@
 {
if (current_cpu_data.hlt_works_ok && !hlt_counter) {
__cli();
-   if (!current->need_resched)
+   if (cpu_is_idle())
safe_halt();
else
__sti();
@@ -105,13 +105,10 @@
 */
oldval = xchg(&current->need_resched, -1);
 
-   if (!oldval)
-   asm volatile(
-   "2:"
-   "cmpl $-1, %0;"
-   "rep; nop;"
-   "je 2b;"
-   : :"m" (current->need_resched));
+   if (!oldval) {
+   while(cpu_is_idle())
+   rep_nop();
+   }
 }
 
 /*
@@ -131,7 +128,7 @@
void (*idle)(void) = pm_idle;
if (!idle)
idle = default_idle;
-   while (!current->need_resched)
+   while (cpu_is_idle())
idle();
schedule();
check_pgt_cache();
--- 2.4/drivers/acpi/cpu.c  Sat Apr  7 22:02:01 2001
+++ build-2.4/drivers/acpi/cpu.cSat Apr  7 23:55:17 2001
@@ -148,7 +148,7 @@
unsigned long diff;

__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
if (acpi_bm_activity())
goto sleep2;
@@ -171,7 +171,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
if (acpi_bm_activity())
goto sleep2;
@@ -205,7 +205,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
 
time = acpi_read_pm_timer();
@@ -235,7 +235,7 @@
unsigned long diff;
 
__cli();
-   if (current->need_resched)
+   if (!cpu_is_idle())
goto out;
time = acpi_read_pm_timer();
acpi_c1_count++;




cpu binding bug?

2001-04-08 Thread Manfred Spraul

Afaics cpu bindings could allow a thread to run with an "unlimited"
timeslice.

cpu0: one thread running with 'nice==19'.
NICE_TO_TICKS==1.

cpu1: lots of other threads with 'nice==0' (NICE_TO_TICKS==6)

cpu0 will enter schedule() every tick.
can_schedule() returns '0' for all threads bound to cpu1, thus
'recalculate' will be called every tick - that's faster than the threads
on cpu1 can use up their slice.
The currently running thread will run forever.

Is that problem fixed in the patches that allow user space to write
cpus_allowed?

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] Re: softirq buggy

2001-04-09 Thread Manfred Spraul

Andrea Arcangeli wrote:
> 
> your cpu_is_idle will return 0 in the need_resched != 0 check even if the cpu
> is idle (because of the -1 trick for avoiding the SMP-IPI to notify the cpu).
>
Fixed.

> The issue you are addressing is quite londstanding and it is not only related
> to the loop with an idle cpu.
> 
> This is the way I prefer to fix it:
> 
> 
>ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.4pre1/ksoftirqd-1
>
The return path to user space checks for pending softirqs. A delay of
1/HZ is only possible if the cpu loops in kernel space without returning
to user space - and the functions that can loop check
'current->need_resched'. That means that either cpu_is_idle() must be
renamed to schedule_required() and all 'need_resched' users should use
that function, or something like your patch.

Is a full thread really necessary? Just setting 'need_resched' should be
enough, schedule() checks for pending softirqs.
And do you have a rough idea how often that new thread is scheduled
under load?

Btw, you don't schedule the ksoftirqd thread if do_softirq() returns
from the 'if(in_interrupt())' check.
I assume that this is the most common case of delayed softirq
processing:

; in process context
spin_lock_bh();
; hw interrupt arrives
; do_softirq returns immediately
spin_unlock_bh();


--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



RE: skb allocation problems (More Brain damage!)

2001-04-11 Thread Manfred Spraul

> it is very hard to imagine the scenario which can lead to this...
> I will try your suggestion..

Perhaps a problem with the csum assembler implementations? Which cpu
type do you optimize for, and which cpu is installed?

Btw, are you overclocking anything?

--
Manfred




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: No one wants to help me :-(

2001-04-13 Thread Manfred Spraul

> I was expecting to receive some replies to my last desperate messages:
>
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg35446.html
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg36591.html
>

If you don't receive an answer, then it either indicates that the bug is
too obvious or too difficult ;-)

>
>  Maybe someone can see which is the real bug and fix it.
> Please help!
>
Has debug_timer_list() caught a few corrupted lists, could you post the
results?

--
Manfred
P.S.: Outlook decided to ignore the 'send later', and I couldn't stop
the empty message. Sorry.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: No one wants to help me :-(

2001-04-13 Thread Manfred Spraul

> I was expecting to receive some replies to my last desperate messages:

>
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg35446.html
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg36591.html
>

> Maybe someone can see which is the real bug and fix it.

> Please help!



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 8139too: defunct threads

2001-04-14 Thread Manfred Spraul

>> Ah. Of course. All (or most) kernel initialisation is
>> done by PID 1. Search for "kernel_thread" in init/main.c
>>
>> So it seems that in your setup, process 1 is not reaping
>> children, which is why this hasn't been reported before.
>> Is there something unusual about your setup?

> I found the difference which causes this. If I build my kernel with
> IP_PNP (IP: kernel level autoconfiguration) support I get a defunt
> thread for each 8139too device. If I don't build with IP_PNP
> support I don't get any, defunct ethernet threads.

Does init(8) reap children that died before it was spawned? I assume
that the defunct tasks were there _before_ init was spawned.

Perhaps init() [in linux/init/main.c] should reap all defunct tasks
before the execve("/sbin/init").

I've attached an untested patch, could you try it?

--
Manfred


 patch-main.dat


[PATCH] Re: 8139too: defunct threads

2001-04-14 Thread Manfred Spraul

Hi Alan,

Rod's init version (from RH 7.0) doesn't reap children that died before
it was started. Is that an init bug or should the kernel reap them
before the execve?
The attached patch reaps all zombies before the execve("/sbin/init").

I also found a bug in kernel/context.c: it doesn't acquire the sigmask
spinlock around the call to recalc_sigpending.

Rod Stewart wrote:
> 
> Yes, that fixes my problem.  No more defunct eth? processes when IP_PNP is
> compiled in.  With the fix you said to the patch; replacing curtask with
> current.
>
Fortunately you don't use SMP - spin_lock_irq();...;spin_lock_irq()
instead of spin_lock_irq();...;spin_unlock_irq();

--
Manfred

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 3
//  EXTRAVERSION = -ac3
--- 2.4/init/main.c Sat Apr  7 22:02:27 2001
+++ build-2.4/init/main.c   Sat Apr 14 19:18:34 2001
@@ -883,6 +883,13 @@
 
(void) dup(0);
(void) dup(0);
+
+   while (waitpid(-1, (unsigned int *)0, __WALL|WNOHANG) > 0)
+   ;
+   spin_lock_irq(&current->sigmask_lock);
+   flush_signals(current);
+   recalc_sigpending(current);
+   spin_unlock_irq(&current->sigmask_lock);

/*
 * We try each of these until one succeeds.
--- 2.4/kernel/context.cFri Feb  2 15:20:37 2001
+++ build-2.4/kernel/context.c  Sat Apr 14 19:09:10 2001
@@ -101,8 +101,10 @@
if (signal_pending(curtask)) {
while (waitpid(-1, (unsigned int *)0, __WALL|WNOHANG) > 0)
;
+   spin_lock_irq(&curtask->sigmask_lock);
flush_signals(curtask);
recalc_sigpending(curtask);
+   spin_unlock_irq(&curtask->sigmask_lock);
}
}
 }





Re: [PATCH] Re: 8139too: defunct threads

2001-04-14 Thread Manfred Spraul

From: "Alan Cox" <[EMAIL PROTECTED]>
>
> That has an implicit race, a zombie can always appear as we are
execing init.
> I think init wants fixing
>
Rod, could you boot again with the unpatched kernel and send a SIGCHLD
to init?

#kill -CHLD 1

If I understand the init code correctly the SIGCHLD handler reaps all
zombies, but probably the signal got lost because the children died
before the parent was created ;-)

--
Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[new PATCH] Re: 8139too: defunct threads

2001-04-15 Thread Manfred Spraul

I found the problem:

* init uses waitpid(-1,,), thus the __WALL flag is not set
* without __WALL, only processes with exit_signal == SIGCHLD are reaped
* it's impossible for user space processes to die with another
exit_signal, forget_original_parent changes the exit_signal back to
SIGCHLD ("We dont want people slaying init"), and init itself doesn't
use clone.
* kernel threads can die with an arbitrary exit_signal.

Alan, which fix would you prefer:
* init could use wait3 and set __WALL.
* all kernel thread users could set SIGCHLD. Some already do that
(__call_usermodehelper).
* the kernel_thread implementations could force the exit signal to
SIGCHLD.

I'd prefer the last version. 
The attached patch is tested with i386. The alpha, parisc and ppc
assembler changes are guessed.
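
For reference, the difference is easy to demonstrate from user space
(hypothetical test program, assumes a glibc with a clone() wrapper; the child
uses SIGUSR1 as exit signal, like the kernel threads discussed above):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef __WALL
#define __WALL  0x40000000      /* wait for all children, Linux specific */
#endif

static char child_stack[16384];

static int child_fn(void *arg)
{
        return 0;               /* exit immediately, become a zombie */
}

int main(void)
{
        pid_t pid;

        signal(SIGUSR1, SIG_IGN);       /* don't get killed by the exit signal */

        /* child terminates with SIGUSR1 instead of SIGCHLD */
        pid = clone(child_fn, child_stack + sizeof(child_stack), SIGUSR1, NULL);
        if (pid == -1) {
                perror("clone");
                return 1;
        }
        sleep(1);

        /* plain waitpid(-1,...) does not see the zombie ... */
        printf("waitpid(-1, 0):      %d\n", (int)waitpid(-1, NULL, WNOHANG));
        /* ... but with __WALL it is reaped */
        printf("waitpid(-1, __WALL): %d\n", (int)waitpid(-1, NULL, __WALL | WNOHANG));
        return 0;
}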

--
Manfred

diff -ur 2.4/arch/alpha/kernel/entry.S build-2.4/arch/alpha/kernel/entry.S
--- 2.4/arch/alpha/kernel/entry.S   Sun Sep  3 20:36:45 2000
+++ build-2.4/arch/alpha/kernel/entry.S Sun Apr 15 14:58:01 2001
@@ -242,12 +242,12 @@
subq$30,4*8,$30
stq $10,16($30)
stq $9,8($30)
-   lda $0,CLONE_VM
+   lda $0,CLONE_VM|SIGCHLD
stq $26,0($30)
.prologue 1
mov $16,$9  /* save fn */   
mov $17,$10 /* save arg */
-   or  $18,$0,$16  /* shuffle flags to front; add CLONE_VM.  */
+   or  $18,$0,$16  /* shuffle flags to front; add CLONE_VM|SIGCHLD. */
bsr $26,kernel_clone
bne $20,1f  /* $20 is non-zero in child */
ldq $26,0($30)
diff -ur 2.4/arch/arm/kernel/process.c build-2.4/arch/arm/kernel/process.c
--- 2.4/arch/arm/kernel/process.c   Thu Feb 22 22:28:51 2001
+++ build-2.4/arch/arm/kernel/process.c Sun Apr 15 14:51:08 2001
@@ -368,6 +368,8 @@
 {
pid_t __ret;
 
+   flags |= SIGCHLD;
+
__asm__ __volatile__(
"orrr0, %1, %2  @ kernel_thread sys_clone
mov r1, #0
diff -ur 2.4/arch/cris/kernel/process.c build-2.4/arch/cris/kernel/process.c
--- 2.4/arch/cris/kernel/process.c  Sat Apr  7 22:01:49 2001
+++ build-2.4/arch/cris/kernel/process.cSun Apr 15 14:51:16 2001
@@ -127,6 +127,8 @@
 int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
 {
register long __a __asm__ ("r10");
+
+   flags |= SIGCHLD;

__asm__ __volatile__
("movu.w %1,r9\n\t" /* r9 contains syscall number, to sys_clone */
diff -ur 2.4/arch/i386/kernel/process.c build-2.4/arch/i386/kernel/process.c
--- 2.4/arch/i386/kernel/process.c  Thu Feb 22 22:28:52 2001
+++ build-2.4/arch/i386/kernel/process.cSun Apr 15 14:40:43 2001
@@ -440,6 +440,8 @@
 {
long retval, d0;
 
+   flags |= SIGCHLD;
+
__asm__ __volatile__(
"movl %%esp,%%esi\n\t"
"int $0x80\n\t" /* Linux/i386 system call */
diff -ur 2.4/arch/ia64/kernel/process.c build-2.4/arch/ia64/kernel/process.c
--- 2.4/arch/ia64/kernel/process.c  Thu Jan  4 21:50:17 2001
+++ build-2.4/arch/ia64/kernel/process.cSun Apr 15 14:51:44 2001
@@ -500,7 +500,7 @@
struct task_struct *parent = current;
int result, tid;
 
-   tid = clone(flags | CLONE_VM, 0);
+   tid = clone(flags | CLONE_VM | SIGCHLD, 0);
if (parent != current) {
result = (*fn)(arg);
_exit(result);
diff -ur 2.4/arch/m68k/kernel/process.c build-2.4/arch/m68k/kernel/process.c
--- 2.4/arch/m68k/kernel/process.c  Thu Feb 22 22:28:54 2001
+++ build-2.4/arch/m68k/kernel/process.cSun Apr 15 14:51:58 2001
@@ -135,7 +135,7 @@
 
{
register long retval __asm__ ("d0");
-   register long clone_arg __asm__ ("d1") = flags | CLONE_VM;
+   register long clone_arg __asm__ ("d1") = flags | CLONE_VM | SIGCHLD;
 
__asm__ __volatile__
  ("clrl %%d2\n\t"
diff -ur 2.4/arch/mips/kernel/process.c build-2.4/arch/mips/kernel/process.c
--- 2.4/arch/mips/kernel/process.c  Sat Apr  7 22:01:56 2001
+++ build-2.4/arch/mips/kernel/process.cSun Apr 15 14:52:12 2001
@@ -161,6 +161,8 @@
 {
long retval;
 
+   flags |= SIGCHLD;
+
__asm__ __volatile__(
".set\tnoreorder\n\t"
"move\t$6,$sp\n\t"
diff -ur 2.4/arch/mips64/kernel/process.c build-2.4/arch/mips64/kernel/process.c
--- 2.4/arch/mips64/kernel/process.cThu Feb 22 22:28:55 2001
+++ build-2.4/arch/mips64/kernel/process.c  Sun Apr 15 14:52:21 2001
@@ -154,6 +154,8 @@
 {
int retval;
 
+   flags |= SIGCHLD;
+
__asm__ __volatile__(
"move\t$6, $sp\n\t"
"move\t$4, %5\n\t"
diff -ur 2.4/arch/parisc/kernel/entry.S build-2.4/arch/parisc/kernel/entry.S
--- 2.4/arch/parisc/kernel/entry.S  Sat Apr  7 22:01:58 2001
+++ build-2.4/arch/parisc/kernel/entry.SSun Apr 15 14:56:58 2001
@@ -497,7 +497,7 @@
 #endif
STREG   %r26, PT_GR26



Re: fsck, raid reconstruction

2001-04-16 Thread Manfred Spraul

The first 2 problems aren't real problems (modify /etc/fstab; perhaps a
special ioctl could be added to raid so that fsck can stop the
reconstruction) - at most annoying, but clearly no bugs.

But the third one could be a bug:
> 
> Third problem: 
> 
> I just tried boot 2.4.3 today. (after an unclean shutdown) fsck runs 
> at a crawl on my RAID-1 volume. It would take all day (!! literally) 
> to fsck. The disk-drive activity light flashes about once a second, 
> maybe once every two seconds. (with a corresponding click from the 
> drive). 

Can you boot without the raid-1 volume?
Run top/vmstat during the fsck+reconstruction - I assume the system runs
out of memory and bdflush/kswapd are looping.

--  
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: filp_open() in 2.2.19 causes memory corruption

2001-04-23 Thread Manfred Spraul

Are you sure the trace is decoded correctly?

> CPU:0 
> EIP:0010:[sys_mremap+31/884] 
> EFLAGS: 00010206

> Code: ac ae 75 08 84 c0 75 f8 31 c0 eb 04 19 c0 0c 01 85 c0 75 d9
ac ae is
lodsb
scasb

Could you run
#objdump --disassemble-all --reloc linux/mm/mremap.o | less

and check that the code is really at offset 31 of sys_mremap?

And is it correct that only 64 MB memory is installed/enabled?

--
Manfred


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-23 Thread Manfred Spraul

> Well looking a little more closely than I did last night it looks like
> access_process_vm (called from ptrace) can cause what amounts to a
> page fault at pretty arbitrary times.

It's also used for several /proc/ files.

I remember that I got crashes with concurrent exec+cat
/proc/<pid>/cmdline until down(mmap_sem) was added into
setup_arg_pages().

> I'm actually a little curious what the big kernel lock in ptrace buys
> us. I suspect it could be a performance issue with user mode linux.
> Where you have multiple processes being ptraced at the same time.

I checked it a few months ago, and the lock is only required to prevent
multiple concurrent attaches to the same process and concurrent
ptrace/suid exec (in fs/exec.c, compute_creds).

--
Manfred





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] cache directory position in dcache for ext2

2001-02-13 Thread Manfred Spraul

ext2_find_entry is quite expensive for directories with lots of entries
- what about caching the block and offset in the dcache?

Each dentry contains 2 values reserved for the filesystem:
dentry->d_fsdata
dentry->d_time

ext2 doesn't use them - yet.

I've written a small patch that caches the directory block & offset into
these variables, and one test (lots of create, remove, rename in one
directory) is 20% faster than the unpatched 2.4.1 kernel.

No disk format change.

The patch is attached, it survives dbench, kernel compiles and other
stress tests.
But that doesn't mean that it won't eat your filesystem ;-)
--
Manfred

diff -u 2.4/fs/ext2/namei.c build-2.4/fs/ext2/namei.c
--- 2.4/fs/ext2/namei.c Sat Dec  9 02:35:54 2000
+++ build-2.4/fs/ext2/namei.c   Tue Feb 13 14:49:46 2001
@@ -58,7 +58,7 @@
  * entry - you'll have to do that yourself if you want to.
  */
 static struct buffer_head * ext2_find_entry (struct inode * dir,
-const char * const name, int namelen,
+struct dentry * dentry,
 struct ext2_dir_entry_2 ** res_dir)
 {
struct super_block * sb;
@@ -70,7 +70,49 @@
*res_dir = NULL;
sb = dir->i_sb;
 
-   if (namelen > EXT2_NAME_LEN)
+   if(dentry->d_fsdata != NULL) {
+   struct buffer_head * bh;
+   struct ext2_dir_entry_2 * de;
+   /*
+* cached data, use it.
+*
+*/
+   block = (unsigned int)dentry->d_fsdata;
+   bh = getblk(sb->s_dev, block, sb->s_blocksize);
+   if (!bh) {
+   /* out of memory? */
+   return NULL;
+   }
+   if (!buffer_uptodate(bh)) {
+   ll_rw_block (READ, 1, &bh);
+   }
+   wait_on_buffer (bh);
+   if (!buffer_uptodate(bh)) {
+   /*
+* read error: all bets are off
+*/
+   brelse(bh);
+   return NULL;
+   }
+   de = (struct ext2_dir_entry_2 *) (bh->b_data+dentry->d_time);
+   if(!ext2_match (dentry->d_name.len, dentry->d_name.name, de)) {
+   ext2_error (sb, "ext2_find_entry",
+   "bad cached directory pos");
+   brelse(bh);
+   return NULL;
+   }
+   /* found a match -
+  just to be sure, do a full check */
+   if (!ext2_check_dir_entry("ext2_find_entry",
+ dir, de, bh, dentry->d_time)) {
+   brelse(bh);
+   return NULL;
+   }
+   *res_dir = de;
+   return bh;
+   }
+
+   if (dentry->d_name.len > EXT2_NAME_LEN)
return NULL;
 
memset (bh_use, 0, sizeof (bh_use));
@@ -120,8 +162,8 @@
/* do minimal checking `by hand' */
int de_len;
 
-   if ((char *) de + namelen <= dlimit &&
-   ext2_match (namelen, name, de)) {
+   if ((char *) de + dentry->d_name.len <= dlimit &&
+   ext2_match (dentry->d_name.len, dentry->d_name.name, de)) {
/* found a match -
   just to be sure, do a full check */
if (!ext2_check_dir_entry("ext2_find_entry",
@@ -131,6 +173,8 @@
if (bh_use[i] != bh)
brelse (bh_use[i]);
}
+   dentry->d_fsdata = (void*)bh->b_blocknr;
+   dentry->d_time = ((long)de)-((long)bh->b_data);
*res_dir = de;
return bh;
}
@@ -169,7 +213,7 @@
if (dentry->d_name.len > EXT2_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);
 
-   bh = ext2_find_entry (dir, dentry->d_name.name, dentry->d_name.len, &de);
+   bh = ext2_find_entry (dir, dentry, &de);
inode = NULL;
if (bh) {
unsigned long ino = le32_to_cpu(de->inode);
@@ -206,7 +250,7 @@
  *
  * adds a file entry to the specified directory.
  */
-int ext2_add_entry (struct inode * dir, const char * name, int namelen,
+static int ext2_add_entry (struct inode * dir, struct dentry *dentry,
struct inode *inode)
 {
unsigned long offset;
@@ -218,12 +262,12 @@
 
sb = dir->i_sb;
 
-   if (!namelen)
+   if (!dentry->d_name.len)
return -EINVAL;
bh = ext2_bread (dir, 0, 0, &retval);
if (!bh)
 

Re: Selects on dirs/files.

2001-02-13 Thread Manfred Spraul

"N. Jason Kleinbub" wrote:
> 
> People,
> 
> Not sure where to go from here but 
> 
> ( Yes I have read the FAQ )
> 
> For practical reasons, I have created some modification to the
> Linux kernel.  These changes were to make our implementation of
> software more convenient (elegant).  Essentially, I have modified the
> select() calls to allow the non-trivial use of directories as an 'fd'.
>
Have you checked the F_NOTIFY fcntl()?

If yes, what's the difference between your patch and F_NOTIFY?
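
For reference, a rough sketch of how F_NOTIFY is used (dnotify,
SIGIO-based; watch_dir is just an illustrative name, not part of any
existing interface):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int watch_dir(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	/* deliver SIGIO when an entry is created, deleted or renamed;
	 * DN_MULTISHOT keeps the notification armed after the first event */
	if (fcntl(fd, F_NOTIFY,
		  DN_CREATE | DN_DELETE | DN_RENAME | DN_MULTISHOT) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}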

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: BUG in sched.c, Kernel 2.4.1?

2001-02-13 Thread Manfred Spraul

Martin Rode wrote:
> 
> >
> > Run this oops message through ksymoops please.  It will make debugging
> > it alot easier.
> >
> >
> 
> Since I did not compile the kernel myself, ksymoops is not too happy with
> what is has to analyse the dump. I tried compile the Mandrake kernel myself
> but there seems to be something unmatched. See below for what ksymoops
> gives me.
>
looks good.

> Warning (compare_maps): mismatch on symbol vt_cons  , ksyms_base says
> c02b06e0, vmlinux says c02ac6e0.  Ignoring ksyms_base entry
> 
> (I get about > 300 msgs of that kind)
> 
> Let me know who I can prepare for the next crash with my own kernel. Are
> there any options I have to turn on for compiling?
> 
> kernel BUG at sched.c:714!
> invalid operand: 
> CPU: 0
> EIP: 0010:[]
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010282
> eax: 001b ebx  ecx df4f6000 edx 0001
> esi: 001cffe3 edi db5eede0 ebp dc0e9f40 esp dc0e9ef0
> stack: c01f26f3 c01f2856 02ca db5eed80 dc0e8000 db5eede0 dc0e9f18
> dc0e8000 33ba   00e7 001c 001c
> fff3 dc0e8000 0800  dc0e8000 dc0e9f68 c0139c44
> d488bf80 

esp is quite high, only 0x110 bytes of the stack are used
(the 8 kB task stack ends at dc0ea000, and dc0ea000 - dc0e9ef0 = 0x110).

> call trace: [] [] [] []
^
> code: 0f 0b 8d 65 bc 5b 5e 5f 89 ec 5d c3 8d 76 00 55 89 e5 83 ec
> 
> >>EIP; c0113781<=
> Trace; cc0139c44 
 ^

did you manually copy the oops from the screen?
that value should be c0139c44 

> Trace; c0139d1c 
> Trace; c0130af6 
> Trace; c0108e93 
>
> [snip]
> 
> Kernel panic: Aiee, killing interrupt handler!
>
I don't see that interrupt handler!
it seems to be a normal syscall, just a pipe read that blocks because
the pipe is empty.

Is that the first oops, or was there another oops before this one?

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups

2001-02-13 Thread Manfred Spraul

"Maciej W. Rozycki" wrote:
> 
> Hi,
> 
>  After performing various tests I came to the following workaround for
> APIC lockups which people observe under IRQ load, mostly for networking
> stuff.  I believe the test should work in all cases as it basically
> implements a manual replacement for EOI messages.  In my simulated
> environment I was unable to get a lockup with the code in place, even
> though I was getting about every other level-triggered IRQ misdelivered.
> 
>  Please test it extensively, as much as you can, before I submit it for
> inclusion.  If you ever get "Aieee!!!  Remote IRR still set after unlock!"
> message, please report it to me immediately -- it means the code failed.
>
No messages.

> There is also an additional debugging/statistics counter provided in
> /proc/cpuinfo that counts interrupts which got delivered with its trigger
> mode mismatched.  Check it out to find if you get any misdelivered
> interrupts at all.
> 
I'm running my default webserver load test, and I get ~40 /second, 92735
total.

bw_tcp says 1.13 MB/sec, that's wire speed.

tcpdump | grep 'sack ' doesn't show unusually many lost packets.

Looks promising.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: block ioctl to read/write last sector

2001-02-13 Thread Manfred Spraul

Michael E Brown wrote:
> 
> >
> > Anyway, an ioctl just to read the last sector is too silly.
> > An ioctl to change the blocksize is more reasonable.
> 
> That may be better, I don't know. That's why this is an RFC. Are there any
> possible races with that method? It seems to me that you might adversely
> affect io in progress by changing the blocksize. The method demonstrated
> in this patch shouldn't do that.
>
block size changing is dangerous:
if you change the blocksize of a mounted partition you'll disrupt the
filesystem.
some kernels crash hard if you execute

swapon /dev/

swapon won't overwrite your root fs: it changes the blocksize to 4kB and
then notices that there is no swap signature.
But the blocksize change is fatal.

> > And I expect that this fixed blocksize will go soon.
>
that's 2.5.

> That may be, I don't know that much about the block layer. All I know is
> that, with the current structure, I cannot read or write to sectors where
> (sector #) > total-disk-blocks - (total-disk-blocks /
> (softblocksize/hardblocksize))
>
I have one additional user space only idea:
have you tried raw-io? bind a raw device to the partition, IIRC raw-io
is always in 512 byte units.

Probably an ioctl is the better idea, but I'd use absolute sector
numbers (not relative to the end), and obviously 64-bit sector numbers -
2 TB isn't that far away.
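
Just to make the idea concrete, one possible shape for such an ioctl
(purely hypothetical - neither the struct nor the command number exists
in the kernel; 0x12 is only reused because it is the block ioctl magic):

#include <linux/types.h>
#include <linux/ioctl.h>

struct blkdev_sector_io {
	__u64	start_sector;	/* absolute sector number, 512 byte units */
	__u32	nr_sectors;	/* number of sectors to transfer */
	__u32	write;		/* 0 = read, 1 = write */
	void	*buf;		/* user buffer, nr_sectors * 512 bytes */
};

#define BLKSECTIO	_IOWR(0x12, 125, struct blkdev_sector_io)	/* made-up number */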

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] network driver updates

2001-02-14 Thread Manfred Spraul

I found 2 bugs in several network drivers:

* dev->mem_start: NULL means "no command line configuration", 0xffffffff
means "default".
Several drivers only check for NULL, not for 0xffffffff.

* incorrect bounds checks for phy_idx: 2 entries in the structure, but
up to 4 are initialized.

* something is wrong in the vortex initialization: I don't have such a
card, but the driver didn't return an error message on insmod. I'm not
sure if
my fix is correct.

Patch attached, against 2.4.1-ac12.

--
Manfred

diff -ur 2.4/drivers/net/3c59x.c build-2.4/drivers/net/3c59x.c
--- 2.4/drivers/net/3c59x.c Wed Feb 14 10:50:50 2001
+++ build-2.4/drivers/net/3c59x.c   Wed Feb 14 12:32:56 2001
@@ -967,7 +967,7 @@
pdev->driver_data = dev;
 
/* The lower four bits are the media type. */
-   if (dev->mem_start) {
+   if (dev->mem_start && dev->mem_start != 0xffffffff) {
/*
 * AKPM: ewww..  The 'options' param is passed in as the third arg to the
 * LILO 'ether=' argument for non-modular use
@@ -2661,9 +2661,12 @@

rc = pci_module_init(&vortex_driver);
if (rc < 0) {
-   rc = vortex_eisa_init();
-   if (rc > 0)
+   int rc2;
+   rc2 = vortex_eisa_init();
+   if (rc2 > 0) {
vortex_have_eisa = 1;
+   rc = 0;
+   }
} else {
vortex_have_pci = 1;
}
diff -ur 2.4/drivers/net/eepro100.c build-2.4/drivers/net/eepro100.c
--- 2.4/drivers/net/eepro100.c  Wed Feb 14 10:50:54 2001
+++ build-2.4/drivers/net/eepro100.cWed Feb 14 11:23:22 2001
@@ -643,7 +643,7 @@
return -1;
}
 
-   if (dev->mem_start > 0)
+   if (dev->mem_start && dev->mem_start != 0xffffffff)
option = dev->mem_start;
else if (card_idx >= 0  &&  options[card_idx] >= 0)
option = options[card_idx];
diff -ur 2.4/drivers/net/epic100.c build-2.4/drivers/net/epic100.c
--- 2.4/drivers/net/epic100.c   Wed Feb 14 10:50:54 2001
+++ build-2.4/drivers/net/epic100.c Wed Feb 14 11:17:37 2001
@@ -412,7 +412,7 @@
ep->rx_ring = (struct epic_rx_desc *)ring_space;
ep->rx_ring_dma = ring_dma;
 
-   if (dev->mem_start) {
+   if (dev->mem_start && dev->mem_start != 0xffffffff) {
option = dev->mem_start;
duplex = (dev->mem_start & 16) ? 1 : 0;
} else if (card_idx >= 0  &&  card_idx < MAX_UNITS) {
diff -ur 2.4/drivers/net/hamachi.c build-2.4/drivers/net/hamachi.c
--- 2.4/drivers/net/hamachi.c   Wed Feb 14 10:50:54 2001
+++ build-2.4/drivers/net/hamachi.c Wed Feb 14 11:26:54 2001
@@ -477,6 +477,7 @@
 };
 
 #define PRIV_ALIGN   15 /* Required alignment mask */
+#define MII_CNT  4
 struct hamachi_private {
/* Descriptor rings first for alignment.  Tx requires a second descriptor
   for status. */
@@ -503,7 +504,7 @@
/* MII transceiver section. */
int mii_cnt;/* MII device addresses. */
u16 advertising;/* NWay media advertisement */
-   unsigned char phys[2];  /* MII device addresses. */
+   unsigned char phys[MII_CNT];/* MII device addresses, only first one used. */
u_int32_t rx_int_var, tx_int_var;   /* interrupt control variables */
u_int32_t option;   /* Hold on to a copy of the options */
u_int8_t pad[16];   /* Used for 32-byte alignment */
@@ -621,7 +622,7 @@
 
/* Check for options being passed in */
option = card_idx < MAX_UNITS ? options[card_idx] : 0;
-   if (dev->mem_start)
+   if (dev->mem_start && dev->mem_start != 0xffffffff)
option = dev->mem_start;
 
/* If the bus size is misidentified, do the following. */
@@ -705,7 +706,7 @@
 
if (chip_tbl[hmp->chip_id].flags & CanHaveMII) {
int phy, phy_idx = 0;
-   for (phy = 0; phy < 32 && phy_idx < 4; phy++) {
+   for (phy = 0; phy < 32 && phy_idx < MII_CNT; phy++) {
int mii_status = mdio_read(ioaddr, phy, 1);
if (mii_status != 0xffff  &&
mii_status != 0x0000) {
diff -ur 2.4/drivers/net/natsemi.c build-2.4/drivers/net/natsemi.c
--- 2.4/drivers/net/natsemi.c   Wed Feb 14 10:50:56 2001
+++ build-2.4/drivers/net/natsemi.c Wed Feb 14 11:17:52 2001
@@ -446,7 +446,7 @@
np->iosize = iosize;
spin_lock_init(&np->lock);
 
-   if (dev->mem_start)
+   if (dev->mem_start && dev->mem_start != 0xffffffff)
option = dev->mem_start;
 
/* The lower four 

Re: [PATCH] network driver updates

2001-02-14 Thread Manfred Spraul

David Hinds wrote:
> 
> Say the driver is linked into the kernel.  Hot plug drivers should not
> all complain about not finding their hardware.
>

That's handled by pci_module_init(), check :
if CONFIG_HOTPLUG is enabled, then pci_module_init() never returns with
-ENODEV.

Which means that eisa cards will never be probed in a hotplug enabled
kernel.

And loading the current 3c59x.c into a non CONFIG_HOTPLUG non EISA
kernel results in a disconnected driver:
it's _not_ registered as a pci driver, pci_module_init() calls
pci_unregister_driver().

What about the attached patch?

--- 2.4/drivers/net/3c59x.c Wed Feb 14 10:50:50 2001
+++ build-2.4/drivers/net/3c59x.c   Wed Feb 14 18:51:55 2001
@@ -2661,13 +2661,16 @@

rc = pci_module_init(&vortex_driver);
if (rc < 0) {
-   rc = vortex_eisa_init();
-   if (rc > 0)
-   vortex_have_eisa = 1;
+   if (rc != -ENODEV)
+   return rc;
} else {
vortex_have_pci = 1;
}
 
+   if (vortex_eisa_init() > 0) {
+   vortex_have_eisa = 1;
+   rc = 0;
+   }
return rc;
 }
 



Re: 2.4.1-ac13 tulip problems

2001-02-14 Thread Manfred Spraul


Which tulip card do you use?

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [beta patch] SSE copy_page() / clear_page()

2001-02-14 Thread Manfred Spraul

I have another idea for sse, and this one is far safer:

only use sse prefetch, leave the string operations for the actual copy.
The prefetch operations only prefetch, don't touch the sse registers,
thus neither any reentency nor interrupt problems.

I tried the attached hack^H^H^H^Hpatch, and read(fd, buf, 400) from
user space got 7% faster (from 264768842 cycles to 246303748 cycles,
single cpu, noacpi, 'linux -b', fastest time from several thousand
runs).

The reason why this works is simple:

Intel Pentium III and P 4 have hardcoded "fast stringcopy" operations
that invalidate whole cachelines during write (documented in the most
obvious place: multiprocessor management, memory ordering)

The result is a very fast write, but the read is still slow.

--
Manfred

--- 2.4/mm/filemap.cWed Feb 14 10:51:42 2001
+++ build-2.4/mm/filemap.c  Wed Feb 14 22:11:44 2001
@@ -1248,6 +1248,20 @@
size = count;
 
kaddr = kmap(page);
+   if (size > 128) {
+   int i;
+   __asm__ __volatile__(
+   "mov %1, %0\n\t"
+   : "=r" (i)
+   : "r" (kaddr+offset)); /* load tlb entry */
+   for(i=0;i<size;i+=32)
+   __asm__ __volatile__(
+   "prefetchnta (%0)"
+   : : "r" (kaddr+offset+i));
+   }
left = __copy_to_user(desc->buf, kaddr + offset, size);
kunmap(page);




Re: More (other) NIC info/Problem: NIC doesn't work anymore, SIOCIFADDR-errors

2001-02-15 Thread Manfred Spraul

Rob Cermak wrote:
> 
> Anyone who can tell me what's going on here?
>
Perhaps it's the 'dev->mem_start==~0' bug I found yesterday?

Could you go into line 450 of 3c509.c and replace

- dev->if_port = (dev->mem_start & 0x1f) ?dev->mem_start & 3: if_port;
+ printk(KERN_WARNING "%s: mem_start is %lxh.\n", dev->name, dev->mem_start);
+ dev->if_port = if_port;

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Is this the ultimate stack-smash fix?

2001-02-15 Thread Manfred Spraul

"Eric W. Biederman" wrote:
> 
> But the gcc bounds checking work is the ultimate buffer overflow fix.
> You can recompile all of your trusted applications, and libraries with
> it and be safe from one source of bugs.
>

int main(int argc, char *argv[])
{
char local[128];
if(argc > 2)
strcpy(local,argv[1]);
}

Unless you modify the ABI and pass the array bounds around you won't
catch such problems, and I won't even mention unions and

struct dyn_data {
int len;
char data[];
}

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 2.4.1-ac14 tulip woes

2001-02-15 Thread Manfred Spraul

Nathan Walp wrote:
> 
> The fix in ac14 for the ac13 patch that killed the tulip driver doesn't
> quite work either:
>

I need more details:
does it time out immediately (after a few seconds), or after a few
minutes?

Which network speed do you use? 100MBit half duplex?

Could you please run the tulip-diag program I sent you yesterday?
#tulip-diag -mm -aa -f

both before and after the hang.

Thanks,
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-15 Thread Manfred Spraul

Kanoj Sarcar wrote:
> 
> Okay, I will quote from Intel Architecture Software Developer's Manual
> Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:
> 
> "Bus cycles to the page directory and page tables in memory are performed
> only when the TLBs do not contain the translation information for a
> requested page."
> 
> And on the same page:
> 
> "Whenever a page directory or page table entry is changed (including when
> the present flag is set to zero), the operating system must immediately
> invalidate the corresponding entry in the TLB so that it can be updated
> the next time the entry is referenced."
>

But there is another paragraph that mentions that an OS may use lazy tlb
shootdowns.
[search for shootdown]

You checked the far too obvious chapters, remember that Intel wrote the
documentation ;-)
I searched for 'dirty' though Vol 3 and found

Chapter 7.1.2.1 Automatic locking.

.. the processor uses locked cycles to set the accessed and dirty flag
in the page-directory and page-table entries.

But that obviously doesn't answer your question.

Is the sequence
<< lock;
read pte
pte |= dirty
write pte
>> end lock;
or
<< lock;
read pte
if (!present(pte))
do_page_fault();
pte |= dirty
write pte.
>> end lock;

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-15 Thread Manfred Spraul

Linus Torvalds wrote:
> 
> In article <[EMAIL PROTECTED]>,
> Jamie Lokier  <[EMAIL PROTECTED]> wrote:
> >> > << lock;
> >> > read pte
> >> > if (!present(pte))
> >> >do_page_fault();
> >> > pte |= dirty
> >> > write pte.
> >> > >> end lock;
> >>
> >> No, it is a little more complicated. You also have to include in the
> >> tlb state into this algorithm. Since that is what we are talking about.
> >> Specifically, what does the processor do when it has a tlb entry allowing
> >> RW, the processor has only done reads using the translation, and the
> >> in-memory pte is clear?
> >
> >Yes (no to the no): Manfred's pseudo-code is exactly the question you're
> >asking.  Because when the TLB entry is non-dirty and you do a write, we
> >_know_ the processor will do a locked memory cycle to update the dirty
> >bit.  A locked memory cycle implies read-modify-write, not "write TLB
> >entry + dirty" (which would be a plain write) or anything like that.
> >
> >Given you know it's a locked cycle, the only sensible design from Intel
> >is going to be one of Manfred's scenarios.
> 
> Not necessarily, and this is NOT guaranteed by the docs I've seen.
> 
> It _could_ be that the TLB data actually also contains the pointer to
> the place where it was fetched, and a "mark dirty" becomes
> 
> read *ptr locked
> val |= D
> write *ptr unlock
>

Jamie wrote "one of my scenarios", that's the other option ;-)

> Now, I will agree that I suspect most x86 _implementations_ will not do
> this. TLB's are too timing-critical, and nobody tends to want to make
> them bigger than necessary - so saving off the source address is
> unlikely. Also, setting the D bit is not a very common operation, so
> it's easy enough to say that an internal D-bit-fault will just cause a
> TLB re-load, where the TLB re-load just sets the A and D bits as it
> fetches the entry (and then page fault handling is an automatic result
> of the reload).
>

Then the cpu would also have to support setting the D bit in the page
directory, but it doesn't.

Probably Kanoj is right, the current code is not guaranteed by the
specs.

But if we change the interface, could we think about the poor s390
developers?

s390 only has a "clear the present bit in the pte and flush the tlb"
instruction.


From your other post:
>pte = ptep_get_and_clear(page_table);
>flush_tlb_page(vma, address);
>+   pte = ptep_update_after_flush(page_table, pte);

What about one arch specific

pte = ptep_get_and_invalidate(vma, address, page_table);


On i386 SMP it would

{
pte = *page_table_entry;
if(!present(pte))
return pte;
lock; andl 0xfffffffe, *page_table_entry;
flush_tlb_page();
return *page_table_entry | 1;
}

> 
> The "gather" operation could possibly be improved to make the other
> CPU's do useful work while being shot down (ie schedule away to another
> mm), but that has it's own pitfalls too.
>
IMHO scheduling away is the best long term solution.
Perhaps try to schedule away, just to improve the probability that
mm->cpu_vm_mask is clear.

I just benchmarked a single flush_tlb_page().

Pentium II 350: ~ 2000 cpu ticks.
Pentium III 850: ~ 3000 cpu ticks.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-15 Thread Manfred Spraul

Manfred Spraul wrote:
> 
> I just benchmarked a single flush_tlb_page().
> 
> Pentium II 350: ~ 2000 cpu ticks.
> Pentium III 850: ~ 3000 cpu ticks.
>
I forgot the important part:
SMP, including a smp_call_function() IPI.

IIRC Ingo wrote that a local 'invlpg' is around 100 ticks.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Tulip in 2.4.1-ac14 still poorly

2001-02-16 Thread Manfred Spraul

Stephen Thomas wrote:
> 
> On trying 2.4.1-ac13 I hit the tulip driver problems reported elsewhere,
> and ac14 does not seem to fix the problem on my machine.  Attached is an
> extract from my /var/log/messages.
>

Could you try the attached oneliner patches? 

patch-tulip-fix1 is integrated in -ac15, and I sent patch-tulip-typo to
Alan a few seconds ago.

--
Manfred

diff -u 2.4/drivers/net/tulip/pnic.c build-2.4/drivers/net/tulip/pnic.c
--- 2.4/drivers/net/tulip/pnic.cFri Feb 16 11:17:03 2001
+++ build-2.4/drivers/net/tulip/pnic.c  Fri Feb 16 11:18:08 2001
@@ -84,7 +84,7 @@
tp->full_duplex_lock;
 
new_csr6 = tp->csr6;
-   if (negotiated & 0x038) /* 100mbps. */
+   if (negotiated & 0x0380)/* 100mbps. */
new_csr6 &= ~0x0040;
 else
new_csr6 |= 0x0040;



diff -u 2.4/drivers/net/tulip/pnic.c build-2.4/drivers/net/tulip/pnic.c
--- 2.4/drivers/net/tulip/pnic.cThu Feb 15 00:51:38 2001
+++ build-2.4/drivers/net/tulip/pnic.c  Thu Feb 15 01:05:59 2001
@@ -93,8 +93,6 @@
 else   
new_csr6 &= ~0x0200;
if (new_csr6 != tp->csr6) {
-   /* stop the transceiver*/
-   tulip_stop_rxtx(tp, tp->csr6);
tp->full_duplex = duplex;
tp->csr6 = new_csr6;
if (tulip_debug > 0)
@@ -102,10 +100,7 @@
   "#%d link partner capability of %4.4x.\n",
   dev->name, tp->full_duplex ? "full" : "half",
   tp->phys[0], mii_reg5);
-   /* When the transceiver is stopped it triggeres
-* a "Transmit stopped interrupt" (misnamed as TxDied).
-* The interrupt handler will restart the transceiver
-*/
+   tulip_restart_rxtx(tp, tp->csr6);
return 1;
}
return 0;




Re: x86 ptep_get_and_clear question

2001-02-16 Thread Manfred Spraul

Jamie Lokier wrote:
> 
> Linus Torvalds wrote:
> > So the only case that ends up being fairly heavy may be a case that is
> > very uncommon in practice (only for unmapping shared mappings in
> > threaded programs or the lazy TLB case).
>
The lazy tlb case is quite fast: lazy tlb threads never write to user
space pages, so we don't need to protect the dirty bits. And the first ipi
clears mm->cpu_vm_mask, so only one ipi is needed.
>
> I can think of one case where performance is considered quite important:
> mprotect() is used by several garbage collectors, including threaded
> ones.  Maybe mprotect() isn't the best primitive for those anyway, but
> it's what they have to work with atm.
>

Does mprotect() actually care for wrong dirty bits?
The race should be invisible to user space apps.

>>> mprotect()
for_all_affected_ptes() {
lock andl ~PERMISSION_MASK, *pte;
lock orl new_permission, *pte;
}
< now another cpu could still write to the write protected pages
< and set the dirty bit, but who cares? Shouldn't be a problem.
flush_tlb_range().
< tlb flush before ending the syscall, user space can't notice
< the delay.


--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Manfred Spraul

Jamie Lokier wrote:
> 
> /* mprotect.c */
> entry = ptep_get_and_clear(pte);
> set_pte(pte, pte_modify(entry, newprot));
> 
> I.e. the only code with the race condition is code which explicitly
> clears the dirty bit, in vmscan.c.
> 
> Do you see any possibility of losing a dirty bit here?
>
Of course.
Just check the output after preprocessing.
It's 
int entry;
entry = *pte;
entry &= ~_PAGE_CHG_MASK;
entry |= pgprot_val(newprot);
*pte = entry;

We need
atomic_clear_mask (_PAGE_CHG_MASK, pte);
atomic_set_mask (pgprot_val(newprot), *pte);

for multi threaded apps.

> If not, there's no need for the intricate "gather" or "double scan"
> schemes for mprotect() and it can stay as fast as possible.
>
Correct, but we need a platform specific "update_pte", and perhaps
update_begin, update_end hooks (empty on i386) for other archs.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Manfred Spraul

Jamie Lokier wrote:
> 
> And how does that lose a dirty bit?
> 
> For the other processor to not write a dirty bit, it must have a dirty
^^^
> TLB entry already which, along with the locked cycle in
> ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY set.  The
> dirty bit is not lost.
> 
The other cpu writes the dirty bit - we just overwrite it ;-)
After the ptep_get_and_clear(), before the set_pte().

The current assumption about the page dirty logic is:
A cpu that has a writable, non-dirty pte cached in its tlb may
unconditionally set the dirty bit - without honoring present or write
protected bits.

--> set_pte() can either lose a dirty bit, or a 'pte_none() entry' could
suddenly become a swap entry, unless it's guaranteed that no cpu has a
cached valid tlb entry.

Linus, does the proposed pte gather code handle the second part?
pte_none() suddenly becomes 0x0040.

Back to the current mprotect.c code:

pte is writable, not-dirty.

cpu1:
has a writable, non-dirty pte in it's tlb.
cpu 2: in mprotect.c
entry = ptep_get_and_clear(pte);
* pte now clear.
* entry contains the pte value without
  the dirty bit
cpu decodes a write instruction, and dirties the pte.
lock; orl DIRTY_BIT, *pte
set_pte(pte, pte_modify(entry, newprot));
* pte overwritten with entry.

--> dirty bit lost.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


