[PATCH RESEND 0/3] partitions: efi: tighten gpt header integrity checks

2012-10-25 Thread Davidlohr Bueso
Hi Jens,

This is a resend of a patchset sent in early September to harden GPT header 
checks.

 partitions: efi: check minimum header size
 partitions: efi: verify header is outside usable area
 partitions: efi: compare first and last usable LBAs

 block/partitions/efi.c |   20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)


Thanks,
Davidlohr


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RESEND 1/3] partitions: efi: compare first and last usable LBAs

2012-10-25 Thread Davidlohr Bueso
When verifying GPT header integrity, make sure that
first usable LBA is smaller than last usable LBA.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 6296b40..7795bb4 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -344,6 +344,12 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 	 * within the disk.
 	 */
 	lastlba = last_lba(state->bdev);
+	if (le64_to_cpu((*gpt)->last_usable_lba) < le64_to_cpu((*gpt)->first_usable_lba)) {
+		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
+			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba));
+		goto fail;
+	}
 	if (le64_to_cpu((*gpt)->first_usable_lba) > lastlba) {
 		pr_debug("GPT: first_usable_lba incorrect: %lld > %lld\n",
 			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba),
-- 
1.7.9.5






[PATCH RESEND 2/3] partitions: efi: verify header is outside usable area

2012-10-25 Thread Davidlohr Bueso
The first usable logical block can be used by a GUID partition
entry, and therefore cannot be used by the header.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 7795bb4..abf33a2 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -363,6 +363,13 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 		goto fail;
 	}
 
+	/* The header must be outside usable range */
+	if (le64_to_cpu((*gpt)->first_usable_lba) < lba &&
+	    le64_to_cpu((*gpt)->last_usable_lba) > lba) {
+		pr_debug("GPT: Header is inside usable area\n");
+		goto fail;
+	}
+
 	/* Check that sizeof_partition_entry has the correct value */
 	if (le32_to_cpu((*gpt)->sizeof_partition_entry) != sizeof(gpt_entry)) {
 		pr_debug("GUID Partitition Entry Size check failed.\n");
-- 
1.7.9.5






[PATCH RESEND 3/3] partitions: efi: check minimum header size

2012-10-25 Thread Davidlohr Bueso
As per UEFI specs 2.3.1 (June 2012),
The Header Size must be greater than 92 and must be less than
or equal to the logical block size

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index abf33a2..688b59c 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -25,6 +25,9 @@
  * TODO:
  *
  * Changelog:
+ * Oct 2012 Davidlohr Bueso d...@gnu.org
+ * - tighten GPT header integrity verification.
+ *
  * Mon Nov 09 2004 Matt Domsch matt_dom...@dell.com
  * - test for valid PMBR and valid PGPT before ever reading
  *   AGPT, allow override with 'gpt' kernel command line option.
@@ -311,8 +314,8 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 	}
 
 	/* Check the GUID Partition Table header size */
-	if (le32_to_cpu((*gpt)->header_size) >
-			bdev_logical_block_size(state->bdev)) {
+	if (le32_to_cpu((*gpt)->header_size) < 92 ||
+	    le32_to_cpu((*gpt)->header_size) > bdev_logical_block_size(state->bdev)) {
 		pr_debug("GUID Partition Table Header size is wrong: %u > %u\n",
 			le32_to_cpu((*gpt)->header_size),
 			bdev_logical_block_size(state->bdev));
-- 
1.7.9.5
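
For readers who want to see the three checks from this series side by side, here is a minimal user-space sketch. It is not the kernel's is_gpt_valid(): the struct gpt_hdr_view type, the gpt_hdr_sane() helper and the GPT_MIN_HEADER_SIZE name are hypothetical, and the fields are assumed to be already converted to CPU byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, CPU-endian subset of a GPT header; not the kernel's gpt_header. */
struct gpt_hdr_view {
	uint32_t header_size;       /* in bytes */
	uint64_t my_lba;            /* LBA the header was read from */
	uint64_t first_usable_lba;
	uint64_t last_usable_lba;
};

#define GPT_MIN_HEADER_SIZE 92u /* lower bound used by patch 3/3 */

/* Returns true when the header passes the checks discussed in this series. */
static bool gpt_hdr_sane(const struct gpt_hdr_view *h, uint32_t block_size,
			 uint64_t disk_last_lba)
{
	/* 3/3: header size must lie in [92, logical block size] */
	if (h->header_size < GPT_MIN_HEADER_SIZE || h->header_size > block_size)
		return false;

	/* pre-existing checks: both usable LBAs must lie within the disk */
	if (h->first_usable_lba > disk_last_lba ||
	    h->last_usable_lba > disk_last_lba)
		return false;

	/* 1/3: the usable range must not be inverted */
	if (h->last_usable_lba < h->first_usable_lba)
		return false;

	/* 2/3: the header must not sit strictly inside the usable range */
	if (h->first_usable_lba < h->my_lba && h->my_lba < h->last_usable_lba)
		return false;

	return true;
}
```

A caller would fill a struct gpt_hdr_view from the on-disk header and pass the device's logical block size and last LBA; a false return corresponds to the "goto fail" paths in the patches.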







[PATCH] x86, amd: rename vmmu support capability

2012-07-13 Thread Davidlohr Bueso
From: Davidlohr Bueso d...@gnu.org

AMD has renamed nested page table technology to rapid virtualization indexing,
reflect this change in the kernel.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 arch/x86/include/asm/cpufeature.h |2 +-
 arch/x86/kernel/cpu/scattered.c   |2 +-
 arch/x86/kvm/svm.c|2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index f91e80f..a6fa778 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -185,7 +185,7 @@
 #define X86_FEATURE_FLEXPRIORITY (8*32+ 2) /* Intel FlexPriority */
 #define X86_FEATURE_EPT		(8*32+ 3) /* Intel Extended Page Table */
 #define X86_FEATURE_VPID	(8*32+ 4) /* Intel Virtual Processor ID */
-#define X86_FEATURE_NPT		(8*32+ 5) /* AMD Nested Page Table support */
+#define X86_FEATURE_RVI		(8*32+ 5) /* AMD Rapid Virtualization Indexing support */
 #define X86_FEATURE_LBRV	(8*32+ 6) /* AMD LBR Virtualization support */
 #define X86_FEATURE_SVML	(8*32+ 7) /* "svm_lock" AMD SVM locking MSR */
 #define X86_FEATURE_NRIPS	(8*32+ 8) /* "nrip_save" AMD SVM next_rip save */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index ee8e9ab..78ec9e6 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -41,7 +41,7 @@ void __cpuinit init_scattered_cpuid_features(struct cpuinfo_x86 *c)
 		{ X86_FEATURE_XSAVEOPT,		CR_EAX, 0, 0x0000000d, 1 },
 		{ X86_FEATURE_CPB,		CR_EDX, 9, 0x80000007, 0 },
 		{ X86_FEATURE_HW_PSTATE,	CR_EDX, 7, 0x80000007, 0 },
-		{ X86_FEATURE_NPT,		CR_EDX, 0, 0x8000000a, 0 },
+		{ X86_FEATURE_RVI,		CR_EDX, 0, 0x8000000a, 0 },
 		{ X86_FEATURE_LBRV,		CR_EDX, 1, 0x8000000a, 0 },
 		{ X86_FEATURE_SVML,		CR_EDX, 2, 0x8000000a, 0 },
 		{ X86_FEATURE_NRIPS,		CR_EDX, 3, 0x8000000a, 0 },
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f75af40..6863898 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -900,7 +900,7 @@ static __init int svm_hardware_setup(void)
 		goto err;
 	}
 
-	if (!boot_cpu_has(X86_FEATURE_NPT))
+	if (!boot_cpu_has(X86_FEATURE_RVI))
 		npt_enabled = false;
 
 	if (npt_enabled && !npt) {
-- 
1.7.4.1





Re: [PATCH] x86, amd: rename vmmu support capability

2012-07-14 Thread Davidlohr Bueso
On Sat, 2012-07-14 at 12:19 +0200, Borislav Petkov wrote:
 On Fri, Jul 13, 2012 at 08:02:55PM +0200, Davidlohr Bueso wrote:
  From: Davidlohr Bueso d...@gnu.org
  
  AMD has renamed nested page table technology to rapid virtualization 
  indexing,
  reflect this change in the kernel.
  
  Signed-off-by: Davidlohr Bueso d...@gnu.org
 
 You know that /proc/cpuinfo is a userspace ABI, right?

Yes.

 
 And are you sure nothing is using that string -
 npt - since it got added almost three years ago by
 414bb144efa2d2fe16d104d836d0d6b6e9265788?

AFAIK no, it's not being used - that doesn't mean, of course, that there
are no users.

Thanks,
Davidlohr
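
To make the ABI concern concrete, here is a rough sketch of the kind of user-space check that could depend on the "npt" string in the flags line of /proc/cpuinfo. The cpuinfo_has_flag() helper is hypothetical, and "rvi" is only an assumption about how the renamed flag would have appeared.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if `flag` appears as a whole whitespace-separated word in
 * `flags` (one "flags : ..." line as read from /proc/cpuinfo).
 */
static bool cpuinfo_has_flag(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		/* Accept the match only when it is delimited on both sides. */
		bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}
```

A tool written this way keeps working only as long as the string stays "npt"; after a rename it would silently report the feature as missing, which is the breakage Boris is pointing at.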



Re: [PATCH] x86, amd: rename vmmu support capability

2012-07-14 Thread Davidlohr Bueso
On Sat, 2012-07-14 at 15:38 +0200, H. Peter Anvin wrote:
 Yep, NAK on this one.

Ok, we could at least add a comment when defining X86_FEATURE_NPT.

Thanks,
Davidlohr

 
 Borislav Petkov b...@alien8.de wrote:
 
 On Fri, Jul 13, 2012 at 08:02:55PM +0200, Davidlohr Bueso wrote:
  From: Davidlohr Bueso d...@gnu.org
  
  AMD has renamed nested page table technology to rapid virtualization
 indexing,
  reflect this change in the kernel.
  
  Signed-off-by: Davidlohr Bueso d...@gnu.org
 
 You know that /proc/cpuinfo is a userspace ABI, right?
 
 And are you sure nothing is using that string -
 npt - since it got added almost three years ago by
 414bb144efa2d2fe16d104d836d0d6b6e9265788?
 
 -- 
 Regards/Gruss,
 Boris.
 




Re: [PATCH v2 2/3] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

2013-04-15 Thread Davidlohr Bueso
On Mon, 2013-04-15 at 10:37 -0400, Waiman Long wrote:
[...]
 +typedef struct mspin_node {
 +	struct mspin_node *next;
 +	int		  locked;	/* 1 if lock acquired */
 +} mspin_node_t;
 +
 +typedef mspin_node_t *mspin_lock_t;

I think we could do without the typedefs, especially mspin_lock_t.

 +
 +#define	MLOCK(mutex)	((mspin_lock_t *)(&(mutex)->spin_mlock))
 +
 +static noinline void mspin_lock(mspin_lock_t *lock,  mspin_node_t *node)
 +{
 +	mspin_node_t *prev;
 +
 +	/* Init node */
 +	node->locked = 0;
 +	node->next   = NULL;
 +
 +	prev = xchg(lock, node);
 +	if (likely(prev == NULL)) {
 +		/* Lock acquired */
 +		node->locked = 1;
 +		return;
 +	}
 +	ACCESS_ONCE(prev->next) = node;
 +	smp_wmb();
 +	/* Wait until the lock holder passes the lock down */
 +	while (!ACCESS_ONCE(node->locked))
 +		arch_mutex_cpu_relax();
 +}
 +
 +static void mspin_unlock(mspin_lock_t *lock,  mspin_node_t *node)
 +{
 +	mspin_node_t *next = ACCESS_ONCE(node->next);
 +
 +	if (likely(!next)) {
 +		/*
 +		 * Release the lock by setting it to NULL
 +		 */
 +		if (cmpxchg(lock, node, NULL) == node)
 +			return;
 +		/* Wait until the next pointer is set */
 +		while (!(next = ACCESS_ONCE(node->next)))
 +			arch_mutex_cpu_relax();
 +	}
 +	barrier();
 +	ACCESS_ONCE(next->locked) = 1;
 +	smp_wmb();

Do we really need the compiler barrier call? The CPUs can reorder
anyway. I assume the smp_wmb() call makes sure there's no funny
business before the next lock is acquired, might be worth commenting.

Thanks,
Davidlohr
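
As a point of comparison for the ordering question above, here is a user-space MCS queue lock sketch written with C11 atomics. It is not the code from the patch: the struct mcs_node and struct mcs_lock types and the mcs_lock_acquire()/mcs_lock_release() names are hypothetical, and acquire/release orderings stand in for the kernel's ACCESS_ONCE(), barrier() and smp_wmb() calls.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;            /* set by the predecessor on hand-off */
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Swap ourselves in as the new tail; the old tail is our predecessor. */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;                /* queue was empty: lock acquired */

	/* Link behind the predecessor, then spin on our own flag only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (next == NULL) {
		struct mcs_node *expected = node;

		/* No visible successor: try to reset the tail to empty. */
		if (atomic_compare_exchange_strong_explicit(&lock->tail,
				&expected, NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;
		/* A successor swapped in but has not linked yet; wait for it. */
		while ((next = atomic_load_explicit(&node->next,
				memory_order_acquire)) == NULL)
			;
	}
	/* The release store publishes the critical section to the successor. */
	atomic_store_explicit(&next->locked, true, memory_order_release);
}
```

In this formulation the hand-off is a single release store, so the ordering requirement is stated on the store itself rather than by barriers placed around it. Each waiter spins on its own node, which is the property that cuts down on cacheline bouncing.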



Re: [PATCH v4 3/4] mutex: Queue mutex spinners with MCS lock to reduce cacheline contention

2013-04-17 Thread Davidlohr Bueso
On Wed, 2013-04-17 at 15:23 -0400, Waiman Long wrote:
 The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
 on) allows multiple tasks to spin on a single mutex concurrently. A
 potential problem with the current approach is that when the mutex
 becomes available, all the spinning tasks will try to acquire the
 mutex more or less simultaneously. As a result, there will be a lot of
 cacheline bouncing especially on systems with a large number of CPUs.
 
 This patch tries to reduce this kind of contention by putting the
 mutex spinners into a queue so that only the first one in the queue
 will try to acquire the mutex. This will reduce contention and allow
 all the tasks to move forward faster.
 
 The queuing of mutex spinners is done using an MCS lock based
 implementation which will further reduce contention on the mutex
 cacheline than a similar ticket spinlock based implementation. This
 patch will add a new field into the mutex data structure for holding
 the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
 and 4 bytes for 32-bit system. This overhead will be avoided if the
 MUTEX_SPIN_ON_OWNER option is turned off.
 
 The following table shows the jobs per minute (JPM) scalability data
 on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
 command is used to restrict the running of the fserver workloads to
 1/2/4/8 nodes with hyperthreading off.
[...]
 
 The short workload is the only one that shows a decline in performance
 probably due to the spinner locking and queuing overhead.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Acked-by: Rik van Riel r...@redhat.com

Reviewed-by: Davidlohr Bueso davidlohr.bu...@hp.com




Re: linux-next: Tree for Apr 18 [ call-trace: drm | x86 | smp | rcu related? ]

2013-04-19 Thread Davidlohr Bueso
On Fri, 2013-04-19 at 15:19 -0400, Rik van Riel wrote:
 On 04/19/2013 02:53 PM, Sedat Dilek wrote:
  On Fri, Apr 19, 2013 at 6:43 PM, Sedat Dilek sedat.di...@gmail.com wrote:
 
  I tried to switch from SLUB to SLAB...
 
  ...and also from VIRT_CPU_ACCOUNTING_GEN to TICK_CPU_ACCOUNTING.
 
  2x NOPE.
 
  In one kernel-build I saw in my console...
 
semop(1): encountered an error: Identifier removed

This looks like what Emmanuel was/is running into:
https://lkml.org/lkml/2013/3/30/1

 
  ...if this says sth. to you.
 
  [ CC folks from below thread ]
 
  I have found a thread called Re: ipc,sem: sysv semaphore scalability
  on LKML with a screenshot that shows the same call-trace.
  I followed it a bit.
  There is a patch in [3]... unconfirmed.
 
  Comments on the rcu read-lock and sem_lock() vs sem_unlock() from Linus.
 
  What's the status of this discussion?
 
  - Sedat -
 
  [1] https://lkml.org/lkml/2013/3/30/6
  [2] http://i.imgur.com/uk6gmq1.jpg
  [3] https://lkml.org/lkml/2013/3/31/12
  [4] https://lkml.org/lkml/2013/3/31/77
 
 I am at a conference right now, but when I get
 back I will check linux-next vs. all the fixes from
 the semaphore scalability email thread.

I'm back from the collab. summit, so AFAICT these still need to go in
linux-next:

ipc,sem: untangle RCU locking with find_alloc_undo:
https://lkml.org/lkml/2013/3/28/275

ipc,sem: fix lockdep false positive:
https://lkml.org/lkml/2013/3/29/119

ipc, sem: do not call sem_lock when bogus sma:
https://lkml.org/lkml/2013/3/31/12

Thanks,
Davidlohr



Re: linux-next: Tree for Apr 18 [ call-trace: drm | x86 | smp | rcu related? ]

2013-04-20 Thread Davidlohr Bueso
On Sat, 2013-04-20 at 02:19 +0200, Sedat Dilek wrote:
 On Sat, Apr 20, 2013 at 2:06 AM, Sedat Dilek sedat.di...@gmail.com wrote:
  On Sat, Apr 20, 2013 at 1:02 AM, Linus Torvalds
  torva...@linux-foundation.org wrote:
  On Fri, Apr 19, 2013 at 3:55 PM, Sedat Dilek sedat.di...@gmail.com wrote:
 
  Davidlohr pointed to this patch (tested the triplet):
 
  ipc, sem: do not call sem_lock when bogus sma:
  https://lkml.org/lkml/2013/3/31/12
 
  Is that what you mean?
 
  Yup.
 
 
  Davidlohr Bueso (1):
ipc, sem: do not call sem_lock when bogus sma
 
  Linus Torvalds (1):
crazy rcu double free debug hack
 
  With ***both*** patches applied I am able to build a Linux-kernel with
  4 parallel-make-jobs again.
  David's or your patch alone are not sufficient!
 
 
 [ Still both patches applied ]
 
 To correct myself... The 1st run was OK.
 
 The 2nd run shows a NULL-pointer-deref (excerpt):
 
 [  178.490583] BUG: spinlock bad magic on CPU#1, sh/8066
 [  178.490595]  lock: 0x88008b53ea18, .magic: 6b6b6b6b, .owner:
 make/8068, .owner_cpu: 3
 [  178.490599] BUG: unable to handle kernel NULL pointer dereference
 at   (null)
 [  178.490608] IP: [812bacd0] update_queue+0x70/0x210
 [  178.490610] PGD 0
 [  178.490612] Oops:  [#1] SMP
 ...

The exit_sem() > do_smart_update() > update_queue() calls seem pretty
well protected. Furthermore we're asserting that sma->sem_perm.lock is
taken. This could just be a consequence of another issue. Earlier this
week Andrew pointed out a potential race in semctl_main() where
sma->sem_perm.deleted could be changed when cmd == GETALL.

Sedat, could you try the attached patch to keep the ipc lock acquired
(on top of the three patches you're already using) and let us know how
it goes? We could also just have the RCU read lock instead of
->sem_perm.lock for GETALL, but let's play it safe for now.

Thanks,
Davidlohr

diff --git a/ipc/sem.c b/ipc/sem.c
index 5711616..1dfc3c1 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,10 +1243,11 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 				err = -EIDRM;
 				goto out_free;
 			}
-			sem_unlock(sma, -1);
 		}
 
-		sem_lock(sma, NULL, -1);
+		/* has the ipc lock already been taken? */
+		if(nsems <= SEMMSL_FAST)
+			sem_lock(sma, NULL, -1);
 		for (i = 0; i < sma->sem_nsems; i++)
 			sem_io[i] = sma->sem_base[i].semval;
 		sem_unlock(sma, -1);


Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo

2013-03-28 Thread Davidlohr Bueso
On Thu, 2013-03-28 at 11:32 -0400, Rik van Riel wrote:
 On Tue, 26 Mar 2013 13:33:07 -0400
 Sasha Levin sasha.le...@oracle.com wrote:
 
  [   96.347341] 
  [   96.348085] [ BUG: lock held when returning to user space! ]
  [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
 W
  [   96.360300] 
  [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
  held!
  [   96.362019] 1 lock held by trinity-child9/7583:
  [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] 
  SYSC_semtimedop+0x1fb/0xec0
  
  It seems that we can leave semtimedop without releasing the rcu read lock.
 
 Sasha, this patch untangles the RCU locking with find_alloc_undo,
 and should fix the above issue. As a side benefit, this makes the
 code a little cleaner.
 
 Next up: implement locking in a way that does not trigger any 
 lockdep warnings...
 
 ---8---
 
 Subject: ipc,sem: untangle RCU locking with find_alloc_undo
 
 The ipc semaphore code has a nasty RCU locking tangle, with both
 find_alloc_undo and semtimedop taking the rcu_read_lock(). The
 code can be cleaned up somewhat by only taking the rcu_read_lock
 once.

indeed!

 
 The only caller of find_alloc_undo is in semtimedop.
 
 This should solve the trinity issue reported by Sasha Levin.
 
 Reported-by: Sasha Levin sasha.le...@oracle.com
 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  ipc/sem.c |   31 +--
  1 files changed, 9 insertions(+), 22 deletions(-)
 
 diff --git a/ipc/sem.c b/ipc/sem.c
 index f46441a..2ec2945 100644
 --- a/ipc/sem.c
 +++ b/ipc/sem.c
 @@ -1646,22 +1646,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
 __user *, tsops,
   alter = 1;
   }
  
 + INIT_LIST_HEAD(&tasks);
 +
   if (undos) {
 + /* On success, find_alloc_undo takes the rcu_read_lock */
   un = find_alloc_undo(ns, semid);

find_alloc_undo() has some nested rcu_read_locks of its own. We can
simplify that as well. Will look into it, but don't want to introduce
any more changes until we address all the issues with the patchset, and
know it to behave.

   if (IS_ERR(un)) {
   error = PTR_ERR(un);
   goto out_free;
   }
 - } else
 + } else {
   un = NULL;
 + rcu_read_lock();
 + }
  
 - INIT_LIST_HEAD(&tasks);
 -
 - rcu_read_lock();
   sma = sem_obtain_object_check(ns, semid);
   if (IS_ERR(sma)) {
 - if (un)
 - rcu_read_unlock();
 + rcu_read_unlock();
   error = PTR_ERR(sma);
   goto out_free;
   }
 @@ -1693,22 +1694,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
 __user *, tsops,
*/
   error = -EIDRM;
   locknum = sem_lock(sma, sops, nsops);
 - if (un) {
 - if (un->semid == -1) {
 - rcu_read_unlock();
 - goto out_unlock_free;
 - } else {
 - /*
 -  * rcu lock can be released, un cannot disappear:
 -  * - sem_lock is acquired, thus IPC_RMID is
 -  *   impossible.
 -  * - exit_sem is impossible, it always operates on
 -  *   current (or a dead task).
 -  */
 -
 - rcu_read_unlock();
 - }
 - }
 + if (un && un->semid == -1)
 + goto out_unlock_free;

Yeah, I was tempted in doing something much like this, but didn't want
to change any existing logic. Hopefully we can get away with this and it
fixes Sasha's issue.

  
   error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
   if (error <= 0) {

Reviewed-by: Davidlohr Bueso davidlohr.bu...@hp.com




[PATCH v2 1/2] rbtree_test: add extra rbtree integrity check

2013-03-29 Thread Davidlohr Bueso
Account for the rbtree having at least 2**bh(v)-1 internal nodes.

While this can be seen as a consequence of other checks, Michel states
that it nicely sums up what the other properties are for.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 lib/rbtree_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index af38aed..9951503 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -117,8 +117,7 @@ static int black_path_count(struct rb_node *rb)
 static void check(int nr_nodes)
 {
 	struct rb_node *rb;
-	int count = 0;
-	int blacks = 0;
+	int count = 0, blacks = 0;
 	u32 prev_key = 0;
 
 	for (rb = rb_first(&root); rb; rb = rb_next(rb)) {
@@ -134,7 +133,9 @@ static void check(int nr_nodes)
 		prev_key = node->key;
 		count++;
 	}
+
 	WARN_ON_ONCE(count != nr_nodes);
+	WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1);
 }
 
 static void check_augmented(int nr_nodes)
-- 
1.7.11.7
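
The bound behind the new WARN_ON_ONCE() can be written out in isolation. The sketch below is user-space C with hypothetical helper names (rb_min_nodes(), rb_count_too_small()); it assumes the black height is smaller than the bit width of unsigned long, so the shift does not overflow.

```c
#include <stdbool.h>

/*
 * A red-black tree whose root-to-leaf paths each contain `black_height`
 * black nodes holds at least 2**black_height - 1 internal nodes.
 */
static unsigned long rb_min_nodes(unsigned int black_height)
{
	return (1UL << black_height) - 1;
}

/* Same shape as the added check: true means the test would warn. */
static bool rb_count_too_small(unsigned long count, unsigned int black_height)
{
	return count < rb_min_nodes(black_height);
}
```

For example, a tree with black height 3 must have at least 7 nodes, so a count of 6 would trip the warning while 7 would not.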






[PATCH v2 2/2] rbtree_test: add __init/__exit annotations

2013-03-29 Thread Davidlohr Bueso
Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 lib/rbtree_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index 9951503..122f02f 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -149,7 +149,7 @@ static void check_augmented(int nr_nodes)
 	}
 }
 
-static int rbtree_test_init(void)
+static int __init rbtree_test_init(void)
 {
 	int i, j;
 	cycles_t time1, time2, time;
@@ -222,7 +222,7 @@ static int rbtree_test_init(void)
 	return -EAGAIN; /* Fail will directly unload the module */
 }
 
-static void rbtree_test_exit(void)
+static void __exit rbtree_test_exit(void)
 {
 	printk(KERN_ALERT "test exit\n");
 }
-- 
1.7.11.7






Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso
On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote:
 Hi Linus,
 
 On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  Emmanuel, can you try the attached patch? I think it applies cleanly
  on top of the scalability series too without any changes, but I didn't
  check if the patches perhaps changed some of the naming or something.
 
 I had to slightly modify the patch since it wouldn't match the changes
 introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
 hope that was the right thing to do. So, what I tried was: original 7
 patches + the one liner + your patch blindly modified by me on the top
 of 3.9-rc4 and I'm still having twilight zone issues.

Not sure which one liner you refer to, but, if you haven't already done
so, please try with these fixes (queued in linux-next):

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31

I've been trying to reproduce your twilight zone problem on five
different machines now without any luck. Is there anything you're doing
to trigger the issue? Does the machine boot ok and then do weird things,
say after X starts, open some program, etc?

Thanks,
Davidlohr



Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso
On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote:
 On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty benist...@gmail.com wrote:
 
  I had to slightly modify the patch since it wouldn't match the changes
  introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
  hope that was the right thing to do. So, what I tried was: original 7
  patches + the one liner + your patch blindly modified by me on the top
  of 3.9-rc4 and I'm still having twilight zone issues.
 
 Ok, please send your patch so that I can double-check what you did,
 but it was simple enough that you probably did the right thing.
 
 Sad. Your case definitely looks like a double rcu-free, as shown by
 the fact that when you enabled SLUB debugging the oops happened with
 the use-after-free pattern (it's __rcu_reclaim() doing the
 head->func(head); thing, and func is 0x6b6b6b6b6b6b6b6b, so head
 has already been free'd once).
 
 So ipc_rcu_putref() and a refcounting error looked very promising as a
 potential explanation.
 
 The 'un' undo structure is also free'd with rcu, but the locking
 around that seems much more robust. The undo entry is on two lists
 (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
 ulp->lock). But those locks are actually tested with
 assert_spin_locked() in all the relevant places, and the code actually
 looks sane. So I had high hopes for ipc_rcu_putref()...
 
 Hmm. Except for exit_sem() that does odd things. You have preemption
 enabled, don't you? exit_sem() does a lookup of the first list_proc
 entry under rcu_read_lock to lookup un->semid, and then it drops the
 rcu read lock. At which point un is no longer reliable, I think. But
 then it still uses un->semid, rather than the stable value it looked
 up under the rcu read lock. Which looks bogus.
 
 So I'd like you to test a few more things:
 
  (a) In exit_sem(), can you change the
 
  sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);
 
  to use just semid rather than un->semid, because I don't
 think un is stable here.

Well that's not really the case in the new code. We don't drop the rcu
read lock until the end of the loop, in sem_unlock(). However, I just
noticed that we're checking sma for error after trying to acquire
sma->sem_perm.lock:

	sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
	sem_lock(sma, NULL, -1);

	/* exit_sem raced with IPC_RMID, nothing to do */
	if (IS_ERR(sma))
		continue;

The IS_ERR(sma) check should be right after the sem_obtain_object_check() call
instead.




Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo

2013-03-30 Thread Davidlohr Bueso
On Sat, 2013-03-30 at 21:30 -0400, Rik van Riel wrote:
 On 03/30/2013 09:35 AM, Sasha Levin wrote:
 
  I'm thinking that the solution is as simple as:
 
 Your patch is absolutely correct.  All it needs now is your
 signed-off-by, so Andrew can merge it into -mm :)
 
 Reviewed-by: Rik van Riel r...@redhat.com

Reviewed-by: Davidlohr Bueso davidlohr.bu...@hp.com



Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Davidlohr Bueso
On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote:
 On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty benist...@gmail.com 
  wrote:
 
  Then I start building a random package and the problems start. They
  may also happen without compiling but this seems to trigger the bug
  quite quickly.
 
  I suspect it's about preemption, and the build just results in enough
  scheduling load that you start hitting whatever race there is.
 
  Anyway, some progress here, I hope: dmesg seems to be
  willing to reveal some secrets (using some pastebin service since this
  is pretty big):
 
  https://gist.github.com/anonymous/5275120
 
  That looks like exactly the exit_sem() bug that Davidlohr was talking
  about, where the
 
  /* exit_sem raced with IPC_RMID, nothing to do */
  if (IS_ERR(sma))
  continue;
 
  should be moved to *before* the
 
  sem_lock(sma, NULL, -1);
 
  call. And apparently the bug I had found is already fixed in -next.
 
 I just tried the 7 original patches + the 2 one liners from -next +
 modified Linus' patch (attached) on the top of 3.9-rc4 using
 PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
 above. I was building two packages at the same time, went away for 30
 seconds, came back and everything froze as soon as I touched the
 laptop's touchpad. Maybe a coincidence but anyway... Another shot in
 the dark, I had this weird message when trying to build gcc:
 semop(2): encountered an error: Identifier removed

*sigh*. I had high hopes for this being the bug triggering your issue,
especially after seeing exit_sem() in the trace.

Emmanuel, just to be sure, do your changes reflect the patch below?
In particular, dropping the rcu read lock before the continue statement
(sorry for not mentioning this in the last email).

Anyway, this is still a bug. Andrew, the patch below applies to
linux-next, please queue this up if you don't have any objections. 

Thanks,
Davidlohr

---8---
From: Davidlohr Bueso davidlohr.bu...@hp.com
Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma

In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling
sem_lock() immediately after obtaining sma. However, if sma isn't valid,
then calling sem_lock() will tend to do bad things.

Move the sma error check right after the sem_obtain_object_check() call instead.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/sem.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
 		struct sem_array *sma;
 		struct sem_undo *un;
 		struct list_head tasks;
-		int semid;
-		int i;
+		int semid, i;
 
 		rcu_read_lock();
 		un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
 		}
 
 		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
-		sem_lock(sma, NULL, -1);
-
 		/* exit_sem raced with IPC_RMID, nothing to do */
-		if (IS_ERR(sma))
+		if (IS_ERR(sma)) {
+			rcu_read_unlock();
 			continue;
+		}
 
+		sem_lock(sma, NULL, -1);
 		un = __lookup_undo(ulp, semid);
 		if (un == NULL) {
 			/* exit_sem raced with IPC_RMID+semget() that created
-- 
1.7.11.7
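
The ordering this patch enforces (test the lookup result for an error
before touching the object) can be sketched in userspace. This is only
an illustration under stated assumptions: the demo_* helpers below are
hypothetical stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR() and for the
sem lookup, and none of them are the ipc/sem.c code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno in a pointer, in the spirit of ERR_PTR(). */
static void *demo_err_ptr(long error)
{
	assert(error < 0 && error >= -DEMO_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* True if the pointer carries an encoded errno, in the spirit of IS_ERR(). */
static int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Recover the errno from an error pointer, in the spirit of PTR_ERR(). */
static long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical object with a "lock" that is just a flag. */
struct demo_sem_array {
	int locked;
};

static struct demo_sem_array demo_table[2];

/* Hypothetical lookup: ids outside the table behave like a removed id. */
static struct demo_sem_array *demo_obtain(int id)
{
	if (id < 0 || id >= 2)
		return demo_err_ptr(-EIDRM);
	return &demo_table[id];
}

/* Returns 0 on success or a negative errno; never touches a bogus pointer. */
static int demo_lock_by_id(int id)
{
	struct demo_sem_array *sma = demo_obtain(id);

	if (demo_is_err(sma))		/* check first ... */
		return (int)demo_ptr_err(sma);

	sma->locked = 1;		/* ... and only then use the object */
	return 0;
}
```

Calling demo_lock_by_id() with an id outside the table returns -EIDRM
without dereferencing the error pointer, which is the case the patch
is about.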





[PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-18 Thread Davidlohr Bueso
This provides nicer message output. Since it seems more appropriate
for the nature of this module, also use KERN_INFO instead of other
levels.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 lib/rbtree_test.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index af38aed..66ca26d 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -1,3 +1,6 @@
+#define KMSG_COMPONENT "rbtree_test"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
 #include <linux/module.h>
 #include <linux/rbtree_augmented.h>
 #include <linux/random.h>
@@ -153,7 +156,7 @@ static int rbtree_test_init(void)
 	int i, j;
 	cycles_t time1, time2, time;
 
-	printk(KERN_ALERT "rbtree testing");
+	pr_info("rbtree testing");
 
 	prandom_seed_state(&rnd, 3141592653589793238ULL);
 	init();
@@ -171,7 +174,7 @@ static int rbtree_test_init(void)
 	time = time2 - time1;
 
 	time = div_u64(time, PERF_LOOPS);
-	printk(" -> %llu cycles\n", (unsigned long long)time);
+	pr_info(" -> %llu cycles\n", (unsigned long long)time);
 
 	for (i = 0; i < CHECK_LOOPS; i++) {
 		init();
@@ -186,7 +189,7 @@ static int rbtree_test_init(void)
 		check(0);
 	}
 
-	printk(KERN_ALERT "augmented rbtree testing");
+	pr_info("augmented rbtree testing");
 
 	init();
 
@@ -203,7 +206,7 @@ static int rbtree_test_init(void)
 	time = time2 - time1;
 
 	time = div_u64(time, PERF_LOOPS);
-	printk(" -> %llu cycles\n", (unsigned long long)time);
+	pr_info(" -> %llu cycles\n", (unsigned long long)time);
 
 	for (i = 0; i < CHECK_LOOPS; i++) {
 		init();
@@ -223,7 +226,7 @@ static int rbtree_test_init(void)
 
 static void rbtree_test_exit(void)
 {
-	printk(KERN_ALERT "test exit\n");
+	pr_info("test exit\n");
 }
 
 module_init(rbtree_test_init)
-- 
1.7.11.7






[PATCH 3/3] rbtree_test: add more rbtree integrity checks

2013-03-18 Thread Davidlohr Bueso
When checking the rbtree, account for more properties:

   - Both children of a red node are black.
   - The tree has at least 2**bh(v)-1 internal nodes.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 lib/rbtree_test.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index 0fea14e..4c84f85 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -106,7 +106,7 @@ static void init(void)
 
 static bool is_red(struct rb_node *rb)
 {
-	return !(rb->__rb_parent_color & 1);
+	return rb ? !(rb->__rb_parent_color & RB_BLACK) : 0;
 }
 
 static int black_path_count(struct rb_node *rb)
@@ -120,24 +120,38 @@ static int black_path_count(struct rb_node *rb)
 static void check(int nr_nodes)
 {
 	struct rb_node *rb;
-	int count = 0;
-	int blacks = 0;
+	int blacks = 0, count = 0;
 	u32 prev_key = 0;
 
 	for (rb = rb_first(&root); rb; rb = rb_next(rb)) {
 		struct test_node *node = rb_entry(rb, struct test_node, rb);
+
+		/* sorted keys */
 		WARN_ON_ONCE(node->key < prev_key);
-		WARN_ON_ONCE(is_red(rb) &&
-			     (!rb_parent(rb) || is_red(rb_parent(rb))));
+
+		if (is_red(rb)) {
+			/*
+			 * root must be black and no path contains two
+			 * consecutive red nodes.
+			 */
+			WARN_ON_ONCE(!rb_parent(rb) || is_red(rb_parent(rb)));
+
+			/* both children of a red node are black */
+			WARN_ON_ONCE(is_red(rb->rb_left) || is_red(rb->rb_right));
+		}
+
 		if (!count)
 			blacks = black_path_count(rb);
 		else
 			WARN_ON_ONCE((!rb->rb_left || !rb->rb_right) &&
 				     blacks != black_path_count(rb));
+
 		prev_key = node->key;
 		count++;
 	}
+
 	WARN_ON_ONCE(count != nr_nodes);
+	WARN_ON_ONCE(count < (1 << black_path_count(rb_last(&root))) - 1);
 }
 
 static void check_augmented(int nr_nodes)
-- 
1.7.11.7
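
For readers without a kernel tree at hand, the two properties named in
the changelog can be sketched on a plain binary tree. This is a
simplified userspace illustration, not lib/rbtree_test.c: struct
demo_node and the demo_* functions are hypothetical, and the black
height is taken along the leftmost path rather than from rb_last().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node: an explicit colour flag instead of __rb_parent_color. */
struct demo_node {
	struct demo_node *left, *right;
	bool red;
};

/* NULL leaves count as black, as in the patched is_red(). */
static bool demo_is_red(const struct demo_node *n)
{
	return n ? n->red : false;
}

/* Number of black nodes on the leftmost root-to-leaf path. */
static int demo_black_height(const struct demo_node *n)
{
	int bh = 0;

	for (; n; n = n->left)
		bh += !n->red;
	return bh;
}

static int demo_count(const struct demo_node *n)
{
	return n ? 1 + demo_count(n->left) + demo_count(n->right) : 0;
}

/* True if no red node in the subtree has a red child. */
static bool demo_no_red_red(const struct demo_node *n)
{
	if (!n)
		return true;
	if (n->red && (demo_is_red(n->left) || demo_is_red(n->right)))
		return false;
	return demo_no_red_red(n->left) && demo_no_red_red(n->right);
}

/* Black root, no red-red edge, and at least 2**bh - 1 nodes. */
static bool demo_check(const struct demo_node *root)
{
	int bh = demo_black_height(root);

	assert(bh < 31);	/* keep the shift below well defined */
	return !demo_is_red(root) &&
	       demo_no_red_red(root) &&
	       demo_count(root) >= (1 << bh) - 1;
}
```

A chain of three black nodes has a black height of 3 on its leftmost
path but only 3 nodes, fewer than 2**3 - 1 = 7, so the node-count
bound rejects it even though it contains no red-red edge.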






[PATCH 2/3] rbtree_test: add __init/__exit annotations

2013-03-18 Thread Davidlohr Bueso
Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 lib/rbtree_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index 66ca26d..0fea14e 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -151,7 +151,7 @@ static void check_augmented(int nr_nodes)
 	}
 }
 
-static int rbtree_test_init(void)
+static int __init rbtree_test_init(void)
 {
 	int i, j;
 	cycles_t time1, time2, time;
@@ -224,7 +224,7 @@ static int rbtree_test_init(void)
 	return -EAGAIN; /* Fail will directly unload the module */
 }
 
-static void rbtree_test_exit(void)
+static void __exit rbtree_test_exit(void)
 {
 	pr_info("test exit\n");
 }
-- 
1.7.11.7






Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-18 Thread Davidlohr Bueso
On Mon, 2013-03-18 at 16:44 -0700, Joe Perches wrote:
 On Mon, 2013-03-18 at 16:20 -0700, Davidlohr Bueso wrote:
  This provides nicer message output. Since it seems more appropriate
  for the nature of this module, also use KERN_INFO instead of other
  levels.
 []
  diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
 []
  @@ -153,7 +156,7 @@ static int rbtree_test_init(void)
  int i, j;
  cycles_t time1, time2, time;
   
  -   printk(KERN_ALERT "rbtree testing");
  +   pr_info("rbtree testing");
   
  prandom_seed_state(&rnd, 3141592653589793238ULL);
  init();
  @@ -171,7 +174,7 @@ static int rbtree_test_init(void)
  time = time2 - time1;
   
  time = div_u64(time, PERF_LOOPS);
  -   printk(" -> %llu cycles\n", (unsigned long long)time);
  +   pr_info(" -> %llu cycles\n", (unsigned long long)time);
 
 You change the output here by more than just adding a prefix.
 
 The first printk didn't have a newline.
 
 The old code would print:
 
 rbtree testing -> foo cycles
 
 The new code prints:
 
 rbtree_test: rbtree testing
 rbtree_test:  -> foo cycles
 

Ah, I see. This is the first time I'm using pr_* calls. I don't mind
the new format, it looked ok to me, but if others don't agree, I can
always resend it.

 btw: each pr_info should have a newline termination
 or be followed by some number of pr_cont/printk with
 the last one having a terminating newline.
 
 The first pr_info here doesn't have a newline
 so it's possible (though unlikely) that another
 thread could have its output appended/interleaved
 on the same line.
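
A userspace sketch of the point above, assuming snprintf() as a
stand-in for the kernel logging call: the prefix macro has the same
shape as pr_fmt(), and the whole result goes out as one
newline-terminated message. DEMO_COMPONENT, demo_fmt() and
demo_report() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_COMPONENT "rbtree_test"
/* Same shape as the pr_fmt() definition in the patch: prefix + format. */
#define demo_fmt(fmt) DEMO_COMPONENT ": " fmt

/*
 * Format the whole result as a single newline-terminated message, so
 * nothing can be interleaved between the test name and the cycle count.
 * Returns what snprintf() returns.
 */
static int demo_report(char *buf, size_t size, unsigned long long cycles)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size,
			demo_fmt("rbtree testing -> %llu cycles\n"), cycles);
}
```

With this shape the prefix appears once per line and every message
carries its own terminating newline.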
 
 




Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-19 Thread Davidlohr Bueso
On Tue, 2013-03-19 at 10:29 -0600, Shuah Khan wrote:
 On Mon, Mar 18, 2013 at 5:20 PM, Davidlohr Bueso davidlohr.bu...@hp.com 
 wrote:
  This provides nicer message output. Since it seems more appropriate
  for the nature of this module, also use KERN_INFO instead of other
  levels.
 
 Why are you changing the ALERTs to INFO?

Because of the nature of the messages. They don't justify having a
KERN_ALERT level (requiring immediate attention), and it seems a lot
more suitable to use INFO instead.

 
 
  Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
  ---
   lib/rbtree_test.c | 13 ++++++++-----
   1 file changed, 8 insertions(+), 5 deletions(-)
 
  diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
  index af38aed..66ca26d 100644
  --- a/lib/rbtree_test.c
  +++ b/lib/rbtree_test.c
  @@ -1,3 +1,6 @@
  +#define KMSG_COMPONENT "rbtree_test"
  +#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
  +
   #include <linux/module.h>
   #include <linux/rbtree_augmented.h>
   #include <linux/random.h>
  @@ -153,7 +156,7 @@ static int rbtree_test_init(void)
  int i, j;
  cycles_t time1, time2, time;
 
  -   printk(KERN_ALERT "rbtree testing");
  +   pr_info("rbtree testing");
 
 This is changing the output from KERN_ALERT to KERN_INFO. Why is this
 necessary? Should this be pr_alert() instead?
 
 
 
  prandom_seed_state(&rnd, 3141592653589793238ULL);
  init();
  @@ -171,7 +174,7 @@ static int rbtree_test_init(void)
  time = time2 - time1;
 
  time = div_u64(time, PERF_LOOPS);
  -   printk(" -> %llu cycles\n", (unsigned long long)time);
  +   pr_info(" -> %llu cycles\n", (unsigned long long)time);
 
  for (i = 0; i < CHECK_LOOPS; i++) {
  init();
  @@ -186,7 +189,7 @@ static int rbtree_test_init(void)
  check(0);
  }
 
  -   printk(KERN_ALERT "augmented rbtree testing");
  +   pr_info("augmented rbtree testing");
 
 This is changing the output from KERN_ALERT to KERN_INFO. Why is this
 necessary? Should this be pr_alert() instead?
 
 
  init();
 
  @@ -203,7 +206,7 @@ static int rbtree_test_init(void)
  time = time2 - time1;
 
  time = div_u64(time, PERF_LOOPS);
  -   printk(" -> %llu cycles\n", (unsigned long long)time);
  +   pr_info(" -> %llu cycles\n", (unsigned long long)time);
 
  for (i = 0; i < CHECK_LOOPS; i++) {
  init();
  @@ -223,7 +226,7 @@ static int rbtree_test_init(void)
 
   static void rbtree_test_exit(void)
   {
  -   printk(KERN_ALERT "test exit\n");
  +   pr_info("test exit\n");
 
 This is changing the output from KERN_ALERT to KERN_INFO. Why is this
 necessary? Should this be pr_alert() instead?
 
   }
 
   module_init(rbtree_test_init)
  --
  1.7.11.7
 
 
 
 


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 13:49 -0700, Linus Torvalds wrote:
 On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel r...@surriel.com wrote:
 
  This series makes the sysv semaphore code more scalable,
  by reducing the time the semaphore lock is held, and making
  the locking more scalable for semaphore arrays with multiple
  semaphores.
 
 The series looks sane to me, and I like how each individual step is
 pretty small and makes sense.
 
 It *would* be lovely to see this run with the actual Swingbench
 numbers. The microbenchmark always looked much nicer. Do the
 additional multi-semaphore scalability patches on top of Davidlohr's
 patches help with the swingbench issue, or are we still totally
 swamped by the ipc lock there?

Yes, I'm testing this patchset with my swingbench workloads. I should
have some numbers by today or tomorrow.

 
 Maybe there were already numbers for that, but the last swingbench
 numbers I can actually recall was from before the finer-grained
 locking..

Right, I couldn't get Oracle to run with the previous patches;
hopefully the bug(s) are now addressed.

Thanks,
Davidlohr



Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
 Include lkml in the CC: this time... *sigh*
 ---8---
 
 This series makes the sysv semaphore code more scalable,
 by reducing the time the semaphore lock is held, and making
 the locking more scalable for semaphore arrays with multiple
 semaphores.
 
 The first four patches were written by Davidlohr Buesso, and
 reduce the hold time of the semaphore lock.
 
 The last three patches change the sysv semaphore code locking
 to be more fine grained, providing a performance boost when
 multiple semaphores in a semaphore array are being manipulated
 simultaneously.
 
 On a 24 CPU system, performance numbers with the semop-multi
 test with N threads and N semaphores, look like this:
 
 threads   vanilla   Davidlohr's   Davidlohr's +     Davidlohr's +
                     patches       rwlock patches    v3 patches
 10        610652    726325        1783589           2142206
 20        341570    365699        1520453           1977878
 30        288102    307037        1498167           2037995
 40        290714    305955        1612665           2256484
 50        288620    312890        1733453           2650292
 60        289987    306043        1649360           2388008
 70        291298    306347        1723167           2717486
 80        290948    305662        1729545           2763582
 90        290996    306680        1736021           2757524
 100       292243    306700        1773700           3059159
 

After testing these patches with my Oracle Swingbench DSS workload, I
can say that there are significant improvements. The ipc lock contention
was reduced drastically, especially with larger numbers of benchmark
users. As a result, the overall %sys time went down as well.
Furthermore, throughput (in transactions per second) increased.

TPS:
100 users: 1257.21 (vanilla)    2805.06 (v3 patchset)
400 users: 1437.57 (vanilla)    2664.67 (v3 patchset)
800 users: 1236.89 (vanilla)    2750.73 (v3 patchset)

ipc lock contention:
100 users:  8.74% (vanilla)    3.17% (v3 patchset)
400 users: 21.86% (vanilla)    5.23% (v3 patchset)
800 users: 84.35% (vanilla)    7.39% (v3 patchset)

As seen with perf, the ipc lock isn't even the main source of contention
anymore. Also, regardless of the number of benchmark users, the lock is
taken mostly from semctl_main().

100 users:
     3.17%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
             |
             --- _raw_spin_lock
                |
                |--50.53%-- sem_lock
                |          |
                |          |--82.60%-- semctl_main
                |           --17.40%-- sys_semtimedop

400 users:
     5.23%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
             |
             --- _raw_spin_lock
                |
                |--75.81%-- sem_lock
                |          |
                |          |--94.09%-- semctl_main
                |           --5.91%-- sys_semtimedop

800 users:
     7.39%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
             |
             --- _raw_spin_lock
                |
                |--81.71%-- sem_lock
                |          |
                |          |--64.98%-- semctl_main
                |           --35.02%-- sys_semtimedop


Thanks,
Davidlohr




Re: [PATCH 5/7] ipc,sem: open code and rename sem_lock

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
 Rename sem_lock to sem_obtain_lock, so we can introduce a sem_lock
 function later that only locks the sem_array and does nothing else.
 
 Open code the locking from ipc_lock in sem_obtain_lock, so we can
 introduce finer grained locking for the sem_array in the next patch.
 
 Signed-off-by: Rik van Riel r...@redhat.com

Acked-by: Davidlohr Bueso davidlohr.bu...@hp.com




Re: [PATCH 6/7] ipc,sem: have only one list in struct sem_queue

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
 Having only one list in struct sem_queue, and only queueing simple
 semaphore operations on the list for the semaphore involved, allows
 us to introduce finer grained locking for semtimedop.
 
 Signed-off-by: Rik van Riel r...@redhat.com

Acked-by: Davidlohr Bueso davidlohr.bu...@hp.com




Re: [PATCH 7/7] ipc,sem: fine grained locking for semtimedop

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
 Introduce finer grained locking for semtimedop, to handle the
 common case of a program wanting to manipulate one semaphore
 from an array with multiple semaphores.
 
 If the call is a semop manipulating just one semaphore in
 an array with multiple semaphores, only take the lock for
 that semaphore itself.
 
 If the call needs to manipulate multiple semaphores, or
 another caller is in a transaction that manipulates multiple
 semaphores, the sem_array lock is taken, as well as all the
 locks for the individual semaphores.
 
 On a 24 CPU system, performance numbers with the semop-multi
 test with N threads and N semaphores, look like this:
 
 threads   vanilla   Davidlohr's   Davidlohr's +     Davidlohr's +
                     patches       rwlock patches    v3 patches
 10        610652    726325        1783589           2142206
 20        341570    365699        1520453           1977878
 30        288102    307037        1498167           2037995
 40        290714    305955        1612665           2256484
 50        288620    312890        1733453           2650292
 60        289987    306043        1649360           2388008
 70        291298    306347        1723167           2717486
 80        290948    305662        1729545           2763582
 90        290996    306680        1736021           2757524
 100       292243    306700        1773700           3059159
 
 Signed-off-by: Rik van Riel r...@redhat.com
 Suggested-by: Linus Torvalds torva...@linux-foundation.org

Acked-by: Davidlohr Bueso davidlohr.bu...@hp.com
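
The lock-selection rule described in the changelog above can be
sketched as a pure function. This is an illustration under stated
assumptions, not the patch itself: struct demo_sem_array, its
complex_pending flag and demo_pick_lock() are hypothetical names, and
no real locking is done here.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_LOCK_WHOLE_ARRAY (-1)

/* Hypothetical model of a semaphore array. */
struct demo_sem_array {
	int nsems;		/* semaphores in the array */
	bool complex_pending;	/* a multi-semaphore operation is in flight */
};

/*
 * Decide which lock a semtimedop-like call should take: the index of
 * a single semaphore, or DEMO_LOCK_WHOLE_ARRAY for the array lock
 * plus every per-semaphore lock.
 */
static int demo_pick_lock(const struct demo_sem_array *sma,
			  const int *sem_nums, int nsops)
{
	assert(nsops >= 1);
	assert(sem_nums[0] >= 0 && sem_nums[0] < sma->nsems);

	/* Simple case: one semaphore, nobody doing a multi-sem transaction. */
	if (nsops == 1 && !sma->complex_pending)
		return sem_nums[0];

	/* Otherwise fall back to locking the whole array. */
	return DEMO_LOCK_WHOLE_ARRAY;
}
```

Callers touching different semaphores get different lock indices and
so do not contend with each other, which is where the semop-multi
numbers above would be expected to come from.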



Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-21 Thread Davidlohr Bueso
On Tue, 2013-03-19 at 11:54 -0600, Shuah Khan wrote:
 On Tue, Mar 19, 2013 at 11:14 AM, Davidlohr Bueso
 davidlohr.bu...@hp.com wrote:
  On Tue, 2013-03-19 at 10:29 -0600, Shuah Khan wrote:
  On Mon, Mar 18, 2013 at 5:20 PM, Davidlohr Bueso davidlohr.bu...@hp.com 
  wrote:
   This provides nicer message output. Since it seems more appropriate
   for the nature of this module, also use KERN_INFO instead of other
   levels.
 
  Why are you changing the ALERTs to INFO?
 
  Because of the nature of the messages. They don't justify having a
  KERN_ALERT level (requiring immediate attention), and it seems a lot
  more suitable to use INFO instead.
 
 
 Hmm. I see interval_tree_test using the same alerts. It almost looks
 like the start and end of a test are meant to be alerts. I am not
 saying it shouldn't be changed, however looking for a stronger reason
 than it seems a lot more suitable to use INFO instead. Are there any
 use-cases in which KERN_ALERTs cause problems?
 

No 'issue' particularly, just common sense. In any case I have no
problem reverting the changes back to KERN_ALERT, no big deal.

Andrew, Michel, do you have any preferences? I'm mostly interested in
patch 3/3, do you have any objections?

Thanks,
Davidlohr



Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
 Include lkml in the CC: this time... *sigh*
 ---8---
 
 This series makes the sysv semaphore code more scalable,
 by reducing the time the semaphore lock is held, and making
 the locking more scalable for semaphore arrays with multiple
 semaphores.
 
 The first four patches were written by Davidlohr Buesso, and
 reduce the hold time of the semaphore lock.
 
 The last three patches change the sysv semaphore code locking
 to be more fine grained, providing a performance boost when
 multiple semaphores in a semaphore array are being manipulated
 simultaneously.
 
 On a 24 CPU system, performance numbers with the semop-multi
 test with N threads and N semaphores, look like this:
 
 threads   vanilla   Davidlohr's   Davidlohr's +     Davidlohr's +
                     patches       rwlock patches    v3 patches
 10        610652    726325        1783589           2142206
 20        341570    365699        1520453           1977878
 30        288102    307037        1498167           2037995
 40        290714    305955        1612665           2256484
 50        288620    312890        1733453           2650292
 60        289987    306043        1649360           2388008
 70        291298    306347        1723167           2717486
 80        290948    305662        1729545           2763582
 90        290996    306680        1736021           2757524
 100       292243    306700        1773700           3059159
 

Some results with semop-multi on my 4 core laptop:

threads   vanilla    v3 patchset
10        5094473    10289146
20        5079946    10187923
30        5041258    10660635
40        4942786    10876009
50        5076437    10759434
60        5139024    10797032
70        5103811    10698323
80        5094850     9959675
90        5085774    10054844
100       4939547     9798291






Re: [PATCH 1/3] rbtree_test: use pr_info for module prefix in messages

2013-03-22 Thread Davidlohr Bueso
On Thu, 2013-03-21 at 20:29 -0700, Michel Lespinasse wrote:
 On Thu, Mar 21, 2013 at 7:51 PM, Davidlohr Bueso davidlohr.bu...@hp.com 
 wrote:
  On Tue, 2013-03-19 at 11:54 -0600, Shuah Khan wrote:
  On Tue, Mar 19, 2013 at 11:14 AM, Davidlohr Bueso
  davidlohr.bu...@hp.com wrote:
   On Tue, 2013-03-19 at 10:29 -0600, Shuah Khan wrote:
   On Mon, Mar 18, 2013 at 5:20 PM, Davidlohr Bueso 
   davidlohr.bu...@hp.com wrote:
This provides nicer message output. Since it seems more appropriate
for the nature of this module, also use KERN_INFO instead of other
levels.
  
   Why are you changing the ALERTs to INFO?
  
   Because of the nature of the messages. They don't justify having a
   KERN_ALERT level (requiring immediate attention), and it seems a lot
   more suitable to use INFO instead.
  
 
  Hmm. I see interval_tree_test using the same alerts. It almost looks
  like the start and end of a test are meant to be alerts. I am not
  saying it shouldn't be changed, however looking for a stronger reason
  than it seems a lot more suitable to use INFO instead. Are there any
  use-cases in which KERN_ALERTs cause problems?
 
 
  No 'issue' particularly, just common sense. In any case I have no
  problem reverting the changes back to KERN_ALERT, no big deal.
 
  Andrew, Michel, do you have any preferences? I'm mostly interested in
  patch 3/3, do you have any objections?
 
 Sorry for the late reply - I have a lot of upstream email to catch up to.
 
 No objection to the change but I also have to say I'm not quite sure
 what's the motivation - it'd be easier if you had a 0/3 mail to
 explain the issue. In particular, I'm not sure if you've been trying
 to use the test compiled in rather than as a module (which is all I've
 ever built it as myself :)
 

Yeah, since it was a small and straightforward patchset I chose not to
send a 0/3 explaining the motivation. I was basically going through your
augmented rbtree work and noticed some property checks missing. FWIW I
only used it as a module as well.

Thanks,
Davidlohr




Re: [PATCH 3/3] rbtree_test: add more rbtree integrity checks

2013-03-22 Thread Davidlohr Bueso
On Thu, 2013-03-21 at 20:36 -0700, Michel Lespinasse wrote:
 On Mon, Mar 18, 2013 at 4:21 PM, Davidlohr Bueso davidlohr.bu...@hp.com 
 wrote:
  When checking the rbtree, account for more properties:
 
 - Both children of a red node are black.
 - The tree has at least 2**bh(v)-1 internal nodes.
 
  -		WARN_ON_ONCE(is_red(rb) &&
  -			     (!rb_parent(rb) || is_red(rb_parent(rb))));
  +
  +		if (is_red(rb)) {
  +			/*
  +			 * root must be black and no path contains two
  +			 * consecutive red nodes.
  +			 */
  +			WARN_ON_ONCE(!rb_parent(rb) || is_red(rb_parent(rb)));
  +
  +			/* both children of a red node are black */
  +			WARN_ON_ONCE(is_red(rb->rb_left) || is_red(rb->rb_right));
  +		}
 
 This seems quite redundant with the previous test - if we're going to
 visit each child, then at that point we're going to check that they
 can't be red if their parent (the current node) is red. So I don't
 see that the test adds any coverage.

Hmm ok I see your point. I'll drop this test and just keep the last one.

Thanks for taking a look,
Davidlohr



Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso
On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
 On 03/20/2013 03:55 PM, Rik van Riel wrote:
  This series makes the sysv semaphore code more scalable,
  by reducing the time the semaphore lock is held, and making
  the locking more scalable for semaphore arrays with multiple
  semaphores.
 
 Hi Rik,
 
 Another issue that came up is:
 
 [   96.347341] 
 [   96.348085] [ BUG: lock held when returning to user space! ]
 [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
 [   96.360300] 
 [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
 [   96.362019] 1 lock held by trinity-child9/7583:
 [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [8192eafb] SYSC_semtimedop+0x1fb/0xec0
 
 It seems that we can leave semtimedop without releasing the rcu read lock.
 
 I'm a bit confused by what's going on in semtimedop with regard to the rcu
 read lock; it seems that this behaviour is actually intentional?
 
 	rcu_read_lock();
 	sma = sem_obtain_object_check(ns, semid);
 	if (IS_ERR(sma)) {
 		if (un)
 			rcu_read_unlock();
 		error = PTR_ERR(sma);
 		goto out_free;
 	}
 
 When I've looked at that it seems that not releasing the read lock was (very)
 intentional.

This logic was from the original code, which I also found to be quite
confusing.

 
 After that, the only code path that would release the lock starts with:
 
 if (un) {
   ...
 
 So we won't release the lock at all if un is NULL?
 

Not necessarily, we do release everything at the end of the function: 

out_unlock_free:
sem_unlock(sma, locknum);

Thanks,
Davidlohr



Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso
On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
 On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  And you never see this problem without Rik's patches?
 
 No, never.
 
  Could you bisect
  *which* patch it starts with? Are the first four ones ok (the moving
  of the locking around, but without the fine-grained ones), for
  example?
 
 With the first four patches only, I got some X server freeze (just tried once).

Going over the code again, I found a potential recursive spinlock scenario. 
Andrew, if you have no objections, please queue this up.

Thanks.

---8<---

From: Davidlohr Bueso davidlohr.bu...@hp.com
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/sem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 1a2913d..f257afe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 				err = -EIDRM;
 				goto out_free;
 			}
+			sem_unlock(sma, -1);
 		}
 
 		sem_lock(sma, NULL, -1);
-- 
1.7.11.7
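
The control flow being fixed can be modelled in userspace with a toy
non-recursive lock that reports a second acquisition instead of
spinning forever. This is a sketch under stated assumptions: struct
demo_lock and the demo_* functions are hypothetical, and the real code
uses a spinlock, which would simply hang.

```c
#include <assert.h>
#include <errno.h>

/* Toy non-recursive lock: a second acquire is reported, not spun on. */
struct demo_lock {
	int held;
};

static int demo_lock(struct demo_lock *l)
{
	if (l->held)
		return -EDEADLK;	/* a real spinlock would hang here */
	l->held = 1;
	return 0;
}

static void demo_unlock(struct demo_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Model of the GETALL path: the branch returns with the lock held,
 * and the code after the branch takes the lock unconditionally.
 * unlock_after_branch selects between the old and the fixed flow.
 */
static int demo_getall(struct demo_lock *l, int unlock_after_branch)
{
	int err;

	err = demo_lock(l);		/* lock taken inside the conditional */
	if (err)
		return err;
	if (unlock_after_branch)
		demo_unlock(l);		/* the one-line fix */

	err = demo_lock(l);		/* unconditional lock after the branch */
	if (err)
		return err;
	demo_unlock(l);
	return 0;
}
```

With unlock_after_branch set to 0 the second acquire reports
-EDEADLK, which is the recursive locking the changelog describes; with
it set to 1 the path completes and leaves the lock released.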





Re: [PATCH 2/2] ipc: semaphores: do not hold ipc lock more than necessary

2013-03-02 Thread Davidlohr Bueso
On Sat, 2013-03-02 at 12:41 +0800, Michel Lespinasse wrote:
 On Sat, Mar 2, 2013 at 8:16 AM, Davidlohr Bueso davidlohr.bu...@hp.com 
 wrote:
  Instead of holding the ipc lock for permissions and security
  checks, among others, only acquire it when necessary.
 
  Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
 
 You got some really great test results on this; I think they deserve
 to be mentioned in the commit message.

Absolutely.

 
 Code looks fine to me otherwise, but I only had a quick look.
 
 Nice work!
 
 Acked-by: Michel Lespinasse wal...@google.com
 

Thanks for reviewing, Michel.

Davidlohr



Re: [PATCH 2/2] ipc: semaphores: do not hold ipc lock more than necessary

2013-03-02 Thread Davidlohr Bueso
On Fri, 2013-03-01 at 17:20 -0800, Linus Torvalds wrote:
 On Fri, Mar 1, 2013 at 4:16 PM, Davidlohr Bueso davidlohr.bu...@hp.com 
 wrote:
  +static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
  +{
  +	struct kern_ipc_perm *ipcp = ipc_obtain_object(&sem_ids(ns), id);
  +
  +	if (IS_ERR(ipcp))
  +		return (struct sem_array *)ipcp;
 
 This should use ERR_CAST() to make it more obvious what's going on.
 
  +static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns,
  +							int id)
  +{
  +	struct kern_ipc_perm *ipcp = ipc_obtain_object_check(&sem_ids(ns), id);
  +
  +	if (IS_ERR(ipcp))
  +		return (struct sem_array *)ipcp;
 
 Same here.

Ok

 
  +/*
  + * Call inside the rcu read section.
  + */
  +static inline void sem_getref(struct sem_array *sma)
  +{
  +	spin_lock(&(sma)->sem_perm.lock);
  +	ipc_rcu_getref(sma);
  +	ipc_unlock(&(sma)->sem_perm);
  +}
 
 This really makes me wonder if we shouldn't just use an atomic counter
 for refcount. But I guess that would be a separate patch.
 

Ah, yes indeed.

 But all the uses of refcount really look like the normal atomic ops
 might be the right thing. Especially if we no longer expect to hold
 the lock most of the time.
 
  +	spin_lock(&sma->sem_perm.lock);
 
 I really would almost want to make these things be ipc_lock_object()
 rather than an open-coded spinlock like this. But that's not a big
 deal.

Sure.

 
 Patch looks fine to me in general.
 

Thanks for taking a look!

Davidlohr
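
As a rough illustration of the atomic refcount idea raised above, here
is a userspace sketch using C11 atomics. It is not the ipc code:
struct demo_obj and the demo_* functions are hypothetical, and the
kernel would use its own atomic primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; no lock is needed to take a reference. */
struct demo_obj {
	atomic_int refcount;
};

/* The creator holds the initial reference. */
static void demo_init(struct demo_obj *obj)
{
	atomic_init(&obj->refcount, 1);
}

/* Take a reference; the caller must already hold one. */
static void demo_get(struct demo_obj *obj)
{
	int old = atomic_fetch_add_explicit(&obj->refcount, 1,
					    memory_order_relaxed);

	assert(old > 0);
	(void)old;
}

/* Drop a reference; returns true when the last one went away. */
static bool demo_put(struct demo_obj *obj)
{
	int old = atomic_fetch_sub_explicit(&obj->refcount, 1,
					    memory_order_acq_rel);

	assert(old > 0);
	return old == 1;
}
```

The caller that sees demo_put() return true is the one responsible for
freeing the object, so a get/put pair no longer needs the
lock/increment/unlock sequence shown in sem_getref() above.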



Re: [RFC PATCH 0/2] ipc: do not hold ipc lock more than necessary

2013-03-02 Thread Davidlohr Bueso
On Sat, 2013-03-02 at 15:35 +0700, Emmanuel Benisty wrote:
> On Sat, Mar 2, 2013 at 2:08 PM, Michel Lespinasse wal...@google.com wrote:
> > On Sat, Mar 2, 2013 at 12:43 PM, Emmanuel Benisty benist...@gmail.com wrote:
> >> Hi,
> >>
> >> On Sat, Mar 2, 2013 at 7:16 AM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
> >>> The following set of not-thoroughly-tested patches are based on the
> >>> discussion of holding the ipc lock unnecessarily, such as for permissions
> >>> and security checks:
> >>>
> >>> https://lkml.org/lkml/2013/2/28/540
> >>>
> >>> Patch 0/1: Introduces new functions, analogous to ipc_lock and ipc_lock_check
> >>> in the ipc utility code, allowing to obtain the ipc object without holding the lock.
> >>>
> >>> Patch 0/2: Use the new functions and only acquire the ipc lock when needed.
> >>
> >> Not sure how much a work in progress this is but my machine dies
> >> immediately when I start chromium, crappy mobile phone picture here:
> >> http://i.imgur.com/S0hfPz3.jpg
> >
> > We are missing the top of the trace there, so it's hard to be sure -
> > however, this could well be caused by the if (!out) check (instead of
> > if (IS_ERR(out)) that I noticed in patch 1/2.
>
> Merci Michel but unfortunately, I'm still getting the same issue.

Will try to reproduce (and further testing on other machines) and debug
later today.

Thanks for testing,
Davidlohr



Re: [RFC PATCH 0/2] ipc: do not hold ipc lock more than necessary

2013-03-02 Thread Davidlohr Bueso
On Fri, 2013-03-01 at 17:32 -0800, Linus Torvalds wrote:
> On Fri, Mar 1, 2013 at 4:16 PM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
> >
> > With Rik's semop-multi.c microbenchmark we can see the following
> > results:
>
> Ok, that certainly looks very good.
>
> > +  59.40%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
> > +  17.47%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
>
> I had somewhat high expectations, but that's just better than I really
> hoped for. Not only is the percentage down, it's down for the case of
> a much smaller number of overall cycle cost, so it's a really big
> reduction in contention spinning.
>
> Of course, contention will come back and overwhelm you at *some*
> point, but it seems the patches certainly moved the really bad
> contention point out some way..
>
> > +   6.14%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
> > +  11.08%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
> > While the _raw_spin_lock time is drastically reduced, others do increase.
> > This results in an overall speedup of ~1.7x regarding ops/sec.
>
> Actually, the others don't really increase. Sure, the *percentages* go
> up, but that's just because it has to add up to 100% in the end. So
> it's not that you're moving costs from one place to another - the 1.7x
> speedup is the real reduction in costs, and then that 6.14% -> 11.08%
> growth is really nothing but that (and yes, 1.7 x 6.14 really does
> get pretty close).
>
> So nothing really got slower, despite the percentages going up.
>
> Looks good to me. Of course, the *real* issue is if this is a win on
> real code too. And I bet it is, it just won't be quite as noticeable.
> But if anything, real code is likely to have less contention to begin
> with, because it has more things going on outside of the spinlocks. So
> it should see an improvement, but not nearly the kind of improvement
> you quote here.
>
> Although your 800-user swingbench numbers were pretty horrible, so
> maybe that case can improve by comparable amounts in the bad cases.
 

Absolutely, I'll be sure to try these changes with my Oracle workloads
and report with some numbers. This obviously still needs a lot of
testing.

Thanks,
Davidlohr
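
The percentage argument quoted in this message can be checked with a few lines of arithmetic. A profile share is a fraction of each run's own total, so to compare two runs the "after" share has to be divided by the speedup, which expresses it in units of the baseline's per-operation cost. The helper name below is hypothetical, and 1.7 is the approximate speedup quoted in the thread.

```c
#include <assert.h>

/*
 * Convert a profile percentage from the faster run into the baseline
 * run's cost units: the same work now takes 1/speedup of the time, so
 * each percentage point is worth proportionally less.
 */
double share_in_baseline_units(double after_pct, double speedup)
{
	assert(speedup > 0.0);
	return after_pct / speedup;
}
```

With the figures quoted above, sys_semtimedop comes out at 11.08 / 1.7, roughly 6.5, against a baseline of 6.14 (essentially unchanged), while _raw_spin_lock comes out at 17.47 / 1.7, roughly 10.3, against a baseline of 59.40.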



Re: [RFC PATCH 1/2] ipc: introduce obtaining a lockless ipc object

2013-03-02 Thread Davidlohr Bueso
On Sat, 2013-03-02 at 13:24 -0800, Linus Torvalds wrote:
> On Fri, Mar 1, 2013 at 4:16 PM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
> > @@ -784,7 +806,7 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
> >  	int err;
> >
> >  	down_write(&ids->rw_mutex);
> > -	ipcp = ipc_lock_check(ids, id);
> > +	ipcp = ipc_obtain_object_check(ids, id);
> >  	if (IS_ERR(ipcp)) {
> >  		err = PTR_ERR(ipcp);
> >  		goto out_up;
> > @@ -801,7 +823,7 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
> >  		return ipcp;
> >
> >  	err = -EPERM;
> > -	ipc_unlock(ipcp);
> > +	rcu_read_unlock();
> >  out_up:
> >  	up_write(&ids->rw_mutex);
> >  	return ERR_PTR(err);
>
> Uhhuh. This is very buggy, and I think it's the reason for the later
> bugs that Emmanuel reported.

Yes, quite buggy. I was able to mess up three different machines with
this, and since semaphores aren't the only users of ipcctl_pre_down(),
it could explain the sys_shmctl() call in the trace Emmanuel reported. 

 
> In particular, the *non-error* case is buggy, where it in the middle
> of the function does
>
>     return ipcp;
>
> for a successful lookup.
>
> It used to return a locked ipcp, now it no longer does. And you didn't
> change any of the callers, which still do the ipc_unlock() at the
> end.  So all the locking gets completely confused.
 

After updating the callers, [msgctl, semctl, shmctl]_down, to acquire
the lock for IPC_RMID and IPC_SET commands, I'm no longer seeing these
issues - so far on my regular laptop and two big boxes running my Oracle
benchmarks for a few hours. Something like below (yes, I will address
the open coded spin_lock calls):

@@ -1101,16 +1138,20 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
 
 	switch(cmd){
 	case IPC_RMID:
+		spin_lock(&sma->sem_perm.lock);
 		freeary(ns, ipcp);
 		goto out_up;
 	case IPC_SET:
+		spin_lock(&sma->sem_perm.lock);
 		err = ipc_update_perm(&semid64.sem_perm, ipcp);
 		if (err)
 			goto out_unlock;
 		sma->sem_ctime = get_seconds();
 		break;
 	default:
+		rcu_read_unlock();
 		err = -EINVAL;
+		goto out_up;
 	}
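
The locking contract behind this fix can be modelled in userspace: once the lookup helper returns the object unlocked, each command that modifies it must take the lock itself, and paths that never took it must not release it. In this rough sketch a plain counter stands in for the spinlock so that balance can be asserted. All `model_` names are hypothetical, and the real semctl_down() does considerably more (RCU, rw_mutex, freeary() dropping the lock on its own).

```c
#include <assert.h>
#include <errno.h>

enum model_cmd { MODEL_RMID, MODEL_SET, MODEL_OTHER };

struct model_obj {
	int lock_depth;	/* 0 = unlocked, 1 = locked, anything else is a bug */
	int mode;
	int removed;
};

static inline void model_lock(struct model_obj *o)
{
	assert(o->lock_depth == 0);
	o->lock_depth++;
}

static inline void model_unlock(struct model_obj *o)
{
	assert(o->lock_depth == 1);
	o->lock_depth--;
}

/* Lookup and permission check only: the object comes back *unlocked*. */
static inline struct model_obj *model_pre_down_nolock(struct model_obj *o)
{
	return o;
}

int model_ctl_down(struct model_obj *o, enum model_cmd cmd, int new_mode)
{
	struct model_obj *p = model_pre_down_nolock(o);

	switch (cmd) {
	case MODEL_RMID:
		model_lock(p);		/* caller takes the lock, not the helper */
		p->removed = 1;
		break;
	case MODEL_SET:
		model_lock(p);
		p->mode = new_mode;
		break;
	default:
		return -EINVAL;		/* lock never taken, nothing to release */
	}

	model_unlock(p);
	return 0;
}
```

The asserts inside model_lock() and model_unlock() would trip on exactly the kind of mismatch described in the review: a caller releasing a lock that the helper no longer acquires.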



[PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-05 Thread Davidlohr Bueso
Hi,

The following set of patches are based on the discussion of holding the 
ipc lock unnecessarily, such as for permissions and security checks:

https://lkml.org/lkml/2013/2/28/540

Patch 1/4: Remove the bogus comment from ipc_checkid() requiring that
the ipc lock be held before calling it. Also simplify the function
return. This is a new patch, not present in the RFC.

Patch 2/4: Introduce functions to obtain the ipc object without holding
the lock. Two functions, ipc_obtain_object() and
ipc_obtain_object_check() are created, which are analogous to
ipc_lock() and ipc_lock_check(). This patch was acked by Michel
Lespinasse and reviewed by Chegu Vinod.

Patch 3/4: Introduce ipcctl_pre_down_nolock() function, which is a
lockless version of ipcctl_pre_down(). This function is common to sem,
msg and shm and does some common checking for IPC_RMID and IPC_SET
commands. The older version was kept but calls the lockless version
without breaking the semantics, and is hence transparent to users. This
was suggested by Linus. Once all users are updated, the
ipcctl_pre_down() function can be removed.

Patch 4/4: Use the new, lockless, functions introduced above to only
hold the ipc lock when necessary. The idea is simple: only check ipc
security and permissions within the rcu read region, *without* holding
the ipc lock. This patch was acked by Michel Lespinasse and reviewed by
Chegu Vinod.

Changes since v1 (RFC):
- Add patches 1 and 3.

- Patch 2: In ipc_lock(), instead of checking the return of
ipc_obtain_object_check() against NULL, use IS_ERR(). Suggested by
Michel Lespinasse.

- Patch 2,4: In order for the rcu read lock/unlock calls to be paired up
more obviously, force the user to call rcu_read_lock *before* calling
ipc_obtain_object[_check](). Suggested by Michel Lespinasse.

- Patch 4: Return ERR_CAST() in sem_obtain_object[_check]() instead of a
cast to struct sem_array *. Suggested by Linus.

- Patch 4: Change open coded spin_lock calls to ipc_object_lock in
semaphore code. Suggested by Linus.

- Patch 4: Added an 'out_wakup' label to semctl_main() and semtimedop()
so that these functions can return without calling sem_unlock (and
hence spin_unlock) when the lock is not held.

- More tests: For the past few days I've been running this patchset on
my own laptop, and on 2 and 8 socket machines running my Oracle
swingbench workloads. I have not encountered any issues so far. The main
fix, pointed out by Linus, addresses the bogus ipcctl_pre_down() changes
that were made without updating the callers.

Ok, some numbers...

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+  59.40%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+   6.14%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   3.84%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   3.64%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   2.06%  a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   1.86%  a.out  [kernel.kallsyms]  [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+  18.54%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+  11.72%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   7.70%  a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   6.58%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   6.54%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   4.71%  a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check


2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%
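
For reference, the headline ratios implied by the figures above can be worked out with two small helpers. This is plain arithmetic on the numbers reported in this message; the function names are hypothetical.

```c
#include <assert.h>

/* Throughput ratio between two runs of the same duration. */
double ops_speedup(double before_ops_per_sec, double after_ops_per_sec)
{
	assert(before_ops_per_sec > 0.0);
	return after_ops_per_sec / before_ops_per_sec;
}

/* Relative reduction, in percent, of a measured share. */
double relative_drop_pct(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

Applied to the data above, ops_speedup(5048409, 9105213) is about 1.80 for the microbenchmark, while the swingbench %sys share spent in the ipc lock drops by about 7.2% (100 users), 8.8% (400 users) and 7.9% (800 users) in relative terms.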

Thanks,
Davidlohr






[PATCH v2 1/4] ipc: remove bogus lock comment for ipc_checkid

2013-03-05 Thread Davidlohr Bueso
There is no reason to be holding the ipc lock while
reading ipcp->seq, hence remove misleading comment.

Also simplify the return value for the function.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/util.h | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/ipc/util.h b/ipc/util.h
index eeb79a1..ac1480a 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -150,14 +150,9 @@ static inline int ipc_buildid(int id, int seq)
 	return SEQ_MULTIPLIER * seq + id;
 }
 
-/*
- * Must be called with ipcp locked
- */
 static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
 {
-	if (uid / SEQ_MULTIPLIER != ipcp->seq)
-		return 1;
-	return 0;
+	return uid / SEQ_MULTIPLIER != ipcp->seq;
 }
 
 static inline void ipc_lock_by_ptr(struct kern_ipc_perm *perm)
-- 
1.7.11.7
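
To make the id arithmetic in the patch above concrete, here is a small userspace sketch of how a user-visible ipc id combines a slot index with a sequence number, and how the simplified check detects a stale id. The `model_` names are hypothetical, the multiplier value 32768 is an assumption about kernels of that era, and the modulo in model_id_to_idx() is assumed to match what ipcid_to_idx() does.

```c
#include <assert.h>

#define MODEL_SEQ_MULTIPLIER 32768	/* assumed value of SEQ_MULTIPLIER */

struct model_perm {
	int seq;	/* sequence number stored in the object */
};

/* User-visible id: the sequence number in the high part, the index in the low part. */
int model_buildid(int idx, int seq)
{
	assert(idx >= 0 && idx < MODEL_SEQ_MULTIPLIER);
	return MODEL_SEQ_MULTIPLIER * seq + idx;
}

/* Recover the slot index from a user-visible id. */
int model_id_to_idx(int id)
{
	return id % MODEL_SEQ_MULTIPLIER;
}

/* Nonzero when the id's sequence number does not match the object (stale id). */
int model_checkid(const struct model_perm *p, int uid)
{
	return uid / MODEL_SEQ_MULTIPLIER != p->seq;
}
```

The check only reads one integer field and compares it, which is consistent with the commit message's point that no lock is needed around it.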







[PATCH v2 4/4] ipc: sem: do not hold ipc lock more than necessary

2013-03-05 Thread Davidlohr Bueso
Instead of holding the ipc lock for permissions and security
checks, among others, only acquire it when necessary.

Some numbers

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+  59.40%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+   6.14%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   3.84%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   3.64%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   2.06%  a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   1.86%  a.out  [kernel.kallsyms]  [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+  18.54%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+  11.72%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   7.70%  a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   6.58%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   6.54%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   4.71%  a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check

2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8,74%
400 swingbench users: 21,86%
800 swingbench users: 84,35%

With this patchset:
100 swingbench users: 8,11%
400 swingbench users: 19,93%
800 swingbench users: 77,69%

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
Reviewed-by: Chegu Vinod chegu_vi...@hp.com
Acked-by: Michel Lespinasse wal...@google.com
CC: Rik van Riel r...@redhat.com
CC: Jason Low jason.l...@hp.com
CC: Emmanuel Benisty benist...@gmail.com
---
 ipc/sem.c  | 157 +++--
 ipc/util.h |   5 ++
 2 files changed, 115 insertions(+), 47 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1..f06a853 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -204,13 +204,34 @@ static inline struct sem_array *sem_lock(struct ipc_namespace *ns, int id)
 	return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
+{
+	struct kern_ipc_perm *ipcp = ipc_obtain_object(&sem_ids(ns), id);
+
+	if (IS_ERR(ipcp))
+		return ERR_CAST(ipcp);
+
+	return container_of(ipcp, struct sem_array, sem_perm);
+}
+
 static inline struct sem_array *sem_lock_check(struct ipc_namespace *ns,
 						int id)
 {
 	struct kern_ipc_perm *ipcp = ipc_lock_check(&sem_ids(ns), id);
 
 	if (IS_ERR(ipcp))
-		return (struct sem_array *)ipcp;
+		return ERR_CAST(ipcp);
+
+	return container_of(ipcp, struct sem_array, sem_perm);
+}
+
+static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns,
+							int id)
+{
+	struct kern_ipc_perm *ipcp = ipc_obtain_object_check(&sem_ids(ns), id);
+
+	if (IS_ERR(ipcp))
+		return ERR_CAST(ipcp);
 
 	return container_of(ipcp, struct sem_array, sem_perm);
 }
@@ -234,6 +255,16 @@ static inline void sem_putref(struct sem_array *sma)
 	ipc_unlock(&(sma)->sem_perm);
 }
 
+/*
+ * Call inside the rcu read section.
+ */
+static inline void sem_getref(struct sem_array *sma)
+{
+	spin_lock(&(sma)->sem_perm.lock);
+	ipc_rcu_getref(sma);
+	ipc_unlock(&(sma)->sem_perm);
+}
+
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
 {
 	ipc_rmid(&sem_ids(ns), &s->sem_perm);
@@ -842,18 +873,25 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
 	case SEM_STAT:
 	{
 		struct semid64_ds tbuf;
-		int id;
+		int id = 0;
+
+		memset(&tbuf, 0, sizeof(tbuf));
 
 		if (cmd == SEM_STAT) {
-			sma = sem_lock(ns, semid);
-			if (IS_ERR(sma))
-				return PTR_ERR(sma);
+			rcu_read_lock();
+			sma = sem_obtain_object(ns, semid);
+			if (IS_ERR(sma)) {
+				err = PTR_ERR(sma);
+				goto out_unlock;
+			}
 			id = sma->sem_perm.id;
 		} else {
-			sma = sem_lock_check(ns, semid);
-			if (IS_ERR(sma))
-				return PTR_ERR(sma);
-			id = 0;
+			rcu_read_lock();
+			sma

[PATCH v2 2/4] ipc: introduce obtaining a lockless ipc object

2013-03-05 Thread Davidlohr Bueso
Through ipc_lock() and therefore ipc_lock_check() we currently
return the locked ipc object. This is not necessary for all situations
and can, therefore, incur unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only look up and return the ipc object.

Both these functions must be called within the RCU read critical section.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
Reviewed-by: Chegu Vinod chegu_vi...@hp.com
Acked-by: Michel Lespinasse wal...@google.com
---
 ipc/util.c | 71 ++
 ipc/util.h |  2 ++
 2 files changed, 60 insertions(+), 13 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 464a8ab..65c3d6c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -668,6 +668,28 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
 }
 
 /**
+ * ipc_obtain_object
+ * @ids: ipc identifier set
+ * @id: ipc id to look for
+ *
+ * Look for an id in the ipc ids idr and return associated ipc object.
+ *
+ * Call inside the RCU critical section.
+ * The ipc object is *not* locked on exit.
+ */
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id)
+{
+	struct kern_ipc_perm *out;
+	int lid = ipcid_to_idx(id);
+
+	out = idr_find(&ids->ipcs_idr, lid);
+	if (!out)
+		return ERR_PTR(-EINVAL);
+
+	return out;
+}
+
+/**
  * ipc_lock - Lock an ipc structure without rw_mutex held
  * @ids: IPC identifier set
  * @id: ipc id to look for
@@ -680,27 +702,50 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
 {
 	struct kern_ipc_perm *out;
-	int lid = ipcid_to_idx(id);
-
+
 	rcu_read_lock();
-	out = idr_find(&ids->ipcs_idr, lid);
-	if (out == NULL) {
-		rcu_read_unlock();
-		return ERR_PTR(-EINVAL);
-	}
+	out = ipc_obtain_object(ids, id);
+	if (IS_ERR(out))
+		goto err1;
 
 	spin_lock(&out->lock);
-
+
 	/* ipc_rmid() may have already freed the ID while ipc_lock
 	 * was spinning: here verify that the structure is still valid
 	 */
-	if (out->deleted) {
-		spin_unlock(&out->lock);
-		rcu_read_unlock();
-		return ERR_PTR(-EINVAL);
-	}
+	if (out->deleted)
+		goto err0;
 
 	return out;
+err0:
+	spin_unlock(&out->lock);
+err1:
+	rcu_read_unlock();
+	return ERR_PTR(-EINVAL);
+}
+
+/**
+ * ipc_obtain_object_check
+ * @ids: ipc identifier set
+ * @id: ipc id to look for
+ *
+ * Similar to ipc_obtain_object() but also checks
+ * the ipc object reference counter.
+ *
+ * Call inside the RCU critical section.
+ * The ipc object is *not* locked on exit.
+ */
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id)
+{
+	struct kern_ipc_perm *out = ipc_obtain_object(ids, id);
+
+	if (IS_ERR(out))
+		goto out;
+
+	if (ipc_checkid(out, id))
+		return ERR_PTR(-EIDRM);
+out:
+	return out;
 }
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
diff --git a/ipc/util.h b/ipc/util.h
index ac1480a..bfc8d4e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -123,6 +123,7 @@ void ipc_rcu_getref(void *ptr);
 void ipc_rcu_putref(void *ptr);
 
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
@@ -168,6 +169,7 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 }
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
struct ipc_ops *ops, struct ipc_params *params);
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-- 
1.7.11.7
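
The two lookup helpers added by the patch above follow a simple layering: a plain lookup that fails with -EINVAL when the slot is empty, and a checked variant that additionally fails with -EIDRM when the id carries a stale sequence number. The sketch below models only that layering in userspace. A fixed array stands in for the idr, RCU is not modelled at all, errors are returned as integers with an out-parameter instead of ERR_PTR() for brevity, and all `model_` names and constants are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_SEQ_MULTIPLIER 32768	/* assumed id layout: seq * multiplier + idx */
#define MODEL_SLOTS 4

struct model_perm {
	int seq;
};

/* Stands in for the idr that maps slot indexes to objects. */
static struct model_perm *model_table[MODEL_SLOTS];

/* Plain lookup: no locking, no sequence check. */
int model_obtain_object(int id, struct model_perm **out)
{
	int idx = id % MODEL_SEQ_MULTIPLIER;

	assert(out != NULL);
	if (idx < 0 || idx >= MODEL_SLOTS || !model_table[idx])
		return -EINVAL;

	*out = model_table[idx];
	return 0;
}

/* Lookup plus validation of the sequence number encoded in the id. */
int model_obtain_object_check(int id, struct model_perm **out)
{
	struct model_perm *found = NULL;
	int err = model_obtain_object(id, &found);

	assert(out != NULL);
	if (err)
		return err;
	if (id / MODEL_SEQ_MULTIPLIER != found->seq)
		return -EIDRM;	/* slot was reused since this id was handed out */

	*out = found;
	return 0;
}
```

Neither function takes the object lock, which is the point of the patch: callers that only need permission or security checks can stop here, and callers that need to modify the object lock it afterwards.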









[PATCH v2 3/4] ipc: introduce lockless pre_down ipcctl

2013-03-05 Thread Davidlohr Bueso
Various forms of ipc use the ipcctl_pre_down() function to
retrieve an ipc object and check permissions, mostly for IPC_RMID
and IPC_SET commands.

Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is maintained, yet modified to call the nolock version,
without affecting its semantics, thus transparent to all ipc callers.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
Suggested-by: Linus Torvalds torva...@linux-foundation.org
---
 ipc/util.c | 31 ++-
 ipc/util.h |  3 +++
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 65c3d6c..6a98e62 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -825,11 +825,28 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 				      struct ipc64_perm *perm, int extra_perm)
 {
 	struct kern_ipc_perm *ipcp;
+
+	ipcp = ipcctl_pre_down_nolock(ns, ids, id, cmd, perm, extra_perm);
+	if (IS_ERR(ipcp))
+		goto out;
+
+	spin_lock(&ipcp->lock);
+out:
+	return ipcp;
+}
+
+struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+					     struct ipc_ids *ids, int id, int cmd,
+					     struct ipc64_perm *perm, int extra_perm)
+{
 	kuid_t euid;
-	int err;
+	int err = -EPERM;
+	struct kern_ipc_perm *ipcp;
 
 	down_write(&ids->rw_mutex);
-	ipcp = ipc_lock_check(ids, id);
+	rcu_read_lock();
+
+	ipcp = ipc_obtain_object_check(ids, id);
 	if (IS_ERR(ipcp)) {
 		err = PTR_ERR(ipcp);
 		goto out_up;
@@ -838,17 +855,21 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 	audit_ipc_obj(ipcp);
 	if (cmd == IPC_SET)
 		audit_ipc_set_perm(extra_perm, perm->uid,
-					 perm->gid, perm->mode);
+				   perm->gid, perm->mode);
 
 	euid = current_euid();
 	if (uid_eq(euid, ipcp->cuid) || uid_eq(euid, ipcp->uid)  ||
 	    ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return ipcp;
 
-	err = -EPERM;
-	ipc_unlock(ipcp);
 out_up:
+	/*
+	 * Unsuccessful lookup, unlock and return
+	 * the corresponding error.
+	 */
+	rcu_read_unlock();
 	up_write(&ids->rw_mutex);
+
 	return ERR_PTR(err);
 }
 
diff --git a/ipc/util.h b/ipc/util.h
index bfc8d4e..13d92fe 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -128,6 +128,9 @@ struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
+struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+					     struct ipc_ids *ids, int id, int cmd,
+					     struct ipc64_perm *perm, int extra_perm);
 struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 				      struct ipc_ids *ids, int id, int cmd,
 				      struct ipc64_perm *perm, int extra_perm);
-- 
1.7.11.7
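
The compatibility approach in the patch above, keeping the old entry point as a thin wrapper that calls the lockless helper and then takes the lock, can be illustrated in a few lines. In this sketch NULL stands in for an error pointer and a counter stands in for the spinlock. The `model_` names are hypothetical, and the rw_mutex and RCU read lock that the real helper holds on success are left out.

```c
#include <assert.h>
#include <stddef.h>

struct model_obj {
	int lock_depth;	/* 0 = unlocked, 1 = locked */
	int allowed;	/* result of the permission check */
};

/* New helper: lookup and permission check, object returned unlocked. */
struct model_obj *model_pre_down_nolock(struct model_obj *o)
{
	if (!o || !o->allowed)
		return NULL;
	return o;
}

/* Old entry point: same contract as before, object locked on success. */
struct model_obj *model_pre_down(struct model_obj *o)
{
	struct model_obj *p = model_pre_down_nolock(o);

	if (!p)
		goto out;

	assert(p->lock_depth == 0);
	p->lock_depth++;	/* existing callers still unlock at the end */
out:
	return p;
}
```

Existing callers keep calling model_pre_down() and see no change, while converted callers switch to model_pre_down_nolock() and lock only around the part that modifies the object.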







Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-05 Thread Davidlohr Bueso
On Tue, 2013-03-05 at 07:40 -0800, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso davidlohr.bu...@hp.com wrote:
> >
> > The following set of patches are based on the discussion of holding the
> > ipc lock unnecessarily, such as for permissions and security checks:
>
> Ok, looks fine from a quick look (but then, so did your previous patch-set ;)
>
> You still open-code the spinlock in at least a few places (I saw
> sem_getref), but I still don't care deeply.
>
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
>
> Ok, I hoped for it being more noticeable. Since that benchmark is less
> trivial than Rik's, can you do a perf record -fg of it and give a more
> complete picture of what the kernel footprint is - and in particular
> who now gets that ipc lock function? Is it purely semtimedop, or what?
> Look out for inlining - ipc_rcu_getref() looks like it would be
> inlined, for example.
>
> It would be good to get a top twenty kernel functions from the
> profile, along with some call data on where the lock callers are.. I
> know that Rik's benchmark *only* had that one call-site, I'm wondering
> if the swingbench one has slightly more complex behavior...

For a 400 user workload (the kernel functions remain basically the same
for any amount of users):

17.86%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
 8.46%  swapper  [kernel.kallsyms]   [k] intel_idle
 5.51%   oracle  [kernel.kallsyms]   [k] try_atomic_semop
 5.05%   oracle  [kernel.kallsyms]   [k] update_sd_lb_stats
 2.81%   oracle  [kernel.kallsyms]   [k] tg_load_down
 2.41%  swapper  [kernel.kallsyms]   [k] update_blocked_averages
 2.38%   oracle  [kernel.kallsyms]   [k] idle_cpu
 2.37%  swapper  [kernel.kallsyms]   [k] native_write_msr_safe
 2.28%   oracle  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
 1.84%   oracle  [kernel.kallsyms]   [k] update_blocked_averages
 1.79%   oracle  [kernel.kallsyms]   [k] update_queue
 1.73%  swapper  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
 1.29%   oracle  [kernel.kallsyms]   [k] native_write_msr_safe
 1.07%     java  [kernel.kallsyms]   [k] update_sd_lb_stats
 0.91%  swapper  [kernel.kallsyms]   [k] poll_idle
 0.86%   oracle  [kernel.kallsyms]   [k] try_to_wake_up
 0.80%     java  [kernel.kallsyms]   [k] tg_load_down
 0.72%   oracle  [kernel.kallsyms]   [k] load_balance
 0.67%   oracle  [kernel.kallsyms]   [k] __schedule
 0.67%   oracle  [kernel.kallsyms]   [k] cpumask_next_and

Digging into the _raw_spin_lock call:

 17.86%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock
          |
          --- _raw_spin_lock
             |
             |--49.55%-- sys_semtimedop
             |          |
             |          |--77.41%-- system_call
             |          |          semtimedop
             |          |          skgpwwait
             |          |          ksliwat
             |          |          kslwaitctx

Thanks,
Davidlohr




Re: [PATCH v2 0/4] ipc: reduce ipc lock contention

2013-03-05 Thread Davidlohr Bueso
On Tue, 2013-03-05 at 22:53 -0500, Rik van Riel wrote:
> On 03/05/2013 10:46 PM, Waiman Long wrote:
> > On 03/05/2013 03:53 PM, Rik van Riel wrote:
> >>
> >> Indeed.  Though how well my patches will work with Oracle will
> >> depend a lot on what kind of semctl syscalls they are doing.
> >>
> >> Does Oracle typically do one semop per semctl syscall, or does
> >> it pass in a whole bunch at once?
> >
> > i had collected a strace log of Oracle instance startup a while ago. In
> > the log, almost all of the semctl() call is to set a single semaphore
> > value in one of the element of the array using SETVAL. Also there are
> > far more semtimedop() than semctl(), about 100:1. Again, all the
> > semtimedop() operations are on a single element of the semaphore array.
>
> That is good to hear. Just what I was hoping when I started
> working on my patches. You should expect them tomorrow or
> Thursday.

Great, looking forward.

Thanks,
Davidlohr




Re: [PATCH v2 7/4] ipc: fine grained locking for semtimedop

2013-03-06 Thread Davidlohr Bueso
On Wed, 2013-03-06 at 17:15 -0500, Rik van Riel wrote:
> Introduce finer grained locking for semtimedop, to handle the
> common case of a program wanting to manipulate one semaphore
> from an array with multiple semaphores.
>
> Each semaphore array has a read/write lock. If something
> complex is going on (manipulation of the array, of multiple
> semaphores in one syscall, etc), the lock is taken in exclusive
> mode.
>
> If the call is a semop manipulating just one semaphore in
> an array with multiple semaphores, the read/write lock for
> the semaphore array is taken in shared (read) mode, and the
> individual semaphore's lock is taken.
>
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
>
>            vanilla    Davidlohr's    Davidlohr's +
> threads               patches        rwlock patches
>   10       610652     726325         1783589
>   20       341570     365699         1520453
>   30       288102     307037         1498167
>   40       290714     305955         1612665
>   50       288620     312890         1733453
>   60       289987     306043         1649360
>   70       291298     306347         1723167
>   80       290948     305662         1729545
>   90       290996     306680         1736021
>   100      292243     306700         1773700

Lovely numbers :) 
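
The locking rule described in the quoted patch description can be written down as a small decision function. This is only a model of the prose above, not Rik's code; the enum and function names are hypothetical.

```c
#include <assert.h>

enum model_lock_mode {
	MODEL_LOCK_EXCLUSIVE,		/* array rwlock taken in write mode */
	MODEL_LOCK_SHARED_PLUS_SEM	/* array rwlock in read mode + per-semaphore lock */
};

/*
 * nsops: number of operations passed to the syscall
 * nsems: number of semaphores in the array
 */
enum model_lock_mode model_choose_lock(int nsops, int nsems)
{
	assert(nsops >= 1 && nsems >= 1);

	/* Common case: one operation on one semaphore of a larger array. */
	if (nsops == 1 && nsems > 1)
		return MODEL_LOCK_SHARED_PLUS_SEM;

	/* Anything more complex serializes on the whole array. */
	return MODEL_LOCK_EXCLUSIVE;
}
```

Single-semaphore operations on different semaphores of the same array can then proceed in parallel, which is the likely source of the scaling seen in the rightmost column of the table.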

On my laptop:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 281430894, ops/sec 9381029

+  20.87%a.out  [kernel.kallsyms]   [k] sys_semtimedop
+   8.31%a.out  [kernel.kallsyms]   [k] ipc_has_perm.isra.21
+   6.88%a.out  [kernel.kallsyms]   [k] _raw_read_lock
+   6.78%a.out  [kernel.kallsyms]   [k] avc_has_perm_flags
+   5.26%a.out  [kernel.kallsyms]   [k] ipcperms
+   4.91%a.out  [kernel.kallsyms]   [k] ipc_obtain_object_check
+   4.69%a.out  [kernel.kallsyms]   [k] __audit_syscall_exit
+   4.21%a.out  [kernel.kallsyms]   [k] 
copy_user_enhanced_fast_string
+   3.61%a.out  [kernel.kallsyms]   [k] _raw_spin_lock
+   3.55%a.out  [kernel.kallsyms]   [k] system_call
+   3.35%a.out  [kernel.kallsyms]   [k] do_smart_update
+   2.77%a.out  [kernel.kallsyms]   [k] __audit_syscall_entry

But my 8 socket 160 CPU box sure isn't happy. I'm getting all sorts of
issues (sometimes it will boot, sometimes it won't). When it does, Linux
will hang as soon as I start my benchmarking:

BUG: soft lockup - CPU#77 stuck for 23s! [oracle:129877]
Modules linked in: fuse autofs4 sunrpc pcc_cpufreq ipv6 dm_mirror 
dm_region_hash dm_log dm_mod uinput iTCO_wdt iTCO_vendor_support sg freq_table 
mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr 
lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core netxen_nic ext4 mbcache jbd2 
sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul 
hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
CPU 77 
Pid: 129877, comm: oracle Tainted: G  D W3.9.0-rc1+ #20 HP ProLiant 
DL980 G7
RIP: 0010:[812777fa]  [812777fa] __read_lock_failed+0xa/0x20
RSP: 0018:8b87b8cf9ca8  EFLAGS: 0297
RAX: c900293c1020 RBX: 00010007a021 RCX: d3a5
RDX: 0001 RSI: 8b87b8cf9d58 RDI: c900293c1020
RBP: 8b87b8cf9ca8 R08:  R09: 
R10:  R11:  R12: 8b87b8cf9c68
R13: 8b87b8cf9c68 R14: 0286 R15: 8b87caf10100
FS:  7f7a689b2700() GS:8987ff9c() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fc49426d000 CR3: 0187cf08f000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process oracle (pid: 129877, threadinfo 8b87b8cf8000, task 8b87caf10100)
Stack:
 8b87b8cf9cb8 8155f374 8b87b8cf9ce8 81205245
 0001 00090002 7fff82d3aa08 
 8b87b8cf9f78 812069e1 00cbc000 8b87b8cf9f38
Call Trace:
 [8155f374] _raw_read_lock+0x14/0x20
 [81205245] sem_lock+0x85/0xa0
 [812069e1] sys_semtimedop+0x521/0x7c0
 [81089e2c] ? task_sched_runtime+0x4c/0x90
 [8101c1b3] ? native_sched_clock+0x13/0x80
 [8101b7b9] ? sched_clock+0x9/0x10
 [8108f9ed] ? sched_clock_cpu+0xcd/0x110
 [8108914b] ? update_rq_clock+0x2b/0x50
 [81089e2c] ? task_sched_runtime+0x4c/0x90
 [8108fe48] ? thread_group_cputime+0x88/0xc0
 [8108fd1d] ? cputime_adjust+0x3d/0x90
 [8108fece] ? thread_group_cputime_adjusted+0x4e/0x60
 [81568119] system_call_fastpath+0x16/0x1b
Code: 90 55 48 89 e5 f0 ff 07 f3 90 83 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 
2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 f0 48 ff 07 f3 90 48 83 3f 01 78 f8 f0 
48 ff 

[PATCH 1/2] zram: remove nonexistent discard from sysfs ABI doc

2013-02-10 Thread Davidlohr Bueso
Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
Both patches apply on top of the staging-next branch
of the staging tree.

 Documentation/ABI/testing/sysfs-block-zram | 9 -
 1 file changed, 9 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index ec93fe3..4627c33 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -52,15 +52,6 @@ Description:
        is freed. This statistic is applicable only when this disk is
        being used as a swap disk.
 
-What:      /sys/block/zram<id>/discard
-Date:      August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-       The discard file is read-only and specifies the number of
-       discard requests received by this device. These requests
-       provide information to block device regarding blocks which are
-       no longer used by filesystem.
-
 What:      /sys/block/zram<id>/zero_pages
 Date:      August 2010
 Contact:   Nitin Gupta ngu...@vflare.org
-- 
1.7.11.7






[PATCH 2/2] zram: gather statistics in a unique file

2013-02-10 Thread Davidlohr Bueso
Instead of having one sysfs file per zram statistic, group them all
in a single, reader-friendly, 'statistics' file. This not only reduces
code but also makes it easier to visualize. The new file looks like:

Number of reads:        24
Number of writes:       1055
Invalid IO:             0
Notify free:            0
Zero pages:             1042
Orig data size:         49152 bytes
Compressed data:        838 bytes
Total memory used:      53248 bytes

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/staging/zram/zram.txt |  20 ++-
 drivers/staging/zram/zram_sysfs.c | 109 --
 2 files changed, 25 insertions(+), 104 deletions(-)

diff --git a/drivers/staging/zram/zram.txt b/drivers/staging/zram/zram.txt
index 765d790..b3111bc 100644
--- a/drivers/staging/zram/zram.txt
+++ b/drivers/staging/zram/zram.txt
@@ -12,7 +12,7 @@ good amounts of memory savings. Some of the usecases include /tmp storage,
 use as swap disks, various caches under /var and maybe many more :)
 
 Statistics for individual zram devices are exported through sysfs nodes at
-/sys/block/zram<id>/
+/sys/block/zram<id>/statistics
 
 * Usage
 
@@ -42,25 +42,11 @@ Following shows a typical sequence of steps for using zram.
    mkfs.ext4 /dev/zram1
    mount /dev/zram1 /tmp
 
-4) Stats:
-   Per-device statistics are exported as various nodes under
-   /sys/block/zram<id>/
-       disksize
-       num_reads
-       num_writes
-       invalid_io
-       notify_free
-       discard
-       zero_pages
-       orig_data_size
-       compr_data_size
-       mem_used_total
-
-5) Deactivate:
+4) Deactivate:
    swapoff /dev/zram0
    umount /dev/zram1
 
-6) Reset:
+5) Reset:
    Write any positive value to 'reset' sysfs node
    echo 1 > /sys/block/zram0/reset
    echo 1 > /sys/block/zram1/reset
diff --git a/drivers/staging/zram/zram_sysfs.c 
b/drivers/staging/zram/zram_sysfs.c
index e6a929d..2aac370 100644
--- a/drivers/staging/zram/zram_sysfs.c
+++ b/drivers/staging/zram/zram_sysfs.c
@@ -119,106 +119,41 @@ static ssize_t reset_store(struct device *dev,
return len;
 }
 
-static ssize_t num_reads_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       zram_stat64_read(zram, &zram->stats.num_reads));
-}
-
-static ssize_t num_writes_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       zram_stat64_read(zram, &zram->stats.num_writes));
-}
-
-static ssize_t invalid_io_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       zram_stat64_read(zram, &zram->stats.invalid_io));
-}
-
-static ssize_t notify_free_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       zram_stat64_read(zram, &zram->stats.notify_free));
-}
-
-static ssize_t zero_pages_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%u\n", zram->stats.pages_zero);
-}
-
-static ssize_t orig_data_size_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       (u64)(zram->stats.pages_stored) << PAGE_SHIFT);
-}
-
-static ssize_t compr_data_size_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
+static ssize_t statistics_show(struct device *dev,
+                  struct device_attribute *attr, char *buf)
 {
    struct zram *zram = dev_to_zram(dev);
-
-   return sprintf(buf, "%llu\n",
-       zram_stat64_read(zram, &zram->stats.compr_size));
-}
-
-static ssize_t mem_used_total_show(struct device *dev,
-       struct device_attribute *attr, char *buf)
-{
-   u64 val = 0;
-   struct zram *zram = dev_to_zram(dev);
    struct zram_meta *meta = zram->meta;
 
-   if (zram->init_done)
-       val = zs_get_total_size_bytes(meta->mem_pool);
-
-   return sprintf(buf, "%llu\n", val);
+   return sprintf(buf,
+              "Number of reads:\t%llu\n"
+              "Number of writes:\t%llu\n"
+              "Invalid IO:\t\t%llu\n"
+              "Notify free:\t\t%llu\n"
+              "Zero pages:\t\t%u\n"
+              "Orig data size:\t\t%llu bytes\n"
+              "Compressed data:\t%llu bytes\n"
+              "Total memory used:\t%llu bytes\n"

Re: [PATCH 2/2] zram: gather statistics in a unique file

2013-02-10 Thread Davidlohr Bueso
Sorry, I forgot to include the updated ABI changes with this patch. Sending v2.

On Sun, 2013-02-10 at 20:29 -0800, Davidlohr Bueso wrote:
 Instead of having one sysfs file per zram statistic, group them all
 in a single, reader-friendly, 'statistics' file. This not only reduces
 code but also makes it easier to visualize. The new file looks like:
 
 Number of reads:24
 Number of writes:   1055
 Invalid IO: 0
 Notify free:0
 Zero pages: 1042
 Orig data size: 49152 bytes
 Compressed data:838 bytes
 Total memory used:  53248 bytes
 
 Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
 ---
  drivers/staging/zram/zram.txt |  20 ++-
  drivers/staging/zram/zram_sysfs.c | 109 
 --
  2 files changed, 25 insertions(+), 104 deletions(-)
 
 diff --git a/drivers/staging/zram/zram.txt b/drivers/staging/zram/zram.txt
 index 765d790..b3111bc 100644
 --- a/drivers/staging/zram/zram.txt
 +++ b/drivers/staging/zram/zram.txt
 @@ -12,7 +12,7 @@ good amounts of memory savings. Some of the usecases 
 include /tmp storage,
  use as swap disks, various caches under /var and maybe many more :)
  
  Statistics for individual zram devices are exported through sysfs nodes at
 -/sys/block/zramid/
 +/sys/block/zramid/statistics
  
  * Usage
  
 @@ -42,25 +42,11 @@ Following shows a typical sequence of steps for using 
 zram.
   mkfs.ext4 /dev/zram1
   mount /dev/zram1 /tmp
  
 -4) Stats:
 - Per-device statistics are exported as various nodes under
 - /sys/block/zramid/
 - disksize
 - num_reads
 - num_writes
 - invalid_io
 - notify_free
 - discard
 - zero_pages
 - orig_data_size
 - compr_data_size
 - mem_used_total
 -
 -5) Deactivate:
 +4) Deactivate:
   swapoff /dev/zram0
   umount /dev/zram1
  
 -6) Reset:
 +5) Reset:
   Write any positive value to 'reset' sysfs node
   echo 1  /sys/block/zram0/reset
   echo 1  /sys/block/zram1/reset
 diff --git a/drivers/staging/zram/zram_sysfs.c 
 b/drivers/staging/zram/zram_sysfs.c
 index e6a929d..2aac370 100644
 --- a/drivers/staging/zram/zram_sysfs.c
 +++ b/drivers/staging/zram/zram_sysfs.c
 @@ -119,106 +119,41 @@ static ssize_t reset_store(struct device *dev,
   return len;
  }
  
 -static ssize_t num_reads_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - zram_stat64_read(zram, zram-stats.num_reads));
 -}
 -
 -static ssize_t num_writes_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - zram_stat64_read(zram, zram-stats.num_writes));
 -}
 -
 -static ssize_t invalid_io_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - zram_stat64_read(zram, zram-stats.invalid_io));
 -}
 -
 -static ssize_t notify_free_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - zram_stat64_read(zram, zram-stats.notify_free));
 -}
 -
 -static ssize_t zero_pages_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %u\n, zram-stats.pages_zero);
 -}
 -
 -static ssize_t orig_data_size_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - (u64)(zram-stats.pages_stored)  PAGE_SHIFT);
 -}
 -
 -static ssize_t compr_data_size_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 +static ssize_t statistics_show(struct device *dev,
 +struct device_attribute *attr, char *buf)
  {
   struct zram *zram = dev_to_zram(dev);
 -
 - return sprintf(buf, %llu\n,
 - zram_stat64_read(zram, zram-stats.compr_size));
 -}
 -
 -static ssize_t mem_used_total_show(struct device *dev,
 - struct device_attribute *attr, char *buf)
 -{
 - u64 val = 0;
 - struct zram *zram = dev_to_zram(dev);
   struct zram_meta *meta = zram-meta;
  
 - if (zram-init_done)
 - val = zs_get_total_size_bytes(meta-mem_pool);
 -
 - return sprintf(buf, %llu\n, val);
 + return sprintf(buf,
 +Number of reads:\t%llu\n
 +Number of writes:\t%llu\n
 +Invalid IO:\t\t%llu\n
 +Notify free:\t\t%llu\n
 +Zero pages:\t\t%u\n

[PATCH v2 2/2] zram: gather statistics in a unique file

2013-02-10 Thread Davidlohr Bueso
Instead of having one sysfs file per zram statistic, group them all
in a single, reader-friendly, 'statistics' file. This not only reduces
code but also makes it easier to visualize. The new file looks like:

Number of reads:        24
Number of writes:       1055
Invalid IO:             0
Notify free:            0
Zero pages:             1042
Orig data size:         49152 bytes
Compressed data:        838 bytes
Total memory used:      53248 bytes

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 Documentation/ABI/testing/sysfs-block-zram |  85 +++---
 drivers/staging/zram/zram.txt  |  20 +-
 drivers/staging/zram/zram_sysfs.c  | 109 ++---
 3 files changed, 51 insertions(+), 163 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 4627c33..2328d29 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -21,70 +21,37 @@ Description:
device. The reset operation frees all the memory assocaited
with this device.
 
-What:  /sys/block/zramid/num_reads
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
+What:  /sys/block/zramid/statistics
+Date:  February 2013
+Contact:   Davidlohr Bueso davidlohr.bu...@hp.com
 Description:
-   The num_reads file is read-only and specifies the number of
-   reads (failed or successful) done on this device.
+   The statistics file is read-only and shows different zram
+   related statistics:
+   - number of reads (failed or successful) done on this 
device.
 
-What:  /sys/block/zramid/num_writes
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The num_writes file is read-only and specifies the number of
-   writes (failed or successful) done on this device.
+   - number of writes (failed or successful) done on this 
device.
 
-What:  /sys/block/zramid/invalid_io
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The invalid_io file is read-only and specifies the number of
-   non-page-size-aligned I/O requests issued to this device.
+   - invalid IO: Number of non-page-size-aligned I/O 
requests
+ issued to this device.
 
-What:  /sys/block/zramid/notify_free
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The notify_free file is read-only and specifies the number of
-   swap slot free notifications received by this device. These
-   notifications are send to a swap block device when a swap slot
-   is freed. This statistic is applicable only when this disk is
-   being used as a swap disk.
+   - notify free: Number of swap slot free notifications 
received
+  by this device. These notifications are 
send to
+  a swap block device when a swap slot is 
freed.
+  This statistic is applicable only when 
this disk is
+  being used as a swap disk.
 
-What:  /sys/block/zramid/zero_pages
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The zero_pages file is read-only and specifies number of zero
-   filled pages written to this disk. No memory is allocated for
-   such pages.
+   - zero pages:  Number of zero filled pages written to 
this disk.
+  No memory is allocated for such pages.
 
-What:  /sys/block/zramid/orig_data_size
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The orig_data_size file is read-only and specifies uncompressed
-   size of data stored in this disk. This excludes zero-filled
-   pages (zero_pages) since no memory is allocated for them.
-   Unit: bytes
+   - Orig data size: The uncompressed size of data stored 
in this disk.
+ This excludes zero-filled pages 
(zero_pages)
+ since no memory is allocated for them.
 
-What:  /sys/block/zramid/compr_data_size
-Date:  August 2010
-Contact:   Nitin Gupta ngu...@vflare.org
-Description:
-   The compr_data_size file is read-only and specifies compressed
-   size of data stored in this disk. So, compression ratio can be
-   calculated using orig_data_size

Re: [PATCH 2/2] zram: gather statistics in a unique file

2013-02-11 Thread Davidlohr Bueso
On Sun, 2013-02-10 at 21:41 -0800, Greg Kroah-Hartman wrote:
 On Sun, Feb 10, 2013 at 08:29:06PM -0800, Davidlohr Bueso wrote:
  Instead of having one sysfs file per zram statistic, group them all
  in a single, reader-friendly, 'statistics' file. This not only reduces
  code but also makes it easier to visualize. The new file looks like:
  
  Number of reads:24
  Number of writes:   1055
  Invalid IO: 0
  Notify free:0
  Zero pages: 1042
  Orig data size: 49152 bytes
  Compressed data:838 bytes
  Total memory used:  53248 bytes
  
  Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
 
 No, please, the rule for sysfs is one value per file, not files with
 lots of data that you need to parse.

Ok.

 
 If you want to do something like this, then do it in debugfs, but NEVER
 in sysfs.

So, would you be open to having the statistics file in debugfs and
removing the individual sysfs files?

Thanks,
Davidlohr
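
For illustration only, here is a minimal userspace sketch of why the
one-value-per-file rule quoted above is convenient for consumers: reading an
attribute reduces to a single numeric conversion, with no labels or layout to
parse. The helper name parse_single_value() is hypothetical and is not part of
any kernel or libc API.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the contents of a one-value-per-file attribute such as "1055\n".
 * Returns 0 and stores the number in *out on success, or -1 if the buffer
 * holds anything other than a single unsigned decimal number.
 * (Hypothetical helper, written for this example only.)
 */
static int parse_single_value(const char *buf, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	/* strtoull() accepts leading blanks and signs; reject them up front. */
	if (*buf < '0' || *buf > '9')
		return -1;

	errno = 0;
	val = strtoull(buf, &end, 10);
	if (errno == ERANGE)
		return -1;

	/* Allow exactly one trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;

	*out = val;
	return 0;
}
```

A multi-line file such as the proposed 'statistics' output is rejected by this
helper, which is the point: a consumer of that format would need a
label-aware parser that breaks whenever a line is added or reworded.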



Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Davidlohr Bueso
On Wed, 2013-02-27 at 21:58 -0500, Rik van Riel wrote:
 On 02/27/2013 05:13 PM, Linus Torvalds wrote:
 
  On Feb 27, 2013 1:56 PM, Rik van Riel r...@redhat.com
  mailto:r...@redhat.com wrote:
 
  No argument there, but that does in no way negate the need for some
  performance robustness.
 
  The very numbers you posted showed that the backoff was *not* more
  robust. Quite the reverse, there was arguably more variability.
 
 On the other hand, both MCS and the fast queue locks
 implemented by Michel showed low variability and high
 performance.
 
 http://thread.gmane.org/gmane.linux.kernel/1427417
 
  So I really don't like how you make these sweeping statements
  *again*. Numbers talk, bullshit walks.
 
 If you read all the text in my last mail, you will see the
 link to Michel's performance results. The numbers speak for
 themselves.
 
  The fact is, life is complicated. The simple spinlocks tend to work
  really well. People have tried fancy things before, and it turns out
  it's not as simple as they think.
 
 The numbers for both the simple spinlocks and the
 spinlock backoff kind of suck. Both of these have
 high variability, and both eventually fall down
 under heavy load.
 
 The numbers for Michel's MCS and fast queue lock
 implementations appear to be both fast and stable.
 
 I agree that we need numbers.

FWIW I've been doing some benchmarking with Swingbench DSS workloads
(Oracle data mining) comparing Rik's and Michel's patches. With lower
amounts of contention, Rik's ticket spinlock is better, but once
contention gets high enough the queued locks perform better.

The attached file shows the amount of sys time used by the ipc lock
on a 4-socket and an 8-socket box.
attachment: dss-ipclock.png

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-03-01 Thread Davidlohr Bueso
On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
 On 02/28/2013 06:09 PM, Linus Torvalds wrote:
 
  So I almost think that *everything* there in the semaphore code could
  be done under RCU. The actual spinlock doesn't seem to much matter, at
  least for semaphores. The semaphore values themselves seem to be
  protected by the atomic operations, but I might be wrong about that, I
  didn't even check.
 
 Checking try_atomic_semop and do_smart_update, it looks like neither
 is using atomic operations. That part of the semaphore code would
 still benefit from spinlocks.

Agreed.

 
 The way the code handles a whole batch of semops all at once,
 potentially to multiple semaphores at once, and with the ability
 to undo all of the operations, it looks like the spinlock will
 still need to be per block of semaphores.
 
 I guess the code may still benefit from Michel's locking code,
 after the permission stuff has been moved from under the spinlock.

How about splitting ipc_lock()/ipc_lock_control() in two calls: one to
obtain the ipc object (rcu_read_lock + idr_find), which can be called
when performing the permissions and security checks, and another to
obtain the ipcp->lock [q_]spinlock when necessary.
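
A rough userspace sketch of that two-phase idea follows, under loud
assumptions: every name in it (struct obj, obtain_object, lock_object,
update_object) is made up for the example, a C11 atomic_flag stands in for
the ipc spinlock, and the RCU read-side section that keeps the object alive
in the kernel is not modeled at all. It only shows the ordering: look up,
run the cheap checks, and take the lock last.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NOBJ 4

/* Hypothetical stand-in for an ipc object; field names are illustrative. */
struct obj {
	int in_use;            /* slot populated? */
	unsigned int owner;    /* uid allowed to operate on the object */
	atomic_flag lock;      /* toy spinlock standing in for ipcp->lock */
	int lock_acquisitions; /* how many times the lock was taken */
	int value;             /* state that really needs the lock */
};

static struct obj table[NOBJ];

static void add_object(int id, unsigned int owner)
{
	table[id].in_use = 1;
	table[id].owner = owner;
	table[id].lock_acquisitions = 0;
	table[id].value = 0;
	atomic_flag_clear(&table[id].lock);
}

/* Phase 1: look the object up without taking its lock. */
static struct obj *obtain_object(int id)
{
	if (id < 0 || id >= NOBJ || !table[id].in_use)
		return NULL;
	return &table[id];
}

/* Phase 2: take the lock only once the lock-free checks have passed. */
static void lock_object(struct obj *o)
{
	while (atomic_flag_test_and_set_explicit(&o->lock, memory_order_acquire))
		; /* spin */
	o->lock_acquisitions++;
}

static void unlock_object(struct obj *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/* Returns 0 on success, -1 for a bad id, -2 for a permission failure. */
static int update_object(int id, unsigned int uid, int delta)
{
	struct obj *o = obtain_object(id);

	if (!o)
		return -1;
	if (o->owner != uid) /* permission check runs without the lock */
		return -2;

	lock_object(o);
	o->value += delta;
	unlock_object(o);
	return 0;
}
```

In this sketch a caller that fails the permission check never touches the
lock, which is the contention saving the proposal is after.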




[RFC PATCH 0/2] ipc: do not hold ipc lock more than necessary

2013-03-01 Thread Davidlohr Bueso
The following set of not-thoroughly-tested patches is based on the
discussion about holding the ipc lock unnecessarily, such as for
permissions and security checks:

https://lkml.org/lkml/2013/2/28/540

Patch 1/2: Introduces new functions, analogous to ipc_lock and ipc_lock_check
in the ipc utility code, that obtain the ipc object without holding the lock.

Patch 2/2: Uses the new functions and only acquires the ipc lock when needed.

With Rik's semop-multi.c microbenchmark we can see the following
results:

256 sems without patches:
+  59.40%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+   6.14%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   3.84%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   3.64%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   2.06%  a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   1.86%  a.out  [kernel.kallsyms]  [k] ipc_lock
+   1.75%  a.out  [kernel.kallsyms]  [k] __audit_syscall_entry
+   1.69%  a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   1.47%  a.out  [kernel.kallsyms]  [k] do_smart_update
+   1.43%  a.out  [kernel.kallsyms]  [k] pid_vnr
+   1.39%  a.out  [kernel.kallsyms]  [k] try_atomic_semop.isra.5

total operations: 151452270, ops/sec 5048409

256 sems with patches:
+  17.47%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+  11.08%  a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   8.81%  a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   7.96%  a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   6.50%  a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   4.67%  a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check
+   4.19%  a.out  [kernel.kallsyms]  [k] ipcperms
+   3.75%  a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   3.38%  a.out  [kernel.kallsyms]  [k] system_call
+   3.05%  a.out  [kernel.kallsyms]  [k] try_atomic_semop.isra.5
+   2.70%  a.out  [kernel.kallsyms]  [k] do_smart_update
+   2.60%  a.out  [kernel.kallsyms]  [k] __audit_syscall_entry

total operations: 266502912, ops/sec 8883430

While the share of time spent in _raw_spin_lock drops drastically (59.40%
to 17.47%), the shares of the other symbols increase. Overall, throughput
improves by ~1.76x (5048409 to 8883430 ops/sec).




[RFC PATCH 1/2] ipc: introduce obtaining a lockless ipc object

2013-03-01 Thread Davidlohr Bueso
Through ipc_lock() and, therefore, ipc_lock_check() we currently
return the locked ipc object. This is not necessary in all situations,
so introduce the analogous ipc_obtain_object() and
ipc_obtain_object_check() functions, which only enter the RCU read-side
critical section and return the ipc object without acquiring the lock.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/util.c | 42 --
 ipc/util.h |  2 ++
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 464a8ab..902f282 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -667,6 +667,21 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
    out->seq = in->seq;
 }
 
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id)
+{
+   struct kern_ipc_perm *out;
+   int lid = ipcid_to_idx(id);
+
+   rcu_read_lock();
+   out = idr_find(&ids->ipcs_idr, lid);
+   if (!out) {
+       rcu_read_unlock();
+       return ERR_PTR(-EINVAL);
+   }
+
+   return out;
+}
+
+
 /**
  * ipc_lock - Lock an ipc structure without rw_mutex held
  * @ids: IPC identifier set
@@ -679,18 +694,13 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
 
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
 {
-   struct kern_ipc_perm *out;
-   int lid = ipcid_to_idx(id);
+   struct kern_ipc_perm *out = ipc_obtain_object(ids, id);
 
-   rcu_read_lock();
-   out = idr_find(&ids->ipcs_idr, lid);
-   if (out == NULL) {
-       rcu_read_unlock();
+   if (IS_ERR(out))
        return ERR_PTR(-EINVAL);
-   }
 
    spin_lock(&out->lock);
-   
+
    /* ipc_rmid() may have already freed the ID while ipc_lock
     * was spinning: here verify that the structure is still valid
     */
@@ -703,6 +713,18 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
return out;
 }
 
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id)
+{
+   struct kern_ipc_perm *out = ipc_obtain_object(ids, id);
+
+   if (IS_ERR(out))
+   return out;
+
+   if (ipc_checkid(out, id))
+   return ERR_PTR(-EIDRM);
+   return out;
+}
+
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
 {
struct kern_ipc_perm *out;
@@ -784,7 +806,7 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
    int err;
 
    down_write(&ids->rw_mutex);
-   ipcp = ipc_lock_check(ids, id);
+   ipcp = ipc_obtain_object_check(ids, id);
    if (IS_ERR(ipcp)) {
        err = PTR_ERR(ipcp);
        goto out_up;
@@ -801,7 +823,7 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
        return ipcp;
 
    err = -EPERM;
-   ipc_unlock(ipcp);
+   rcu_read_unlock();
 out_up:
    up_write(&ids->rw_mutex);
    return ERR_PTR(err);
diff --git a/ipc/util.h b/ipc/util.h
index eeb79a1..2c68035 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -123,6 +123,7 @@ void ipc_rcu_getref(void *ptr);
 void ipc_rcu_putref(void *ptr);
 
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
@@ -173,6 +174,7 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 }
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
struct ipc_ops *ops, struct ipc_params *params);
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-- 
1.7.11.7






[PATCH 2/2] ipc: semaphores: do not hold ipc lock more than necessary

2013-03-01 Thread Davidlohr Bueso
Instead of holding the ipc lock for permissions and security
checks, among others, only acquire it when necessary.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 ipc/sem.c | 94 ---
 1 file changed, 66 insertions(+), 28 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1..b74a6f7 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -204,6 +204,16 @@ static inline struct sem_array *sem_lock(struct ipc_namespace *ns, int id)
    return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
+{
+   struct kern_ipc_perm *ipcp = ipc_obtain_object(&sem_ids(ns), id);
+
+   if (IS_ERR(ipcp))
+       return (struct sem_array *)ipcp;
+
+   return container_of(ipcp, struct sem_array, sem_perm);
+}
+
 static inline struct sem_array *sem_lock_check(struct ipc_namespace *ns,
                        int id)
 {
@@ -215,6 +225,17 @@ static inline struct sem_array *sem_lock_check(struct ipc_namespace *ns,
    return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns,
+                           int id)
+{
+   struct kern_ipc_perm *ipcp = ipc_obtain_object_check(&sem_ids(ns), id);
+
+   if (IS_ERR(ipcp))
+       return (struct sem_array *)ipcp;
+
+   return container_of(ipcp, struct sem_array, sem_perm);
+}
+
 static inline void sem_lock_and_putref(struct sem_array *sma)
 {
    ipc_lock_by_ptr(&sma->sem_perm);
@@ -234,6 +255,16 @@ static inline void sem_putref(struct sem_array *sma)
    ipc_unlock(&(sma)->sem_perm);
 }
 
+/*
+ * Call inside the rcu read section.
+ */
+static inline void sem_getref(struct sem_array *sma)
+{
+   spin_lock(&(sma)->sem_perm.lock);
+   ipc_rcu_getref(sma);
+   ipc_unlock(&(sma)->sem_perm);
+}
+
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
 {
    ipc_rmid(&sem_ids(ns), &s->sem_perm);
@@ -842,18 +873,19 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
    case SEM_STAT:
    {
        struct semid64_ds tbuf;
-       int id;
+       int id = 0;
+
+       memset(&tbuf, 0, sizeof(tbuf));
 
        if (cmd == SEM_STAT) {
-           sma = sem_lock(ns, semid);
+           sma = sem_obtain_object(ns, semid);
            if (IS_ERR(sma))
                return PTR_ERR(sma);
            id = sma->sem_perm.id;
        } else {
-           sma = sem_lock_check(ns, semid);
+           sma = sem_obtain_object_check(ns, semid);
            if (IS_ERR(sma))
                return PTR_ERR(sma);
-           id = 0;
        }
 
        err = -EACCES;
@@ -864,13 +896,11 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
        if (err)
            goto out_unlock;
 
-       memset(&tbuf, 0, sizeof(tbuf));
-
        kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
        tbuf.sem_otime  = sma->sem_otime;
        tbuf.sem_ctime  = sma->sem_ctime;
        tbuf.sem_nsems  = sma->sem_nsems;
-       sem_unlock(sma);
+       rcu_read_unlock();
        if (copy_semid_to_user (arg.buf, &tbuf, version))
            return -EFAULT;
        return id;
@@ -879,7 +909,7 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
        return -EINVAL;
    }
 out_unlock:
-   sem_unlock(sma);
+   rcu_read_unlock();
    return err;
 }
 
@@ -894,11 +924,12 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
    int nsems;
    struct list_head tasks;
 
-   sma = sem_lock_check(ns, semid);
+   INIT_LIST_HEAD(&tasks);
+
+   sma = sem_obtain_object_check(ns, semid);
    if (IS_ERR(sma))
        return PTR_ERR(sma);
 
-   INIT_LIST_HEAD(&tasks);
    nsems = sma->sem_nsems;
 
    err = -EACCES;
@@ -918,7 +949,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
        int i;
 
        if(nsems > SEMMSL_FAST) {
-           sem_getref_and_unlock(sma);
+           sem_getref(sma);
 
            sem_io = ipc_alloc(sizeof(ushort)*nsems);
            if(sem_io == NULL) {
@@ -934,6 +965,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
            }
        }
 
+       spin_lock(&sma->sem_perm.lock);
        for (i = 0; i < sma->sem_nsems; i++)
            sem_io[i] = sma->sem_base[i].semval;
        sem_unlock(sma);
@@ -947,7 +979,8 @@ static int semctl_main(struct ipc_namespace *ns

[PATCH] lib/int_sqrt: optimize square root algorithm

2013-02-20 Thread Davidlohr Bueso
From: Davidlohr Bueso davidlohr.bu...@hp.com

This patch optimizes the current version of the shift-and-subtract
(hardware) algorithm, described by John von Neumann[1] and Guy L. Steele.

Iterating 1,000,000 times, perf shows for the current version:

 Performance counter stats for './sqrt-curr' (10 runs):

         27.170996 task-clock                #    0.979 CPUs utilized            ( +-  3.19% )
                 3 context-switches          #    0.103 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.004 K/sec                    ( +-100.00% )
               104 page-faults               #    0.004 M/sec                    ( +-  0.16% )
        64,921,199 cycles                    #    2.389 GHz                      ( +-  0.03% )
        28,967,789 stalled-cycles-frontend   #   44.62% frontend cycles idle     ( +-  0.18% )
   <not supported> stalled-cycles-backend
       104,502,623 instructions              #    1.61  insns per cycle
                                             #    0.28  stalled cycles per insn  ( +-  0.00% )
        34,088,368 branches                  # 1254.587 M/sec                    ( +-  0.00% )
             4,901 branch-misses             #    0.01% of all branches          ( +-  1.32% )

       0.027763015 seconds time elapsed                                          ( +-  3.22% )

And for the new version:

 Performance counter stats for './sqrt-new' (10 runs):

          0.496869 task-clock                #    0.519 CPUs utilized            ( +-  2.38% )
                 0 context-switches          #    0.000 K/sec
                 0 cpu-migrations            #    0.403 K/sec                    ( +-100.00% )
               104 page-faults               #    0.209 M/sec                    ( +-  0.15% )
           590,760 cycles                    #    1.189 GHz                      ( +-  2.35% )
           395,053 stalled-cycles-frontend   #   66.87% frontend cycles idle     ( +-  3.67% )
   <not supported> stalled-cycles-backend
           398,963 instructions              #    0.68  insns per cycle
                                             #    0.99  stalled cycles per insn  ( +-  0.39% )
            70,228 branches                  #  141.341 M/sec                    ( +-  0.36% )
             3,364 branch-misses             #    4.79% of all branches          ( +-  5.45% )

       0.000957440 seconds time elapsed                                          ( +-  2.42% )

Furthermore, this saves space in instruction text:

   text    data     bss     dec     hex filename
    111       0       0     111      6f lib/int_sqrt-baseline.o
     89       0       0      89      59 lib/int_sqrt.o

[1] http://en.wikipedia.org/wiki/First_Draft_of_a_Report_on_the_EDVAC

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
Reviewed-and-tested-by: Jonathan Gonzalez jgonz...@linets.cl
---
 lib/int_sqrt.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index fc2eeb7..1ef4cc3 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -1,3 +1,9 @@
+/*
+ * Copyright (C) 2013 Davidlohr Bueso davidlohr.bu...@hp.com
+ *
+ *  Based on the shift-and-subtract algorithm for computing integer
+ *  square root from Guy L. Steele.
+ */
 
 #include <linux/kernel.h>
 #include <linux/export.h>
@@ -10,23 +16,23 @@
  */
 unsigned long int_sqrt(unsigned long x)
 {
-   unsigned long op, res, one;
+   unsigned long b, m, y = 0;
 
-   op = x;
-   res = 0;
+   if (x <= 1)
+       return x;
 
-   one = 1UL << (BITS_PER_LONG - 2);
-   while (one > op)
-       one >>= 2;
+   m = 1UL << (BITS_PER_LONG - 2);
+   while (m != 0) {
+       b = y + m;
+       y >>= 1;
 
-   while (one != 0) {
-       if (op >= res + one) {
-           op = op - (res + one);
-           res = res +  2 * one;
+       if (x >= b) {
+           x -= b;
+           y += m;
        }
-       res /= 2;
-       one /= 4;
+       m >>= 2;
    }
-   return res;
+
+   return y;
 }
 EXPORT_SYMBOL(int_sqrt);
-- 
1.7.11.7
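
For readers who want to try the loop outside the kernel, the new version of the function can be sketched as a standalone userspace routine. This is only a sketch under stated assumptions: `isqrt_sketch` and `LONG_BITS` are hypothetical names introduced here (the latter stands in for the kernel's BITS_PER_LONG), and nothing below is part of the patch itself.

```c
#include <limits.h>

/* Stand-in for the kernel's BITS_PER_LONG (assumption for this sketch). */
#define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Shift-and-subtract integer square root, mirroring the loop in the
 * patch above. isqrt_sketch is a hypothetical name, not a kernel symbol.
 */
static unsigned long isqrt_sketch(unsigned long x)
{
	unsigned long b, m, y = 0;

	/* 0 and 1 are their own square roots. */
	if (x <= 1)
		return x;

	/* m walks down through the powers of four, highest first. */
	m = 1UL << (LONG_BITS - 2);
	while (m != 0) {
		b = y + m;
		y >>= 1;

		/* If the trial value fits, subtract it and set this result bit. */
		if (x >= b) {
			x -= b;
			y += m;
		}
		m >>= 2;
	}

	return y;
}
```

Calling `isqrt_sketch(17)` walks m down to 16, 4 and 1 and returns the floor of the square root, 4.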





[PATCH] kernel/sys: initialize return codes when declaring variables

2012-09-06 Thread Davidlohr Bueso
Trivially initialize return codes with default values when
the variable is declared.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 kernel/sys.c |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 241507f..b3b2ef7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1364,7 +1364,7 @@ SYSCALL_DEFINE1(olduname, struct oldold_utsname __user *, name)
 
 SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 {
-	int errno;
+	int errno = -EFAULT;
 	char tmp[__NEW_UTS_LEN];
 
 	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
@@ -1373,7 +1373,6 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 	if (len < 0 || len > __NEW_UTS_LEN)
 		return -EINVAL;
 	down_write(&uts_sem);
-	errno = -EFAULT;
 	if (!copy_from_user(tmp, name, len)) {
 		struct new_utsname *u = utsname();
 
@@ -1390,7 +1389,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 
 SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
 {
-	int i, errno;
+	int i, errno = 0;
 	struct new_utsname *u;
 
 	if (len < 0)
@@ -1400,7 +1399,6 @@ SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
 	i = 1 + strlen(u->nodename);
 	if (i > len)
 		i = len;
-	errno = 0;
 	if (copy_to_user(name, u->nodename, i))
 		errno = -EFAULT;
 	up_read(&uts_sem);
@@ -1415,7 +1413,7 @@ SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
  */
 SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 {
-	int errno;
+	int errno = -EFAULT;
 	char tmp[__NEW_UTS_LEN];
 
 	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
@@ -1424,7 +1422,6 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 		return -EINVAL;
 
 	down_write(&uts_sem);
-	errno = -EFAULT;
 	if (!copy_from_user(tmp, name, len)) {
 		struct new_utsname *u = utsname();
 
-- 
1.7.5.4





[PATCH] lib: gcd: prevent possible div by 0

2012-09-09 Thread Davidlohr Bueso
Account for properties when a and/or b are 0:
gcd(0, 0) = 0
gcd(a, 0) = a
gcd(0, b) = b

Cc: sta...@vger.kernel.org
Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 lib/gcd.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/gcd.c b/lib/gcd.c
index cce4f3c..7e163c6 100644
--- a/lib/gcd.c
+++ b/lib/gcd.c
@@ -7,6 +7,9 @@ unsigned long gcd(unsigned long a, unsigned long b)
 {
 	unsigned long r;
 
+	if (!a || !b)
+		return a | b;
+
 	if (a < b)
 		swap(a, b);
 	while ((r = a % b) != 0) {
-- 
1.7.9.5





Re: [PATCH] lib: gcd: prevent possible div by 0

2012-09-10 Thread Davidlohr Bueso
On Mon, 2012-09-10 at 11:12 +0200, Eric Dumazet wrote:
> On Sun, 2012-09-09 at 17:03 +0200, Davidlohr Bueso wrote:
> > Account for properties when a and/or b are 0:
> > gcd(0, 0) = 0
> > gcd(a, 0) = a
> > gcd(0, b) = b
> >
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Davidlohr Bueso d...@gnu.org
> > ---
> >  lib/gcd.c |3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/lib/gcd.c b/lib/gcd.c
> > index cce4f3c..7e163c6 100644
> > --- a/lib/gcd.c
> > +++ b/lib/gcd.c
> > @@ -7,6 +7,9 @@ unsigned long gcd(unsigned long a, unsigned long b)
> >  {
> >  	unsigned long r;
> >
> > +	if (!a || !b)
> > +		return a | b;
>
> This seems overkill

It might, but it reads better, IMHO.

 
> > +
> >  	if (a < b)
> >  		swap(a, b);
>
> better here to :
> 	if (!b)
> 		return a;
 

Sure, I don't mind either way. I'll send a v2 shortly.

Thanks for reviewing.
Davidlohr




[PATCH v2] lib: gcd: prevent possible div by 0

2012-09-10 Thread Davidlohr Bueso
Account for all properties when a and/or b are 0:
gcd(0, 0) = 0
gcd(a, 0) = a
gcd(0, b) = b

Cc: sta...@vger.kernel.org
Signed-off-by: Davidlohr Bueso d...@gnu.org
---
V2: simplified checking with b = 0 (Eric)

 lib/gcd.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/gcd.c b/lib/gcd.c
index cce4f3c..3657f12 100644
--- a/lib/gcd.c
+++ b/lib/gcd.c
@@ -9,6 +9,9 @@ unsigned long gcd(unsigned long a, unsigned long b)
 
 	if (a < b)
 		swap(a, b);
+
+	if (!b)
+		return a;
 	while ((r = a % b) != 0) {
 		a = b;
 		b = r;
-- 
1.7.9.5





Re: [PATCH v2] lib: gcd: prevent possible div by 0

2012-09-12 Thread Davidlohr Bueso
ping? Cc'ing Greg for stable.

On Mon, 2012-09-10 at 16:35 +0200, Davidlohr Bueso wrote:
> Account for all properties when a and/or b are 0:
> gcd(0, 0) = 0
> gcd(a, 0) = a
> gcd(0, b) = b
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Davidlohr Bueso d...@gnu.org
> ---
> V2: simplified checking with b = 0 (Eric)
>
>  lib/gcd.c |3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/gcd.c b/lib/gcd.c
> index cce4f3c..3657f12 100644
> --- a/lib/gcd.c
> +++ b/lib/gcd.c
> @@ -9,6 +9,9 @@ unsigned long gcd(unsigned long a, unsigned long b)
>
>  	if (a < b)
>  		swap(a, b);
> +
> +	if (!b)
> +		return a;
>  	while ((r = a % b) != 0) {
>  		a = b;
>  		b = r;




Re: [PATCH v2] lib: gcd: prevent possible div by 0

2012-09-12 Thread Davidlohr Bueso
On Wed, 2012-09-12 at 12:10 -0700, Andrew Morton wrote:
> On Mon, 10 Sep 2012 16:35:19 +0200
> Davidlohr Bueso d...@gnu.org wrote:
>
> > Account for all properties when a and/or b are 0:
> > gcd(0, 0) = 0
> > gcd(a, 0) = a
> > gcd(0, b) = b
> >
> > Cc: sta...@vger.kernel.org
>
> Why cc:stable?  If this patch fixes some known problem in the current
> kernel then that really really should have been described in the
> changelog.  Always.  Please.

Ok, I will keep it in mind next time. No known problem (at least that I
know of), but due to the nature of the potential bug, I thought that it
was worth adding it to stable.

Thanks.

 
> > ...
> > --- a/lib/gcd.c
> > +++ b/lib/gcd.c
> > @@ -9,6 +9,9 @@ unsigned long gcd(unsigned long a, unsigned long b)
> >
> >  	if (a < b)
> >  		swap(a, b);
> > +
> > +	if (!b)
> > +		return a;
> >  	while ((r = a % b) != 0) {
> >  		a = b;
> >  		b = r;
> > -- 
> > 1.7.9.5
  
  
 




Re: [PATCH v2] lib: gcd: prevent possible div by 0

2012-09-12 Thread Davidlohr Bueso
On Wed, 2012-09-12 at 12:36 -0700, Andrew Morton wrote:
> On Wed, 12 Sep 2012 21:20:30 +0200
> Davidlohr Bueso d...@gnu.org wrote:
>
> > On Wed, 2012-09-12 at 12:10 -0700, Andrew Morton wrote:
> > > On Mon, 10 Sep 2012 16:35:19 +0200
> > > Davidlohr Bueso d...@gnu.org wrote:
> > >
> > > > Account for all properties when a and/or b are 0:
> > > > gcd(0, 0) = 0
> > > > gcd(a, 0) = a
> > > > gcd(0, b) = b
> > > >
> > > > Cc: sta...@vger.kernel.org
> > >
> > > Why cc:stable?  If this patch fixes some known problem in the current
> > > kernel then that really really should have been described in the
> > > changelog.  Always.  Please.
> >
> > Ok, I will keep it in mind next time. No known problem (at least that I
> > know of), but due to the nature of the potential bug, I thought that it
> > was worth adding it to stable.
>
> OK.
>
> I'm not personally averse to fixing such problems in -stable,
> particularly in lib/ code.  After all, people who take -stable kernels
> will then change them and add drivers and backport changes from later
> kernels, etc.  They might be bitten by such a bug.

Yes, my thoughts exactly.
 
 
> I'm scratching my head a bit at the patch though.  What does gcd(0, 13)
> mean?  That 0 can be divided by 13 zero times, which is an integer
> result?  I wonder why any non-buggy code would do that
 

While I've been away from this kind of math for a while, Euclid's
algorithm says that if r = a mod b, then gcd(a, b) = gcd(b, r), so:

gcd(0, 13) = gcd(13, 0 mod 13) = gcd(13, 0)

Since the GCD of a and b is the largest integer that divides both a
and b with no remainder, the algorithm stops when r = 0, and
therefore gcd(13, 0) = 13.

http://mitpress.mit.edu/sicp/full-text/sicp/book/node19.html
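
The reasoning above can be checked with a small userspace sketch of Euclid's algorithm that includes the zero test from v2 of the patch. The name `gcd_sketch` is hypothetical and the explicit swap replaces the kernel's swap() macro; treat it as an illustration under those assumptions rather than the code in lib/gcd.c.

```c
/*
 * Userspace sketch of Euclid's algorithm with the b == 0 check from v2.
 * gcd_sketch is a hypothetical name, not the kernel's gcd().
 */
static unsigned long gcd_sketch(unsigned long a, unsigned long b)
{
	unsigned long r, tmp;

	/* Keep the larger value in a, as the kernel does with swap(a, b). */
	if (a < b) {
		tmp = a;
		a = b;
		b = tmp;
	}

	/* gcd(a, 0) = a; returning here also avoids evaluating a % 0. */
	if (!b)
		return a;

	/* gcd(a, b) = gcd(b, a mod b) until the remainder reaches zero. */
	while ((r = a % b) != 0) {
		a = b;
		b = r;
	}

	return b;
}
```

Because the swap runs first, `gcd_sketch(0, 13)` and `gcd_sketch(13, 0)` take the same early return and both give 13.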



[PATCH] staging: keucr: remove String func prototypes

2012-09-12 Thread Davidlohr Bueso
Commit 1b9f644dfeb638e0146ce54f4e48c87a2841a603 already got rid of
StringCopy and StringCmp, so remove the left over prototypes.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 drivers/staging/keucr/smcommon.h |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/staging/keucr/smcommon.h b/drivers/staging/keucr/smcommon.h
index 278bdb8..4d57203 100644
--- a/drivers/staging/keucr/smcommon.h
+++ b/drivers/staging/keucr/smcommon.h
@@ -25,7 +25,5 @@ Define Difinetion
 #define ERR_NoSmartMedia	0x003A /* Medium Not Present */
 
 /***/
-void StringCopy(char *, char *, int);
-int  StringCmp(char *, char *, int);
 
 #endif
-- 
1.7.9.5





[PATCH] KVM: VMX: invalidate vpid for invlpg instruction

2012-08-31 Thread Davidlohr Bueso
For processors that support VPIDs we should invalidate the page table entry
specified by the linear address. For this purpose add support for individual
address invalidations.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 arch/x86/include/asm/vmx.h |6 --
 arch/x86/kvm/vmx.c |   15 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 74fcb96..20abb18 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -393,6 +393,7 @@ enum vmcs_field {
 #define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
 
 #define VMX_NR_VPIDS   (1 << 16)
+#define VMX_VPID_EXTENT_INDIVIDUAL_ADDR 0
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1
 #define VMX_VPID_EXTENT_ALL_CONTEXT    2
 
@@ -406,12 +407,13 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT                (1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT           (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT           (1ull << 17)
-#define VMX_EPT_AD_BIT                 (1ull << 21)
+#define VMX_EPT_AD_BIT                 (1ull << 21)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT     (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT      (1ull << 26)
 
-#define VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT  (1ull << 9) /* (41 - 32) */
+#define VMX_VPID_EXTENT_INDIVIDUAL_ADDR_BIT (1ull << 8)  /* (40 - 32) */
+#define VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT  (1ull << 9)  /* (41 - 32) */
 #define VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT  (1ull << 10) /* (42 - 32) */
 
 #define VMX_EPT_DEFAULT_GAW            3
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c00f03d..d87b22c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -816,6 +816,11 @@ static inline bool cpu_has_vmx_invept_global(void)
 	return vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT;
 }
 
+static inline bool cpu_has_vmx_invvpid_individual_addr(void)
+{
+	return vmx_capability.vpid & VMX_VPID_EXTENT_INDIVIDUAL_ADDR_BIT;
+}
+
 static inline bool cpu_has_vmx_invvpid_single(void)
 {
 	return vmx_capability.vpid & VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT;
@@ -1011,6 +1016,15 @@ static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
+static inline void vpid_sync_vcpu_individual_addr(struct vcpu_vmx *vmx, gpa_t gpa)
+{
+	if (vmx->vpid == 0)
+		return;
+
+	if (cpu_has_vmx_invvpid_individual_addr())
+		__invvpid(VMX_VPID_EXTENT_INDIVIDUAL_ADDR, vmx->vpid, gpa);
+}
+
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
 {
 	if (vmx->vpid == 0)
@@ -4719,6 +4733,7 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 
 	kvm_mmu_invlpg(vcpu, exit_qualification);
+	vpid_sync_vcpu_individual_addr(to_vmx(vcpu), exit_qualification);
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
-- 
1.7.4.1





Re: [PATCH] KVM: VMX: invalidate vpid for invlpg instruction

2012-09-02 Thread Davidlohr Bueso
On Fri, 2012-08-31 at 14:37 -0300, Marcelo Tosatti wrote:
> On Fri, Aug 31, 2012 at 06:10:48PM +0200, Davidlohr Bueso wrote:
> > For processors that support VPIDs we should invalidate the page table entry
> > specified by the linear address. For this purpose add support for individual
> > address invalidations.
>
> Not necessary - a single context invalidation is performed through
> KVM_REQ_TLB_FLUSH.

Since vpid_sync_context() supports both single and all-context vpid
invalidations, wouldn't it make sense to also add individual address
ones as well, supporting further granularity?

 
> Single-context. If the INVVPID type is 1, the logical processor
> invalidates all linear mappings and combined mappings associated with
> the VPID specified in the INVVPID descriptor.
 
 




[PATCH 0/5] acpi: remove some legacy procfs interfaces

2012-09-02 Thread Davidlohr Bueso
Hi,

This patchset is a first attempt to remove some of the deprecated procfs
ACPI interfaces, with the eventual goal of removing /proc/acpi entirely.
Based on the feature removal file, the CONFIG_ACPI_PROCFS_POWER and
CONFIG_ACPI_PROC_EVENT options are dropped.

patch 1: removes CONFIG_ACPI_PROCFS_POWER
patch 2-5: removes CONFIG_ACPI_PROC_EVENT for acpi and respective drivers
that use /proc/acpi/event.

The set applies on top of Linus' latest.

Thanks,
Davidlohr




[PATCH 5/5] acpi: events: remove procfs interface

2012-09-02 Thread Davidlohr Bueso
The /proc/acpi/event interface has been replaced by events through the
input layer and netlink, and scheduled for removal over four years ago.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 Documentation/feature-removal-schedule.txt |8 --
 drivers/acpi/Kconfig   |   18 -
 drivers/acpi/ac.c  |1 -
 drivers/acpi/acpi_pad.c|1 -
 drivers/acpi/battery.c |2 -
 drivers/acpi/bus.c |   98 -
 drivers/acpi/button.c  |2 -
 drivers/acpi/event.c   |  107 +---
 drivers/acpi/processor_driver.c|4 -
 drivers/acpi/sbs.c |   16 +
 drivers/acpi/thermal.c |3 -
 drivers/acpi/video.c   |   10 ---
 include/acpi/acpi_bus.h|8 --
 13 files changed, 4 insertions(+), 274 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index db385ee..3021e77 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -181,14 +181,6 @@ Who:   Zhang Rui rui.zh...@intel.com
 
 ---
 
-What:  /proc/acpi/event
-When:  February 2008
-Why:   /proc/acpi/event has been replaced by events via the input layer
-   and netlink since 2.6.23.
-Who:   Len Brown len.br...@intel.com
-

-
 What:  i386/x86_64 bzImage symlinks
 When:  April 2010
 
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 6aa0cc8..37110d1 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -74,24 +74,6 @@ config ACPI_EC_DEBUGFS
  Thus this option is a debug option that helps to write ACPI drivers
  and can be used to identify ACPI code or EC firmware bugs.
 
-config ACPI_PROC_EVENT
-	bool "Deprecated /proc/acpi/event support"
-	depends on PROC_FS
-	default y
-	help
-	  A user-space daemon, acpid, typically reads /proc/acpi/event
-	  and handles all ACPI-generated events.
-
-	  These events are now delivered to user-space either
-	  via the input layer or as netlink events.
-
-	  This build option enables the old code for legacy
-	  user-space implementation.  After some time, this will
-	  be moved under CONFIG_ACPI_PROCFS, and then deleted.
-
-	  Say Y here to retain the old behaviour.  Say N if your
-	  user-space is newer than kernel 2.6.23 (September 2007).
-
 config ACPI_AC
 	tristate "AC Adapter"
 	depends on X86
diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index 7e00303..af56697 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -156,7 +156,6 @@ static void acpi_ac_notify(struct acpi_device *device, u32 event)
 	case ACPI_NOTIFY_BUS_CHECK:
 	case ACPI_NOTIFY_DEVICE_CHECK:
 		acpi_ac_get_state(ac);
-		acpi_bus_generate_proc_event(device, event, (u32) ac->state);
 		acpi_bus_generate_netlink_event(device->pnp.device_class,
 						  dev_name(&device->dev), event,
 						  (u32) ac->state);
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index af4aad6..fe1085d 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -452,7 +452,6 @@ static void acpi_pad_notify(acpi_handle handle, u32 event,
 	switch (event) {
 	case ACPI_PROCESSOR_AGGREGATOR_NOTIFY:
 		acpi_pad_handle_notify(handle);
-		acpi_bus_generate_proc_event(device, event, 0);
 		acpi_bus_generate_netlink_event(device->pnp.device_class,
 			dev_name(&device->dev), event, 0);
 		break;
diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index bd364a4..38a37bd 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -657,8 +657,6 @@ static void acpi_battery_notify(struct acpi_device *device, u32 event)
 	if (event == ACPI_BATTERY_NOTIFY_INFO)
 		acpi_battery_refresh(battery);
 	acpi_battery_update(battery);
-	acpi_bus_generate_proc_event(device, event,
-				     acpi_battery_present(battery));
 	acpi_bus_generate_netlink_event(device->pnp.device_class,
 					dev_name(&device->dev), event,
 					acpi_battery_present(battery));
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 9628652..63b903b 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -593,104 +593,6 @@ static void acpi_bus_osc_support(void)
 }
 
 /* --
-Event Management

[PATCH 1/5] acpi: remove CONFIG_ACPI_PROCFS_POWER option

2012-09-02 Thread Davidlohr Bueso
The long time deprecated procfs interface for ACPI power devices has
been scheduled for removal since linux 2.6.39.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 Documentation/feature-removal-schedule.txt |   11 -
 drivers/acpi/Kconfig   |   17 --
 drivers/acpi/Makefile  |1 -
 drivers/acpi/ac.c  |  128 +---
 drivers/acpi/battery.c |  328 +---
 drivers/acpi/cm_sbs.c  |  105 -
 drivers/acpi/sbs.c |  333 +---
 7 files changed, 8 insertions(+), 915 deletions(-)
 delete mode 100644 drivers/acpi/cm_sbs.c

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index afaff31..db385ee 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -181,17 +181,6 @@ Who:   Zhang Rui rui.zh...@intel.com
 
 ---
 
-What:  CONFIG_ACPI_PROCFS_POWER
-When:  2.6.39
-Why:   sysfs I/F for ACPI power devices, including AC and Battery,
-has been working in upstream kernel since 2.6.24, Sep 2007.
-   In 2.6.37, we make the sysfs I/F always built in and this option
-   disabled by default.
-   Remove this option and the ACPI power procfs interface in 2.6.39.
-Who:   Zhang Rui rui.zh...@intel.com
-

-
 What:  /proc/acpi/event
 When:  February 2008
 Why:   /proc/acpi/event has been replaced by events via the input layer
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 8099895..6aa0cc8 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -56,23 +56,6 @@ config ACPI_PROCFS
 
  Say N to delete /proc/acpi/ files that have moved to /sys/
 
-config ACPI_PROCFS_POWER
-	bool "Deprecated power /proc/acpi directories"
-	depends on PROC_FS
-	help
-	  For backwards compatibility, this option allows
-	  deprecated power /proc/acpi/ directories to exist, even when
-	  they have been replaced by functions in /sys.
-	  The deprecated directories (and their replacements) include:
-	  /proc/acpi/battery/* (/sys/class/power_supply/*)
-	  /proc/acpi/ac_adapter/* (sys/class/power_supply/*)
-	  This option has no effect on /proc/acpi/ directories
-	  and functions, which do not yet exist in /sys
-	  This option, together with the proc directories, will be
-	  deleted in 2.6.39.
-
-	  Say N to delete power /proc/acpi/ directories that have moved to /sys/
-
 config ACPI_EC_DEBUGFS
 	tristate "EC read/write access through /sys/kernel/debug/ec"
 	default n
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 47199e2..4455be2 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -41,7 +41,6 @@ acpi-y+= event.o
 acpi-y += sysfs.o
 acpi-$(CONFIG_DEBUG_FS)+= debugfs.o
 acpi-$(CONFIG_ACPI_NUMA)   += numa.o
-acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o
 ifdef CONFIG_ACPI_VIDEO
 acpi-y += video_detect.o
 endif
diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index d5fdd36..7e00303 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -28,10 +28,6 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/types.h>
-#ifdef CONFIG_ACPI_PROCFS_POWER
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-#endif
 #include <linux/power_supply.h>
 #include <acpi/acpi_bus.h>
 #include <acpi/acpi_drivers.h>
@@ -53,12 +49,6 @@ MODULE_AUTHOR("Paul Diefenbaugh");
 MODULE_DESCRIPTION("ACPI AC Adapter Driver");
 MODULE_LICENSE("GPL");
 
-#ifdef CONFIG_ACPI_PROCFS_POWER
-extern struct proc_dir_entry *acpi_lock_ac_dir(void);
-extern void *acpi_unlock_ac_dir(struct proc_dir_entry *acpi_ac_dir);
-static int acpi_ac_open_fs(struct inode *inode, struct file *file);
-#endif
-
 static int acpi_ac_add(struct acpi_device *device);
 static int acpi_ac_remove(struct acpi_device *device, int type);
 static void acpi_ac_notify(struct acpi_device *device, u32 event);
@@ -95,16 +85,6 @@ struct acpi_ac {
 
 #define to_acpi_ac(x) container_of(x, struct acpi_ac, charger)
 
-#ifdef CONFIG_ACPI_PROCFS_POWER
-static const struct file_operations acpi_ac_fops = {
-   .owner = THIS_MODULE,
-   .open = acpi_ac_open_fs,
-   .read = seq_read,
-   .llseek = seq_lseek,
-   .release = single_release,
-};
-#endif
-
 /* --
AC Adapter Management
-- 
*/
@@ -156,83 +136,6 @@ static enum power_supply_property ac_props[] = {
POWER_SUPPLY_PROP_ONLINE,
 };
 
-#ifdef CONFIG_ACPI_PROCFS_POWER
-/* --
-  FS

[PATCH 2/5] sonypi: remove acpi_bus_generate_proc_event

2012-09-02 Thread Davidlohr Bueso
Calling this function no longer makes sense as /proc/acpi/event
is being removed.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 drivers/char/sonypi.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/drivers/char/sonypi.c b/drivers/char/sonypi.c
index f877805..4543473 100644
--- a/drivers/char/sonypi.c
+++ b/drivers/char/sonypi.c
@@ -876,11 +876,6 @@ found:
 	if (useinput)
 		sonypi_report_input_event(event);
 
-#ifdef CONFIG_ACPI
-	if (sonypi_acpi_device)
-		acpi_bus_generate_proc_event(sonypi_acpi_device, 1, event);
-#endif
-
 	kfifo_in_locked(&sonypi_device.fifo, (unsigned char *)&event,
 			sizeof(event), &sonypi_device.fifo_lock);
 	kill_fasync(&sonypi_device.fifo_async, SIGIO, POLL_IN);
-- 
1.7.4.1






[PATCH 3/5] PCI: hotplug: remove acpi_bus_generate_proc_event

2012-09-02 Thread Davidlohr Bueso
Calling this function no longer makes sense as /proc/acpi/event
is being removed.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 drivers/pci/hotplug/acpiphp_ibm.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/hotplug/acpiphp_ibm.c b/drivers/pci/hotplug/acpiphp_ibm.c
index c35e8ad..5394fff 100644
--- a/drivers/pci/hotplug/acpiphp_ibm.c
+++ b/drivers/pci/hotplug/acpiphp_ibm.c
@@ -270,7 +270,6 @@ static void ibm_handle_events(acpi_handle handle, u32 event, void *context)
 
 	if (subevent == 0x80) {
 		dbg("%s: generationg bus event\n", __func__);
-		acpi_bus_generate_proc_event(note->device, note->event, detail);
 		acpi_bus_generate_netlink_event(note->device->pnp.device_class,
 						  dev_name(&note->device->dev),
 						  note->event, detail);
-- 
1.7.4.1






[PATCH 4/5] platform: x86: remove acpi_bus_generate_proc_event

2012-09-02 Thread Davidlohr Bueso
Calling this function no longer makes sense as /proc/acpi/event
is being removed.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 drivers/platform/x86/asus-laptop.c  |1 -
 drivers/platform/x86/eeepc-laptop.c |1 -
 drivers/platform/x86/fujitsu-laptop.c   |4 
 drivers/platform/x86/panasonic-laptop.c |2 --
 drivers/platform/x86/sony-laptop.c  |3 ---
 drivers/platform/x86/thinkpad_acpi.c|   11 ---
 6 files changed, 0 insertions(+), 22 deletions(-)

diff --git a/drivers/platform/x86/asus-laptop.c b/drivers/platform/x86/asus-laptop.c
index e38f91b..6c4e31d 100644
--- a/drivers/platform/x86/asus-laptop.c
+++ b/drivers/platform/x86/asus-laptop.c
@@ -1514,7 +1514,6 @@ static void asus_acpi_notify(struct acpi_device *device, u32 event)
 
 	/* TODO Find a better way to handle events count. */
 	count = asus->event_count[event % 128]++;
-	acpi_bus_generate_proc_event(asus->device, event, count);
 	acpi_bus_generate_netlink_event(asus->device->pnp.device_class,
 					dev_name(&asus->device->dev), event,
 					count);
diff --git a/drivers/platform/x86/eeepc-laptop.c b/drivers/platform/x86/eeepc-laptop.c
index dab91b4..8a94d4d 100644
--- a/drivers/platform/x86/eeepc-laptop.c
+++ b/drivers/platform/x86/eeepc-laptop.c
@@ -1267,7 +1267,6 @@ static void eeepc_acpi_notify(struct acpi_device *device, u32 event)
 	if (event > ACPI_MAX_SYS_NOTIFY)
 		return;
 	count = eeepc->event_count[event % 128]++;
-	acpi_bus_generate_proc_event(device, event, count);
 	acpi_bus_generate_netlink_event(device->pnp.device_class,
 					dev_name(&device->dev), event,
 					count);
diff --git a/drivers/platform/x86/fujitsu-laptop.c b/drivers/platform/x86/fujitsu-laptop.c
index c4c1a54..2fac1c5 100644
--- a/drivers/platform/x86/fujitsu-laptop.c
+++ b/drivers/platform/x86/fujitsu-laptop.c
@@ -773,8 +773,6 @@ static void acpi_fujitsu_notify(struct acpi_device *device, u32 event)
 				else
 					set_lcd_level(newb);
 			}
-			acpi_bus_generate_proc_event(fujitsu->dev,
-				ACPI_VIDEO_NOTIFY_INC_BRIGHTNESS, 0);
 			keycode = KEY_BRIGHTNESSUP;
 		} else if (oldb > newb) {
 			if (disable_brightness_adjust != 1) {
@@ -783,8 +781,6 @@ static void acpi_fujitsu_notify(struct acpi_device *device, u32 event)
 				else
 					set_lcd_level(newb);
 			}
-			acpi_bus_generate_proc_event(fujitsu->dev,
-				ACPI_VIDEO_NOTIFY_DEC_BRIGHTNESS, 0);
 			keycode = KEY_BRIGHTNESSDOWN;
 		}
 		break;
diff --git a/drivers/platform/x86/panasonic-laptop.c b/drivers/platform/x86/panasonic-laptop.c
index 8e8caa7..a140e7f 100644
--- a/drivers/platform/x86/panasonic-laptop.c
+++ b/drivers/platform/x86/panasonic-laptop.c
@@ -465,8 +465,6 @@ static void acpi_pcc_generate_keyinput(struct pcc_acpi *pcc)
 		return;
 	}
 
-	acpi_bus_generate_proc_event(pcc->device, HKEY_NOTIFY, result);
-
 	if (!sparse_keymap_report_event(hotk_input_dev,
 					result & 0xf, result & 0x80, false))
 		ACPI_DEBUG_PRINT((ACPI_DB_ERROR,
diff --git a/drivers/platform/x86/sony-laptop.c b/drivers/platform/x86/sony-laptop.c
index daaddec..5bfcfbc 100644
--- a/drivers/platform/x86/sony-laptop.c
+++ b/drivers/platform/x86/sony-laptop.c
@@ -1269,8 +1269,6 @@ static void sony_nc_notify(struct acpi_device *device, u32 event)
 		sony_laptop_report_input_event(real_ev);
 	}
 
-	acpi_bus_generate_proc_event(sony_nc_acpi_device, ev_type, real_ev);
-
 	acpi_bus_generate_netlink_event(sony_nc_acpi_device->pnp.device_class,
 			dev_name(&sony_nc_acpi_device->dev), ev_type, real_ev);
 }
@@ -4100,7 +4098,6 @@ static irqreturn_t sony_pic_irq(int irq, void *dev_id)
 
 found:
 	sony_laptop_report_input_event(device_event);
-	acpi_bus_generate_proc_event(dev->acpi_dev, 1, device_event);
 	sonypi_compat_report_event(device_event);
 	return IRQ_HANDLED;
 }
diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c
index 80e3779..fb105fa 100644
--- a/drivers/platform/x86/thinkpad_acpi.c
+++ b/drivers/platform/x86/thinkpad_acpi.c
@@ -2286,10 +2286,6 @@ static struct tp_acpi_drv_struct ibm_hotkey_acpidriver;
 static void tpacpi_hotkey_send_key(unsigned int scancode)
 {
 	tpacpi_input_send_key_masked(scancode);
-	if (hotkey_report_mode < 2) {
-		acpi_bus_generate_proc_event(ibm_hotkey_acpidriver.device,
-   0x80

[PATCH 1/3] partitions: efi: compare first and last usable LBAs

2012-09-05 Thread Davidlohr Bueso
When verifying GPT header integrity, make sure that
first usable LBA is smaller than last usable LBA.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 6296b40..7795bb4 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -344,6 +344,12 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 	 * within the disk.
 	 */
 	lastlba = last_lba(state->bdev);
+	if (le64_to_cpu((*gpt)->last_usable_lba) < le64_to_cpu((*gpt)->first_usable_lba)) {
+		pr_debug("GPT: last_usable_lba incorrect: %lld < %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
+			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba));
+		goto fail;
+	}
 	if (le64_to_cpu((*gpt)->first_usable_lba) > lastlba) {
 		pr_debug("GPT: first_usable_lba incorrect: %lld > %lld\n",
 			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba),
-- 
1.7.4.1






[PATCH 2/3] partitions: efi: verify header is outside usable area

2012-09-05 Thread Davidlohr Bueso
The first usable logical block can be used by a GUID partition
entry, and therefore cannot be used by the header.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 7795bb4..abf33a2 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -363,6 +363,13 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 		goto fail;
 	}
 
+	/* The header must be outside usable range */
+	if (le64_to_cpu((*gpt)->first_usable_lba) < lba &&
+	    le64_to_cpu((*gpt)->last_usable_lba) > lba) {
+		pr_debug("GPT: Header is inside usable area\n");
+		goto fail;
+	}
+
 	/* Check that sizeof_partition_entry has the correct value */
 	if (le32_to_cpu((*gpt)->sizeof_partition_entry) != sizeof(gpt_entry)) {
 		pr_debug("GUID Partitition Entry Size check failed.\n");
-- 
1.7.4.1
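
The condition can be exercised in isolation with a small userspace sketch. The helper name gpt_header_in_usable_area() is hypothetical, and the arguments are assumed to be in CPU byte order already.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the header's own LBA lies strictly
 * between first_usable_lba and last_usable_lba, which is the case the
 * patch rejects. */
static bool gpt_header_in_usable_area(uint64_t header_lba,
				      uint64_t first_usable_lba,
				      uint64_t last_usable_lba)
{
	return first_usable_lba < header_lba && last_usable_lba > header_lba;
}
```

With a typical layout (primary header at LBA 1, usable area starting at LBA 34) the helper returns false, as it does for a backup header placed after the last usable block. Because both comparisons are strict, a header sitting exactly on either boundary is not flagged by this sketch.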






[PATCH 3/3] partitions: efi: check minimum header size

2012-09-05 Thread Davidlohr Bueso
As per the UEFI specification 2.3.1 (June 2012):
"The Header Size must be greater than 92 and must be less than
or equal to the logical block size."

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 block/partitions/efi.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index abf33a2..3a5114e 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -25,6 +25,9 @@
  * TODO:
  *
  * Changelog:
+ * Sept. 2012 Davidlohr Bueso d...@gnu.org
+ * - tighten GPT header integrity verification.
+ *
  * Mon Nov 09 2004 Matt Domsch matt_dom...@dell.com
  * - test for valid PMBR and valid PGPT before ever reading
  *   AGPT, allow override with 'gpt' kernel command line option.
@@ -311,8 +314,8 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 	}
 
 	/* Check the GUID Partition Table header size */
-	if (le32_to_cpu((*gpt)->header_size) >
-			bdev_logical_block_size(state->bdev)) {
+	if (le32_to_cpu((*gpt)->header_size) <= 92 ||
+	    le32_to_cpu((*gpt)->header_size) > bdev_logical_block_size(state->bdev)) {
 		pr_debug("GUID Partition Table Header size is wrong: %u > %u\n",
 			le32_to_cpu((*gpt)->header_size),
 			bdev_logical_block_size(state->bdev));
-- 
1.7.4.1
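
A standalone sketch of the size bounds described in the commit message follows. The names GPT_HEADER_SIZE_FLOOR and gpt_header_size_ok() are hypothetical, and the sketch simply mirrors the wording quoted above ("greater than 92", "less than or equal to the logical block size").

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the bound quoted in the commit message. */
#define GPT_HEADER_SIZE_FLOOR 92u

/* Returns true when header_size is strictly above the floor and does
 * not exceed the logical block size. */
static bool gpt_header_size_ok(uint32_t header_size,
			       uint32_t logical_block_size)
{
	if (header_size <= GPT_HEADER_SIZE_FLOOR)
		return false;
	if (header_size > logical_block_size)
		return false;
	return true;
}
```

Under these assumptions a header size of exactly 92 is rejected, while any size from 93 up to the block size (512 on many disks) is accepted.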






Re: [PATCH] KVM: VMX: invalidate vpid for invlpg instruction

2012-09-05 Thread Davidlohr Bueso
On Mon, 2012-09-03 at 12:11 +0300, Avi Kivity wrote:
> On 09/03/2012 02:27 AM, Davidlohr Bueso wrote:
> > On Fri, 2012-08-31 at 14:37 -0300, Marcelo Tosatti wrote:
> >> On Fri, Aug 31, 2012 at 06:10:48PM +0200, Davidlohr Bueso wrote:
> >> > For processors that support VPIDs we should invalidate the page table entry
> >> > specified by the lineal address. For this purpose add support for individual
> >> > address invalidations.
> >>
> >> Not necessary - a single context invalidation is performed through
> >> KVM_REQ_TLB_FLUSH.
> >>
> > Since vpid_sync_context() supports both single and all-context vpid
> > invalidations, wouldn't it make sense to also add individual address
> > ones as well, supporting further granularity?
>
> It might.  Do you have benchmarks supporting this?
>

I ran two benchmarks: Java Dacapo[1] Sunflow (renders a set of images
using ray tracing) and a vanilla 3.2 kernel build (with 1 job and -j8).

The host is an Intel i7-2635QM (4 cores + HT) with 4GB RAM, running
Linus's latest tree with only the standard system daemons. For KVM
I disabled EPT.
The guest is a 64-bit, 4-core VM with 4GB RAM, running Linux 3.2
(Debian) and only the benchmark.

All results represent the mean of 5 runs, with time(1).

Dacapo without individual addr invvpid:
real    1m25.406s
user    4m59.315s
sys     1m25.406s

Dacapo with individual addr invvpid:
real    1m4.421s
user    3m47.150s
sys     0m1.592s

--

vanilla kernel build without individual addr invvpid:
real    16m42.571s
user    13m28.975s
sys     2m54.487s

vanilla kernel build with individual addr invvpid:
real    15m45.789s
user    12m25.691s
sys     2m44.806s

--

vanilla kernel build (-j8) without individual addr invvpid:
real    10m32.276s
user    33m47.687s
sys     5m37.725s

vanilla kernel build (-j8) with individual addr invvpid:
real    8m29.789s
user    28m12.850s
sys     4m34.353s


In all cases, individual address invalidation outperforms single-context
invalidation in terms of wall time. Comments?

[1] http://dacapobench.org/


Thanks,
Davidlohr
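
To put the wall-time figures in relative terms, the reduction can be worked out with a couple of small helpers. This is only a sketch for the arithmetic; to_seconds() and reduction_pct() are hypothetical names introduced here.

```c
/* Convert a time(1) reading of the form "XmY.ZZZs", given as separate
 * minutes and seconds, into seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Relative reduction in percent when going from `before` to `after`. */
static double reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Applied to the "real" rows above, this gives roughly a 24.6% reduction for Dacapo (85.406s to 64.421s), about 5.7% for the single-job kernel build (1002.571s to 945.789s) and about 19.4% for the -j8 build (632.276s to 509.789s).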



[PATCH] oom: remove deprecated oom_adj

2012-08-24 Thread Davidlohr Bueso
The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 Documentation/ABI/obsolete/proc-pid-oom_adj |   22 -
 Documentation/feature-removal-schedule.txt  |   25 --
 Documentation/filesystems/proc.txt  |   22 +
 fs/proc/base.c  |  117 +--
 include/linux/oom.h |   11 ---
 include/linux/sched.h   |1 -
 kernel/fork.c   |1 -
 mm/oom_kill.c   |4 +-
 8 files changed, 7 insertions(+), 196 deletions(-)
 delete mode 100644 Documentation/ABI/obsolete/proc-pid-oom_adj

diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj 
b/Documentation/ABI/obsolete/proc-pid-oom_adj
deleted file mode 100644
index 9a3cb88..000
--- a/Documentation/ABI/obsolete/proc-pid-oom_adj
+++ /dev/null
@@ -1,22 +0,0 @@
-What:  /proc/pid/oom_adj
-When:  August 2012
-Why:   /proc/pid/oom_adj allows userspace to influence the oom killer's
-   badness heuristic used to determine which task to kill when the kernel
-   is out of memory.
-
-   The badness heuristic has since been rewritten since the introduction of
-   this tunable such that its meaning is deprecated.  The value was
-   implemented as a bitshift on a score generated by the badness()
-   function that did not have any precise units of measure.  With the
-   rewrite, the score is given as a proportion of available memory to the
-   task allocating pages, so using a bitshift which grows the score
-   exponentially is, thus, impossible to tune with fine granularity.
-
-   A much more powerful interface, /proc/pid/oom_score_adj, was
-   introduced with the oom killer rewrite that allows users to increase or
-   decrease the badness score linearly.  This interface will replace
-   /proc/pid/oom_adj.
-
-   A warning will be emitted to the kernel log if an application uses this
-   deprecated interface.  After it is printed once, future warnings will be
-   suppressed until the kernel is rebooted.
diff --git a/Documentation/feature-removal-schedule.txt 
b/Documentation/feature-removal-schedule.txt
index afaff31..d369f59 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -115,31 +115,6 @@ Who:   Pavel Machek pa...@ucw.cz
 
 ---
 
-What:  /proc/pid/oom_adj
-When:  August 2012
-Why:   /proc/pid/oom_adj allows userspace to influence the oom killer's
-   badness heuristic used to determine which task to kill when the kernel
-   is out of memory.
-
-   The badness heuristic has since been rewritten since the introduction of
-   this tunable such that its meaning is deprecated.  The value was
-   implemented as a bitshift on a score generated by the badness()
-   function that did not have any precise units of measure.  With the
-   rewrite, the score is given as a proportion of available memory to the
-   task allocating pages, so using a bitshift which grows the score
-   exponentially is, thus, impossible to tune with fine granularity.
-
-   A much more powerful interface, /proc/pid/oom_score_adj, was
-   introduced with the oom killer rewrite that allows users to increase or
-   decrease the badness score linearly.  This interface will replace
-   /proc/pid/oom_adj.
-
-   A warning will be emitted to the kernel log if an application uses this
-   deprecated interface.  After it is printed once, future warnings will be
-   suppressed until the kernel is rebooted.
-

-
 What:  remove EXPORT_SYMBOL(kernel_thread)
 When:  August 2006
 Files: arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index fb0a6ae..a1793d6 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,7 @@ Table of Contents
   2Modifying System Parameters
 
   3Per-Process Parameters
-  3.1  /proc/pid/oom_adj  /proc/pid/oom_score_adj - Adjust the oom-killer
+  3.1  /proc/pid/oom_score_adj - Adjust the oom-killer
score
   3.2  /proc/pid/oom_score - Display current oom-killer score
   3.3  /proc/pid/io - Display the IO accounting fields
@@ -1320,10 +1320,10 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 --
 
-3.1 /proc/pid/oom_adj  /proc/pid/oom_score_adj- Adjust the oom-killer 
score
+3.1 /proc/pid/oom_score_adj- Adjust the oom-killer score
 

 
-These file can be used to adjust the badness heuristic used to select which
+This file can be used to adjust the badness heuristic used to select which
 process gets

[PATCH] mm: add node physical memory range to sysfs

2012-12-07 Thread Davidlohr Bueso
This patch adds a new 'memrange' file that shows the starting and
ending physical addresses associated with a node. This is
useful for identifying specific DIMMs within the system.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/base/node.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..f165a0a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -211,6 +211,19 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+static ssize_t node_read_memrange(struct device *dev,
+				  struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+
+	return sprintf(buf, "%#010Lx-%#010Lx\n",
+		       (unsigned long long) start_pfn << PAGE_SHIFT,
+		       (unsigned long long) (end_pfn << PAGE_SHIFT) - 1);
+}
+static DEVICE_ATTR(memrange, S_IRUGO, node_read_memrange, NULL);
+
 #ifdef CONFIG_HUGETLBFS
 /*
  * hugetlbfs per node attributes registration interface:
@@ -274,6 +287,7 @@ int register_node(struct node *node, int num, struct node *parent)
 		device_create_file(&node->dev, &dev_attr_numastat);
 		device_create_file(&node->dev, &dev_attr_distance);
 		device_create_file(&node->dev, &dev_attr_vmstat);
+		device_create_file(&node->dev, &dev_attr_memrange);
 
 		scan_unevictable_register_node(node);
 
@@ -299,6 +313,7 @@ void unregister_node(struct node *node)
 	device_remove_file(&node->dev, &dev_attr_numastat);
 	device_remove_file(&node->dev, &dev_attr_distance);
 	device_remove_file(&node->dev, &dev_attr_vmstat);
+	device_remove_file(&node->dev, &dev_attr_memrange);
 
 	scan_unevictable_unregister_node(node);
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
-- 
1.7.11.7
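
As an illustration of the address arithmetic, here is a userspace sketch that turns a starting page frame number and a spanned page count into an inclusive physical address range. It assumes 4 KiB pages; EXAMPLE_PAGE_SHIFT, struct phys_range and node_span_to_range() are hypothetical names, and the sketch uses 64-bit values throughout, which is a choice made for this example rather than something taken from the patch.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct phys_range {
	uint64_t start;	/* first byte covered by the span */
	uint64_t end;	/* last byte covered by the span (inclusive) */
};

/* Convert a starting pfn and a page count into a byte address range.
 * The caller is assumed to pass a non-empty span. */
static struct phys_range node_span_to_range(uint64_t start_pfn,
					    uint64_t spanned_pages)
{
	struct phys_range r;
	uint64_t end_pfn = start_pfn + spanned_pages;

	r.start = start_pfn << EXAMPLE_PAGE_SHIFT;
	r.end = (end_pfn << EXAMPLE_PAGE_SHIFT) - 1;
	return r;
}
```

For example, a span of 0x100 pages starting at pfn 0x100 maps to the range 0x100000-0x1fffff.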





[PATCH] Documentation: ABI: /sys/devices/system/node/

2012-12-10 Thread Davidlohr Bueso
Describe NUMA node sysfs files/attributes.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
Note that where I couldn't find the specific dates and contacts,
I defaulted to October 2002 and linux-mm.

 Documentation/ABI/stable/sysfs-devices-node | 96 -
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node 
b/Documentation/ABI/stable/sysfs-devices-node
index 49b82ca..ce259c1 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -1,7 +1,101 @@
+What:  /sys/devices/system/node/possible
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Nodes that could be possibly become online at some point.
+
+What:  /sys/devices/system/node/online
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Nodes that are online.
+
+What:  /sys/devices/system/node/has_normal_memory
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Nodes that have regular memory.
+
+What:  /sys/devices/system/node/has_cpu
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Nodes that have one or more CPUs.
+
+What:  /sys/devices/system/node/has_high_memory
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Nodes that have regular or high memory.
+   Depends on CONFIG_HIGHMEM.
+
 What:  /sys/devices/system/node/nodeX
 Date:  October 2002
 Contact:   Linux Memory Management list linux...@kvack.org
 Description:
When CONFIG_NUMA is enabled, this is a directory containing
information on node X such as what CPUs are local to the
-   node.
+   node. Each file is detailed next.
+
+What:  /sys/devices/system/node/nodeX/cpumap
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   The node's cpumap.
+
+What:  /sys/devices/system/node/nodeX/cpulist
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   The CPUs associated to the node.
+
+What:  /sys/devices/system/node/nodeX/meminfo
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Provides information about the node's distribution and memory
+		utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt
+
+What:  /sys/devices/system/node/nodeX/numastat
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   The node's hit/miss statistics, in units of pages.
+   See Documentation/numastat.txt
+
+What:  /sys/devices/system/node/nodeX/distance
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   Distance between the node and all the other nodes
+   in the system.
+
+What:  /sys/devices/system/node/nodeX/vmstat
+Date:  October 2002
+Contact:   Linux Memory Management list linux...@kvack.org
+Description:
+   The node's zoned virtual memory statistics.
+   This is a superset of numastat.
+
+What:  /sys/devices/system/node/nodeX/compact
+Date:  February 2010
+Contact:   Mel Gorman m...@csn.ul.ie
+Description:
+   When this file is written to, all memory within that node
+   will be compacted. When it completes, memory will be freed
+   into blocks which have as many contiguous pages as possible
+
+What:  /sys/devices/system/node/nodeX/scan_unevictable_pages
+Date:  October 2008
+Contact:   Lee Schermerhorn lee.schermerh...@hp.com
+Description:
+   When set, it triggers scanning the node's unevictable lists
+		and move any pages that have become evictable onto the respective
+   zone's inactive list. See mm/vmscan.c
+
+What:  /sys/devices/system/node/nodeX/hugepages/hugepages-size/
+Date:  December 2009
+Contact:   Lee Schermerhorn lee.schermerh...@hp.com
+Description:
+   The node's huge page size control/query attributes.
+   See Documentation/vm/hugetlbpage.txt
\ No newline at end of file
-- 
1.7.11.7





[PATCH] Documentation: ABI: remove testing/sysfs-devices-node

2012-12-20 Thread Davidlohr Bueso
This file is already documented in the stable ABI (commit 5bbe1ec1).

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 Documentation/ABI/testing/sysfs-devices-node | 7 ---
 1 file changed, 7 deletions(-)
 delete mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node 
b/Documentation/ABI/testing/sysfs-devices-node
deleted file mode 100644
index 453a210..000
--- a/Documentation/ABI/testing/sysfs-devices-node
+++ /dev/null
@@ -1,7 +0,0 @@
-What:  /sys/devices/system/node/nodeX/compact
-Date:  February 2010
-Contact:   Mel Gorman m...@csn.ul.ie
-Description:
-   When this file is written to, all memory within that node
-   will be compacted. When it completes, memory will be freed
-   into blocks which have as many contiguous pages as possible
-- 
1.7.11.7





[PATCH 1/3] staging: zram: simplify num_devices paramater

2013-01-01 Thread Davidlohr Bueso
Simplify the handling of num_devices when initializing zram.
Also clean up some of the output messages.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/staging/zram/zram_drv.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index fb4a7c9..8115be9 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -40,7 +40,7 @@ static int zram_major;
 struct zram *zram_devices;
 
 /* Module params (documentation at end) */
-static unsigned int num_devices;
+static unsigned int num_devices = 1;
 
 static void zram_stat_inc(u32 *v)
 {
@@ -715,13 +715,7 @@ static int __init zram_init(void)
 		goto out;
 	}
 
-	if (!num_devices) {
-		pr_info("num_devices not specified. Using default: 1\n");
-		num_devices = 1;
-	}
-
 	/* Allocate the device array and initialize each one */
-	pr_info("Creating %u devices ...\n", num_devices);
 	zram_devices = kzalloc(num_devices * sizeof(struct zram), GFP_KERNEL);
 	if (!zram_devices) {
 		ret = -ENOMEM;
@@ -734,6 +728,8 @@ static int __init zram_init(void)
 			goto free_devices;
 	}
 
+	pr_info("Created %u device(s) ...\n", num_devices);
+
 	return 0;
 
 free_devices:
-- 
1.7.11.7






[PATCH 2/3] staging: zram: show correct disksize

2013-01-01 Thread Davidlohr Bueso
The ->disksize variable stores values in units of bytes;
print the correct size in kB.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/staging/zram/zram_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index 8115be9..10d7592 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -126,8 +126,7 @@ static void zram_set_disksize(struct zram *zram, size_t totalram_bytes)
 		"\tMemory Size: %zu kB\n"
 		"\tSize you selected: %llu kB\n"
 		"Continuing anyway ...\n",
-		totalram_bytes >> 10, zram->disksize
-		);
+		totalram_bytes >> 10, zram->disksize >> 10);
 	}
 
 	zram->disksize &= PAGE_MASK;
-- 
1.7.11.7
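
The unit conversion at the heart of this fix is a right shift by 10, i.e. an integer division by 1024. A minimal sketch, with bytes_to_kb() being a hypothetical helper name:

```c
#include <stdint.h>

/* Convert a byte count to whole kilobytes (1 kB = 1024 bytes here),
 * discarding any remainder. */
static uint64_t bytes_to_kb(uint64_t bytes)
{
	return bytes >> 10;
}
```

Printing a byte count next to a "kB" label without this shift overstates the size by a factor of 1024, which is the mismatch the patch addresses.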






[PATCH 3/3] staging: zram: drop zram_stat_dec/inc functions

2013-01-01 Thread Davidlohr Bueso
It seems like overkill to have dedicated functions for adding and
subtracting 1 from the 32-bit counters. Just do it directly.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/staging/zram/zram_drv.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c
index 10d7592..6762b99 100644
--- a/drivers/staging/zram/zram_drv.c
+++ b/drivers/staging/zram/zram_drv.c
@@ -42,16 +42,6 @@ struct zram *zram_devices;
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 
-static void zram_stat_inc(u32 *v)
-{
-	*v = *v + 1;
-}
-
-static void zram_stat_dec(u32 *v)
-{
-	*v = *v - 1;
-}
-
 static void zram_stat64_add(struct zram *zram, u64 *v, u64 inc)
 {
 	spin_lock(&zram->stat64_lock);
@@ -144,22 +134,22 @@ static void zram_free_page(struct zram *zram, size_t index)
 		 */
 		if (zram_test_flag(zram, index, ZRAM_ZERO)) {
 			zram_clear_flag(zram, index, ZRAM_ZERO);
-			zram_stat_dec(&zram->stats.pages_zero);
+			zram->stats.pages_zero--;
 		}
 		return;
 	}
 
 	if (unlikely(size > max_zpage_size))
-		zram_stat_dec(&zram->stats.bad_compress);
+		zram->stats.bad_compress--;
 
 	zs_free(zram->mem_pool, handle);
 
 	if (size <= PAGE_SIZE / 2)
-		zram_stat_dec(&zram->stats.good_compress);
+		zram->stats.good_compress--;
 
 	zram_stat64_sub(zram, &zram->stats.compr_size,
 			zram->table[index].size);
-	zram_stat_dec(&zram->stats.pages_stored);
+	zram->stats.pages_stored--;
 
 	zram->table[index].handle = 0;
 	zram->table[index].size = 0;
@@ -311,7 +301,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		kunmap_atomic(user_mem);
 		if (is_partial_io(bvec))
 			kfree(uncmem);
-		zram_stat_inc(&zram->stats.pages_zero);
+		zram->stats.pages_zero++;
 		zram_set_flag(zram, index, ZRAM_ZERO);
 		ret = 0;
 		goto out;
@@ -330,7 +320,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	}
 
 	if (unlikely(clen > max_zpage_size)) {
-		zram_stat_inc(&zram->stats.bad_compress);
+		zram->stats.bad_compress++;
 		src = uncmem;
 		clen = PAGE_SIZE;
 	}
@@ -353,9 +343,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 
 	/* Update stats */
 	zram_stat64_add(zram, &zram->stats.compr_size, clen);
-	zram_stat_inc(&zram->stats.pages_stored);
+	zram->stats.pages_stored++;
 	if (clen <= PAGE_SIZE / 2)
-		zram_stat_inc(&zram->stats.good_compress);
+		zram->stats.good_compress++;
 
 	return 0;
 
-- 
1.7.11.7






[PATCH] staging: zsmalloc: comment zs_create_pool function

2013-01-04 Thread Davidlohr Bueso
Just as with zs_malloc() and zs_map_object(), it is worth
formally commenting the zs_create_pool() function.

Signed-off-by: Davidlohr Bueso davidlohr.bu...@hp.com
---
 drivers/staging/zsmalloc/zsmalloc-main.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
b/drivers/staging/zsmalloc/zsmalloc-main.c
index 09a9d35..eb00772 100644
--- a/drivers/staging/zsmalloc/zsmalloc-main.c
+++ b/drivers/staging/zsmalloc/zsmalloc-main.c
@@ -798,6 +798,17 @@ fail:
return notifier_to_errno(ret);
 }
 
+/**
+ * zs_create_pool - Creates an allocation pool to work from.
+ * @name: name of the pool to be created
+ * @flags: allocation flags used when growing pool
+ *
+ * This function must be called before anything when using
+ * the zsmalloc allocator.
+ *
+ * On success, a pointer to the newly created pool is returned,
+ * otherwise NULL.
+ */
 struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 {
int i, ovhd_size;
-- 
1.7.11.7





Re: [PATCH] mm: add node physical memory range to sysfs

2012-12-12 Thread Davidlohr Bueso
On Fri, 2012-12-07 at 16:17 -0800, Dave Hansen wrote:
> On 12/07/2012 03:51 PM, Andrew Morton wrote:
> > > +static ssize_t node_read_memrange(struct device *dev,
> > > +				  struct device_attribute *attr, char *buf)
> > > +{
> > > +	int nid = dev->id;
> > > +	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> > > +	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> > hm.  Is this correct for all for
> > FLATMEM/SPARSEMEM/SPARSEMEM_VMEMMAP/DISCONTIGME/etc?
>
> It's not _wrong_ per se, but it's not super precise, either.
>
> The problem is, it's quite valid to have these node_start/spanned ranges
> overlap between two or more nodes on some hardware.  So, if the desired
> purpose is to map nodes to DIMMs, then this can only accomplish this on
> _some_ hardware, not all.  It would be completely useless for that
> purpose for some configurations.
>
> Seems like the better way to do this would be to expose the DIMMs
> themselves in some way, and then map _those_ back to a node.
>

Good point, and from a DIMM perspective, I agree, and will look into
this. However, IMHO, having the range of physical addresses for every
node still provides valuable information from a NUMA point of view, for
example when dealing with node-related e820 mappings.

Andrew, with the documentation patch, would you be willing to pick up a v2
of this?

Thanks,
Davidlohr



Re: [PATCH] mm: add node physical memory range to sysfs

2012-12-12 Thread Davidlohr Bueso
On Wed, 2012-12-12 at 17:48 -0800, Dave Hansen wrote:
> On 12/12/2012 05:18 PM, Davidlohr Bueso wrote:
> > On Fri, 2012-12-07 at 16:17 -0800, Dave Hansen wrote:
> >> Seems like the better way to do this would be to expose the DIMMs
> >> themselves in some way, and then map _those_ back to a node.
> >>
> > Good point, and from a DIMM perspective, I agree, and will look into
> > this. However, IMHO, having the range of physical addresses for every
> > node still provides valuable information, from a NUMA point of view. For
> > example, dealing with node related e820 mappings.
>
> But if we went and did it per-DIMM (showing which physical addresses and
> NUMA nodes a DIMM maps to), wouldn't that be redundant with this
> proposed interface?
>

If DIMMs overlap between nodes, then we wouldn't have an exact range for
a node in question. Having both approaches would complement each other.

> How do you plan to use this in practice, btw?
>

It started because I needed to recognize the address of a node to remove
it from the e820 mappings and have the system ignore the node's
memory.

Thanks,
Davidlohr



Re: [PATCH] mm: add node physical memory range to sysfs

2012-12-13 Thread Davidlohr Bueso
On Wed, 2012-12-12 at 20:49 -0800, Dave Hansen wrote:
> On 12/12/2012 06:03 PM, Davidlohr Bueso wrote:
> > On Wed, 2012-12-12 at 17:48 -0800, Dave Hansen wrote:
> >> But if we went and did it per-DIMM (showing which physical addresses and
> >> NUMA nodes a DIMM maps to), wouldn't that be redundant with this
> >> proposed interface?
> >>
> > If DIMMs overlap between nodes, then we wouldn't have an exact range for
> > a node in question. Having both approaches would complement each other.
>
> How is that possible?  If NUMA nodes are defined by distances from CPUs
> to memory, how could a DIMM have more than a single distance to any
> given CPU?

Can't this occur when interleaving emulated nodes with physical ones?

>
> >> How do you plan to use this in practice, btw?
> >>
> > It started because I needed to recognize the address of a node to remove
> > it from the e820 mappings and have the system ignore the node's
> > memory.
>
> Actually, now that I think about it, can you check in the
> /sys/devices/system/ directories for memory and nodes?  We have linkages
> there for each memory section to every NUMA node, and you can also
> derive the physical address from the phys_index in each section.  That
> should allow you to work out physical addresses for a given node.
>

I had looked at the memory-hotplug interface but found that this
'phys_index' doesn't include holes, while ->node_spanned_pages does.

Thanks,
Davidlohr



[PATCH RESEND] PM/Hibernate: use rb_entry

2012-10-01 Thread Davidlohr Bueso
Since the software suspend extents are organized in an rbtree, use rb_entry
instead of container_of, as it is semantically more appropriate in order to
get a node as it is iterated.

Signed-off-by: Davidlohr Bueso d...@gnu.org
---
 kernel/power/swap.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 3c9d764..7c33ed2 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -126,7 +126,7 @@ static int swsusp_extents_insert(unsigned long swap_offset)
 
/* Figure out where to put the new node */
while (*new) {
-   ext = container_of(*new, struct swsusp_extent, node);
+   ext = rb_entry(*new, struct swsusp_extent, node);
parent = *new;
 		if (swap_offset < ext->start) {
/* Try to merge */
-- 
1.7.9.5
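
For context, rb_entry() is a thin alias for container_of(), so the change is about readability rather than behaviour. The following userspace sketch shows the idea; the macros are stand-ins written for this example, and struct rb_node_stub, struct extent_stub and extent_from_node() are hypothetical names, not the kernel's definitions.

```c
#include <stddef.h>

/* Userspace stand-ins; the kernel provides its own versions. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

/* Hypothetical placeholder for the tree node embedded in each entry. */
struct rb_node_stub {
	struct rb_node_stub *left, *right;
};

/* Hypothetical stand-in for an extent that embeds a tree node. The node
 * is deliberately not the first member, so the pointer adjustment done
 * by container_of() is visible. */
struct extent_stub {
	unsigned long start;
	struct rb_node_stub node;
	unsigned long end;
};

/* Recover the enclosing extent from a pointer to its embedded node. */
static struct extent_stub *extent_from_node(struct rb_node_stub *n)
{
	return rb_entry(n, struct extent_stub, node);
}
```

Using the rb_entry() spelling at tree-walking call sites signals that the pointer being converted is an rbtree node, which is the semantic point made in the commit message.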





srat: harsh hot-pluggable memory check?

2013-01-10 Thread Davidlohr Bueso
When parsing the memory affinity mappings in
arch/x86/mm/srat.c:acpi_numa_memory_affinity_init(), I'm wondering if the
hot-pluggable check is too harsh, as we treat it as an error when the
hot-pluggable bit is set and CONFIG_MEMORY_HOTPLUG is not.

Based on the ACPI specs (v5):

If the Enabled bit is set and the Hot Pluggable bit is also set. The
system hardware supports hot-add and hot-remove of this memory
region.

This only mentions that the system supports hot-plugging, and IMHO if the
user decides not to use CONFIG_MEMORY_HOTPLUG, it shouldn't be considered an
error. Therefore, would it be OK to drop the check? Or am I missing something?

Thanks,
Davidlohr



Re: srat: harsh hot-pluggable memory check?

2013-01-11 Thread Davidlohr Bueso
On Thu, 2013-01-10 at 21:02 +0100, Andi Kleen wrote:
> > This only mentions that the system supports hot-plugging, and IMHO if the
> > user decides not to use CONFIG_MEMORY_HOTPLUG, it shouldn't be considered
> > an error.
> > Therefore would it be ok to drop the check? Or am I missing something?
>
> The very strict checks were originally implemented because various early
> BIOS had largely fictional SRATs, and trusting them blindly caused
> boot failures or a lot of wasted memory for unnecessary hotplug zones.
> The wasted memory was mainly a problem with the old memory hotplug
> implementation that pre-allocated memmaps, that's not a problem anymore.
> However there may be still some other failure cases.
>

Would you be willing to take a patch that drops this check then? Or do
you see any other scenario where it would still be valid?

Thanks,
Davidlohr



[PATCH] ACPI: SRAT: report non-volatile memory in debug

2013-01-08 Thread Davidlohr Bueso
Just as with the other memory affinity flags, report
non-volatile memory with ACPI debug.

Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
---
 drivers/acpi/numa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index cb31298..68077ac 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -116,12 +116,14 @@ acpi_table_print_srat_entry(struct acpi_subtable_header *header)
 			struct acpi_srat_mem_affinity *p =
 			    (struct acpi_srat_mem_affinity *)header;
 			ACPI_DEBUG_PRINT((ACPI_DB_INFO,
-					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s\n",
+					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s%s\n",
 					  (unsigned long)p->base_address,
 					  (unsigned long)p->length,
 					  p->proximity_domain,
 					  (p->flags & ACPI_SRAT_MEM_ENABLED)?
 					  "enabled" : "disabled",
+					  (p->flags & ACPI_SRAT_MEM_NON_VOLATILE)?
+					  " non-volatile" : "",
 					  (p->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)?
 					  " hot-pluggable" : ""));
 		}
-- 
1.7.11.7





[PATCH] x86: srat: simplify memory affinity init error handling

2013-01-08 Thread Davidlohr Bueso
The acpi_numa_memory_affinity_init() function can fail in several
scenarios, use a single point of error return.

Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
---
 arch/x86/mm/srat.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 4ddf497..1100423 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -149,39 +149,40 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 	int node, pxm;
 
 	if (srat_disabled())
-		return -1;
-	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
-		bad_srat();
-		return -1;
-	}
+		goto err;
+	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity))
+		goto badsrat;
 	if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
-		return -1;
-
+		goto err;
 	if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
-		return -1;
+		goto err;
+
 	start = ma->base_address;
 	end = start + ma->length;
 	pxm = ma->proximity_domain;
 	if (acpi_srat_revision <= 1)
 		pxm &= 0xff;
+
 	node = setup_node(pxm);
 	if (node < 0) {
 		printk(KERN_ERR "SRAT: Too many proximity domains.\n");
-		bad_srat();
-		return -1;
+		goto badsrat;
 	}
 
-	if (numa_add_memblk(node, start, end) < 0) {
-		bad_srat();
-		return -1;
-	}
+	if (numa_add_memblk(node, start, end) < 0)
+		goto badsrat;
 
 	node_set(node, numa_nodes_parsed);
 
 	printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
 	       node, pxm,
 	       (unsigned long long) start, (unsigned long long) end - 1);
+
 	return 0;
+badsrat:
+	bad_srat();
+err:
+	return -1;
 }
 
 void __init acpi_numa_arch_fixup(void) {}
-- 
1.7.11.7
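
The single-exit pattern used in the patch above can be tried outside the
kernel. What follows is only a user-space sketch, not the kernel code:
struct region, REGION_ENABLED, mark_bad() and region_init() are hypothetical
names invented for illustration, with mark_bad() standing in for the side
effect of bad_srat(). It shows two stacked labels, where "bad" falls through
into "err", so that only table-corruption paths trigger the side effect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter standing in for the side effect of bad_srat(). */
static int bad_count;

static void mark_bad(void)
{
	bad_count++;
}

#define REGION_ENABLED 0x1u	/* hypothetical flag, for illustration only */

/* Hypothetical, simplified stand-in for a memory affinity entry. */
struct region {
	unsigned int length;	/* size the entry claims to have */
	unsigned int flags;
	unsigned long size;	/* size of the described memory range */
};

/*
 * Returns 0 on success and -1 on any failure. A malformed entry jumps to
 * "bad", which records the problem and then falls through to "err"; an
 * entry that is merely disabled jumps straight to "err".
 */
static int region_init(const struct region *r, unsigned int expected_len)
{
	assert(r != NULL);

	if (r->length != expected_len)
		goto bad;
	if (!(r->flags & REGION_ENABLED))
		goto err;
	if (r->size == 0)
		goto bad;

	return 0;
bad:
	mark_bad();
err:
	return -1;
}
```

Ordering the labels this way keeps the "quiet" failures and the "table is
broken" failures distinct while still leaving one place that returns -1.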





Re: [PATCH] ACPI: SRAT: report non-volatile memory in debug

2013-01-08 Thread Davidlohr Bueso
On Wed, 2013-01-09 at 01:34 +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 08, 2013 04:15:56 PM Davidlohr Bueso wrote:
> > Just as with the other memory affinity flags, report
> > non-volatile memory with ACPI debug.
>
> Looks kind of good, but -
>
> > Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
> > ---
> >  drivers/acpi/numa.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > index cb31298..68077ac 100644
> > --- a/drivers/acpi/numa.c
> > +++ b/drivers/acpi/numa.c
> > @@ -116,12 +116,14 @@ acpi_table_print_srat_entry(struct acpi_subtable_header *header)
> >  			struct acpi_srat_mem_affinity *p =
> >  			    (struct acpi_srat_mem_affinity *)header;
> >  			ACPI_DEBUG_PRINT((ACPI_DB_INFO,
> > -					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s\n",
> > +					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s%s\n",
> >  					  (unsigned long)p->base_address,
> >  					  (unsigned long)p->length,
> >  					  p->proximity_domain,
> >  					  (p->flags & ACPI_SRAT_MEM_ENABLED)?
> >  					  "enabled" : "disabled",
> > +					  (p->flags & ACPI_SRAT_MEM_NON_VOLATILE)?
> > +					  " non-volatile" : "",
>
> - why did you put non-volatile before hot-pluggable?

No particular reason. Should I send a v2 with non-volatile at the end?

>
> >  					  (p->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)?
> >  					  " hot-pluggable" : ""));
> >  		}
>
> Rafael
>
>




Re: [PATCH] staging: zsmalloc: comment zs_create_pool function

2013-01-09 Thread Davidlohr Bueso
ping?

On Fri, 2013-01-04 at 12:14 -0800, Davidlohr Bueso wrote:
> Just as with zs_malloc() and zs_map_object(), it is worth
> formally commenting the zs_create_pool() function.
>
> Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
> ---
>  drivers/staging/zsmalloc/zsmalloc-main.c | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
> index 09a9d35..eb00772 100644
> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> @@ -798,6 +798,17 @@ fail:
>  	return notifier_to_errno(ret);
>  }
>
> +/**
> + * zs_create_pool - Creates an allocation pool to work from.
> + * @name: name of the pool to be created
> + * @flags: allocation flags used when growing pool
> + *
> + * This function must be called before anything when using
> + * the zsmalloc allocator.
> + *
> + * On success, a pointer to the newly created pool is returned,
> + * otherwise NULL.
> + */
>  struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>  {
>  	int i, ovhd_size;




[PATCH v2] ACPI: SRAT: report non-volatile memory in debug

2013-01-09 Thread Davidlohr Bueso
Just as with the other memory affinity flags, report
non-volatile memory with ACPI debug.

Signed-off-by: Davidlohr Bueso <davidlohr.bu...@hp.com>
---
 drivers/acpi/numa.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index cb31298..2935d3a 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -116,14 +116,16 @@ acpi_table_print_srat_entry(struct acpi_subtable_header *header)
 			struct acpi_srat_mem_affinity *p =
 			    (struct acpi_srat_mem_affinity *)header;
 			ACPI_DEBUG_PRINT((ACPI_DB_INFO,
-					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s\n",
+					  "SRAT Memory (0x%lx length 0x%lx) in proximity domain %d %s%s%s\n",
 					  (unsigned long)p->base_address,
 					  (unsigned long)p->length,
 					  p->proximity_domain,
 					  (p->flags & ACPI_SRAT_MEM_ENABLED)?
 					  "enabled" : "disabled",
 					  (p->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)?
-					  " hot-pluggable" : ""));
+					  " hot-pluggable" : "",
+					  (p->flags & ACPI_SRAT_MEM_NON_VOLATILE)?
+					  " non-volatile" : ""));
 		}
 #endif				/* ACPI_DEBUG_OUTPUT */
 		break;
-- 
1.7.11.7
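
For readers who want to see what the three "%s" conversions produce, here is
a small user-space sketch of the same formatting idea. It is an illustration
under stated assumptions, not the ACPICA code: the MEM_* macros and
describe_flags() are hypothetical names, and the bit positions are chosen
for the example only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flag bits, chosen for illustration only. */
#define MEM_ENABLED       (1u << 0)
#define MEM_HOT_PLUGGABLE (1u << 1)
#define MEM_NON_VOLATILE  (1u << 2)

/*
 * Writes a description of the flags into buf, in the order used by the
 * v2 patch: enabled/disabled first, then the optional attributes, each
 * carrying its own leading space so that absent ones leave no gap.
 * Returns what snprintf() returns: the length the full string needs.
 */
static int describe_flags(unsigned int flags, char *buf, size_t len)
{
	assert(buf != NULL);

	return snprintf(buf, len, "%s%s%s",
			(flags & MEM_ENABLED) ? "enabled" : "disabled",
			(flags & MEM_HOT_PLUGGABLE) ? " hot-pluggable" : "",
			(flags & MEM_NON_VOLATILE) ? " non-volatile" : "");
}
```

Putting the separator inside each optional string, rather than in the format,
is what lets an unset flag contribute an empty string without leaving a
trailing space.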





[PATCH 6/8] partitions/efi: compare first and last usable LBAs

2013-08-05 Thread Davidlohr Bueso
When verifying GPT header integrity, make sure that
first usable LBA is smaller than last usable LBA.

Signed-off-by: Davidlohr Bueso <davidl...@hp.com>
---
 block/partitions/efi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index ab6cd08..9a81c3b 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -409,7 +409,12 @@ static int is_gpt_valid(struct parsed_partitions *state, u64 lba,
 			 (unsigned long long)lastlba);
 		goto fail;
 	}
-
+	if (le64_to_cpu((*gpt)->last_usable_lba) < le64_to_cpu((*gpt)->first_usable_lba)) {
+		pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
+			 (unsigned long long)le64_to_cpu((*gpt)->last_usable_lba),
+			 (unsigned long long)le64_to_cpu((*gpt)->first_usable_lba));
+		goto fail;
+	}
 	/* Check that sizeof_partition_entry has the correct value */
 	if (le32_to_cpu((*gpt)->sizeof_partition_entry) != sizeof(gpt_entry)) {
 		pr_debug("GUID Partitition Entry Size check failed.\n");
-- 
1.7.11.7
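
The ordering check this patch adds can be exercised in isolation. The sketch
below is a user-space illustration, not the kernel function: struct gpt_lbas
and gpt_usable_range_ok() are hypothetical names, and the fields are assumed
to be already converted to host byte order (the kernel code does that with
le64_to_cpu()). It combines the new comparison with the two existing bounds
checks against the device's last LBA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the two header fields involved;
 * values are assumed to be in host byte order already.
 */
struct gpt_lbas {
	uint64_t first_usable_lba;
	uint64_t last_usable_lba;
};

/*
 * Returns true when both usable LBAs lie within the device and the usable
 * range is not inverted (last must not be smaller than first).
 */
static bool gpt_usable_range_ok(const struct gpt_lbas *h, uint64_t lastlba)
{
	assert(h != NULL);

	if (h->first_usable_lba > lastlba)
		return false;
	if (h->last_usable_lba > lastlba)
		return false;
	if (h->last_usable_lba < h->first_usable_lba)
		return false;

	return true;
}
```

Without the third test, a header whose two fields were swapped would pass
both bounds checks as long as each value fit on the disk.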



[PATCH 5/8] partitions/efi: account for pmbr size in lba

2013-08-05 Thread Davidlohr Bueso
The partition that has the 0xEE (GPT protective), must
have the size in lba field set to the lesser of the size
of the disk minus one or 0xFFFFFFFF for larger disks.

Signed-off-by: Davidlohr Bueso <davidl...@hp.com>
---
 block/partitions/efi.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 4bf8165..ab6cd08 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -166,6 +166,7 @@ invalid:
 /**
  * is_pmbr_valid(): test Protective MBR for validity
  * @mbr: pointer to a legacy mbr structure
+ * @total_sectors: amount of sectors in the device
  *
  * Description: Checks for a valid protective or hybrid
  * master boot record (MBR). The validity of a pMBR depends
@@ -180,9 +181,9 @@ invalid:
  * Returns 0 upon invalid MBR, or GPT_MBR_PROTECTIVE or
  * GPT_MBR_HYBRID depending on the device layout.
  */
-static int is_pmbr_valid(legacy_mbr *mbr)
+static int is_pmbr_valid(legacy_mbr *mbr, sector_t total_sectors)
 {
-	int i, ret = 0; /* invalid by default */
+	int i, part = 0, ret = 0; /* invalid by default */
 
 	if (!mbr || le16_to_cpu(mbr->signature) != MSDOS_MBR_SIGNATURE)
 		goto done;
@@ -190,6 +191,7 @@ static int is_pmbr_valid(legacy_mbr *mbr)
 	for (i = 0; i < 4; i++) {
 		ret = pmbr_part_valid(&mbr->partition_record[i]);
 		if (ret == GPT_MBR_PROTECTIVE) {
+			part = i;
 			/*
 			 * Ok, we at least know that there's a protective MBR,
 			 * now check if there are other partition types for
@@ -206,6 +208,18 @@ check_hybrid:
 		if ((mbr->partition_record[i].os_type != EFI_PMBR_OSTYPE_EFI_GPT) &&
 		    (mbr->partition_record[i].os_type != 0x00))
 			ret = GPT_MBR_HYBRID;
+
+	/*
+	 * Protective MBRs take up the lesser of the whole disk
+	 * or 2 TiB (32bit LBA), ignoring the rest of the disk.
+	 *
+	 * Hybrid MBRs do not necessarily comply with this.
+	 */
+	if (ret == GPT_MBR_PROTECTIVE) {
+		if (le32_to_cpu(mbr->partition_record[part].size_in_lba) !=
+		    min((uint32_t) total_sectors - 1, 0xFFFFFFFF))
+			ret = 0;
+	}
 done:
 	return ret;
 }
@@ -567,6 +581,7 @@ static int find_valid_gpt(struct parsed_partitions *state, gpt_header **gpt,
 	gpt_header *pgpt = NULL, *agpt = NULL;
 	gpt_entry *pptes = NULL, *aptes = NULL;
 	legacy_mbr *legacymbr;
+	sector_t total_sectors = i_size_read(state->bdev->bd_inode) >> 9;
 	u64 lastlba;
 
 	if (!ptes)
@@ -580,7 +595,7 @@ static int find_valid_gpt(struct parsed_partitions *state, gpt_header **gpt,
 		goto fail;
 
 	read_lba(state, 0, (u8 *) legacymbr, sizeof (*legacymbr));
-	good_pmbr = is_pmbr_valid(legacymbr);
+	good_pmbr = is_pmbr_valid(legacymbr, total_sectors);
 	kfree(legacymbr);
 
 	if (!good_pmbr)
-- 
1.7.11.7
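
The rule in the commit message, "the lesser of the size of the disk minus one
or 0xFFFFFFFF", can be written as a small stand-alone helper. This is a
user-space sketch under stated assumptions, not the code from the patch:
pmbr_expected_size() is a hypothetical name, and total_sectors is assumed to
be a count of 512-byte sectors. Note one deliberate difference: the sketch
subtracts and clamps in 64-bit arithmetic before narrowing, whereas the hunk
above casts total_sectors to uint32_t first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Expected value of the size_in_lba field of a protective MBR record:
 * the disk size minus one sector, capped at the largest value a 32-bit
 * field can hold. The subtraction is done on the 64-bit value so that
 * disks larger than 2 TiB clamp to UINT32_MAX instead of wrapping.
 */
static uint32_t pmbr_expected_size(uint64_t total_sectors)
{
	uint64_t size;

	assert(total_sectors > 0);

	size = total_sectors - 1;
	return size > UINT32_MAX ? UINT32_MAX : (uint32_t)size;
}
```

With 512-byte sectors, 0xFFFFFFFF sectors is just under 2 TiB, which is where
the "2 TiB (32bit LBA)" figure in the patch comment comes from.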



[PATCH 2/8] partitions/efi: check pmbr record's starting lba

2013-08-05 Thread Davidlohr Bueso
Per the UEFI Specs 2.4, June 2013, the starting lba of the partition
that has the EFI GPT (0xEE) must be set to 0x0001 - this is obviously
the LBA of the GPT Partition Header.

Signed-off-by: Davidlohr Bueso <davidl...@hp.com>
---
 block/partitions/efi.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 3ebd3d8..6a997b1 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -151,9 +151,18 @@ static u64 last_lba(struct block_device *bdev)
 
 static inline int pmbr_part_valid(gpt_record *part)
 {
-	if (part->os_type == EFI_PMBR_OSTYPE_EFI_GPT &&
-	    le32_to_cpu(part->start_sector) == 1UL)
-		return 1;
+	if (part->os_type != EFI_PMBR_OSTYPE_EFI_GPT)
+		goto invalid;
+
+	/* set to 0x0001 (i.e., the LBA of the GPT Partition Header) */
+	if (le32_to_cpu(part->starting_lba) != GPT_PRIMARY_PARTITION_TABLE_LBA)
+		goto invalid;
+
+	if (le32_to_cpu(part->start_sector) != 1UL)
+		goto invalid;
+
+	return 1;
+invalid:
 	return 0;
 }
 
-- 
1.7.11.7


