Re: [PATCH 21/36] x86/tdx: Remove TDX_HCALL_ISSUE_STI

2022-06-13 Thread Lai Jiangshan


On Wed, Jun 8, 2022 at 10:48 PM Peter Zijlstra  wrote:
>
> Now that arch_cpu_idle() is expected to return with IRQs disabled,
> avoid the useless STI/CLI dance.
>
> Per the specs this is supposed to work, but nobody has yet relied upon
> this behaviour, so broken implementations are possible.

I'm a total newbie here.

The point of safe_halt() is that STI must be used, and used directly
before HLT, so that IRQs are enabled during the halt and any arriving
IRQ ends the halt.

In the TDX case, STI must be used directly before the hypercall.
Otherwise, no IRQ can arrive and the vcpu would stall forever.
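
As a reference for the non-TDX case described above, here is a minimal,
hedged sketch (not code from this patch series) of the classic x86 idle
pattern; it is roughly what the kernel's native safe_halt() boils down to:

	/*
	 * Sketch only: STI takes effect after the following instruction
	 * (the "STI shadow"), so "sti; hlt" enables IRQs and halts without
	 * a window in which a wakeup interrupt could be missed.
	 */
	static inline void sketch_safe_halt(void)
	{
		asm volatile("sti; hlt" : : : "memory");
	}

In the TDX case the HLT is replaced by a hypercall, so, as the mail argues,
the same "enable IRQs immediately before" requirement falls on the
instructions leading into that hypercall.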

The hypercall does have an "irq_disabled" argument, but the hypervisor
doesn't (and can't) touch the guest's IRQ flags no matter what that
argument is.  IRQs are not enabled during the halt if they are disabled
before the hypercall, even with irq_disabled=false.

The "irq_disabled" argument is used for workaround purposes:
https://lore.kernel.org/kvm/c020ee0b90c424a7010e979c9b32a28e9c488a51.1651774251.git.isaku.yamah...@intel.com/

I hope my (possibly immature or incorrect) reply elicits a real response
from others.

Thanks
Lai


Re: [PATCH] kthread: kthread_bind fails to enforce CPU affinity (fixes kernel BUG at kernel/smpboot.c:134!)

2014-12-08 Thread Lai Jiangshan
On 12/08/2014 09:54 PM, Steven Rostedt wrote:
 On Mon,  8 Dec 2014 14:27:01 +1100
 Anton Blanchard an...@samba.org wrote:
 
 I have a busy ppc64le KVM box where guests sometimes hit the infamous
 kernel BUG at kernel/smpboot.c:134! issue during boot:

 BUG_ON(td->cpu != smp_processor_id());

 Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
 output confirms it:

 CPU: 0
 Comm: watchdog/130

 The issue is in kthread_bind where we set the cpus_allowed mask, but do
 not touch task_thread_info(p)->cpu. The scheduler assumes the previously
 scheduled CPU is in the cpus_allowed mask, but in this case we are
 moving a thread to another CPU so it is not.

 
 Does this happen always on boot up, and always with the watchdog thread?
 
 I followed the logic that starts the watchdog threads.
 
 watchdog_enable_all_cpus()
   smpboot_register_percpu_thread() {
 
 for_each_online_cpu(cpu) { ... }
 
 Where watchdog_enable_all_cpus() can be called by
 lockup_detector_init() before SMP is started, but also by
 proc_dowatchdog() which is called by the sysctl commands (after SMP is
 up and running).
 
 I noticed there's no get_online_cpus() anywhere, although the
 unregister_percpu_thread() has it. Is it possible that we created a
 thread on a CPU that wasn't fully online yet?
 
 Perhaps the following patch is needed? Even if this isn't the solution
 to this bug, it is probably needed as watchdog_enable_all_cpus() can be
 called after boot up too.
 
 -- Steve


Hi, Steven, tglx

See this https://lkml.org/lkml/2014/7/30/804
[PATCH] smpboot: add missing get_online_cpus() when register


Thanks,
Lai

 
 diff --git a/kernel/smpboot.c b/kernel/smpboot.c
 index eb89e1807408..60d35ac5d3f1 100644
 --- a/kernel/smpboot.c
 +++ b/kernel/smpboot.c
 @@ -279,6 +279,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
 	unsigned int cpu;
 	int ret = 0;
 
 +	get_online_cpus();
 	mutex_lock(&smpboot_threads_lock);
 	for_each_online_cpu(cpu) {
 		ret = __smpboot_create_thread(plug_thread, cpu);
 @@ -291,6 +292,7 @@ int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
 	list_add(&plug_thread->list, &hotplug_threads);
  out:
 	mutex_unlock(&smpboot_threads_lock);
 +	put_online_cpus();
 	return ret;
  }
  EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
 --


Re: [RFC 1/2] workqueue: use the nearest NUMA node, not the local one

2014-07-18 Thread Lai Jiangshan
Hi,

I'm curious about what happens when alloc_pages_node(memoryless_node) is called.

If the memory is allocated from the most preferable node for the
@memoryless_node, why do we need to bother and use cpu_to_mem() at the
caller site?

If not, why does the memory allocation subsystem refuse to find a preferable
node for @memoryless_node in this case? Is that intentional, or can it simply
not find one in some cases?
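
As a rough, hedged illustration of the two APIs in question (the helper below
is hypothetical and not from the patch; cpu_to_node(), cpu_to_mem() and
alloc_pages_node() are existing kernel interfaces):

	/* Hypothetical sketch, not code from this patch. */
	static struct page *alloc_near_this_cpu(void)
	{
		int cpu = raw_smp_processor_id();
		int nid = cpu_to_node(cpu);	/* topological node; may be memoryless */
		int mem = cpu_to_mem(cpu);	/* nearest node that actually has memory */

		pr_debug("cpu %d: node %d, nearest memory node %d\n", cpu, nid, mem);

		/*
		 * The question above: does alloc_pages_node(nid, ...) already
		 * fall back to a node with memory when nid is memoryless, or
		 * must callers pass mem explicitly, as the patch does?
		 */
		return alloc_pages_node(mem, GFP_KERNEL, 0);
	}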

Thanks,
Lai

Added CC to Tejun (workqueue maintainer).

On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
 In the presence of memoryless nodes, the workqueue code incorrectly uses
 cpu_to_node() to determine what node to prefer memory allocations come
 from. cpu_to_mem() should be used instead, which will use the nearest
 NUMA node with memory.
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 diff --git a/kernel/workqueue.c b/kernel/workqueue.c
 index 35974ac..0bba022 100644
 --- a/kernel/workqueue.c
 +++ b/kernel/workqueue.c
 @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	for_each_node(node) {
 		if (cpumask_subset(pool->attrs->cpumask,
 				   wq_numa_possible_cpumask[node])) {
 -			pool->node = node;
 +			/*
 +			 * We could use local_memory_node(node) here,
 +			 * but it is expensive and the following caches
 +			 * the same value.
 +			 */
 +			pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
 			break;
 		}
 	}
 @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
 	pool->cpu = cpu;
 	cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 	pool->attrs->nice = std_nice[i++];
 -	pool->node = cpu_to_node(cpu);
 +	pool->node = cpu_to_mem(cpu);
 
 	/* alloc pool ID */
 	mutex_lock(&wq_pool_mutex);
 
 --


[PATCH 1/9] powerpc,kvm: fix imbalance srcu_read_[un]lock()

2013-03-15 Thread Lai Jiangshan
At the up_out label in kvmppc_hv_setup_htab_rma(), the SRCU read lock is
still held.

We have to release it before returning. (A hedged sketch of the resulting
pattern follows the diff below.)

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Gleb Natapov g...@redhat.com
Cc: Alexander Graf ag...@suse.de
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: k...@vger.kernel.org
Cc: kvm-...@vger.kernel.org
---
 arch/powerpc/kvm/book3s_hv.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80dcc53..c26740e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1799,7 +1799,7 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
 
  up_out:
 	up_read(&current->mm->mmap_sem);
-   goto out;
+   goto out_srcu;
 }
 
 int kvmppc_core_init_vm(struct kvm *kvm)
-- 
1.7.4.4
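
A hedged sketch of the invariant this one-line change restores (not the
actual book3s_hv.c code; do_setup() and do_more_setup() are hypothetical
placeholders):

	static int rma_setup_sketch(struct kvm *kvm)
	{
		/* srcu_read_lock() returns an index that must be passed back */
		int srcu_idx = srcu_read_lock(&kvm->srcu);
		int err;

		err = do_setup();		/* hypothetical helper */
		if (err)
			goto out_srcu;		/* not "goto out": that would skip the unlock */

		err = do_more_setup();		/* hypothetical helper */

	 out_srcu:
		srcu_read_unlock(&kvm->srcu, srcu_idx);
		return err;
	}

Every exit path, including the error path through up_out, must pass through
the srcu_read_unlock() before returning.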



Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:20, Oleg Nesterov wrote:
 On 03/02, Lai Jiangshan wrote:

 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 +	switch (__this_cpu_read(*lgrw->reader_refcnt)) {
 +	case 1:
 +		__this_cpu_write(*lgrw->reader_refcnt, 0);
 +		lg_local_unlock(&lgrw->lglock);
 +		return;
 +	case FALLBACK_BASE:
 +		__this_cpu_write(*lgrw->reader_refcnt, 0);
 +		read_unlock(&lgrw->fallback_rwlock);
 +		rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
 
 I guess case 1: should do rwlock_release() too.

It is already done in lg_local_unlock(&lgrw->lglock); before it returns.
(I like reusing old code.)

 
 Otherwise, at first glance looks correct...
 
 However, I still think that FALLBACK_BASE only adds the unnecessary
 complications. But even if I am right this is subjective of course, please
 feel free to ignore.

OK, I'll kill FALLBACK_BASE in a later patch.

 
 And btw, I am not sure about lg->lock_dep_map, perhaps we should use
 fallback_rwlock->dep_map ?

Using either one is OK.

 
 We need rwlock_acquire_read() even in the fast-path, and this acquire_read
 should be paired with rwlock_acquire() in _write_lock(), but it does
 spin_acquire(lg->lock_dep_map). Yes, currently this is the same (afaics)
 but perhaps fallback_rwlock->dep_map would be more clean.
 

I can't tell which one is better. I'll try to use fallback_rwlock->dep_map later.

 Oleg.
 
 --



Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:11, Srivatsa S. Bhat wrote:
 On 03/02/2013 06:44 PM, Lai Jiangshan wrote:
 From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
 From: Lai Jiangshan la...@cn.fujitsu.com
 Date: Sat, 2 Mar 2013 20:33:14 +0800
 Subject: [PATCH] lglock: add read-preference local-global rwlock

 Current lglock is not read-preference, so it can't be used on some cases
 which read-preference rwlock can do. Example, get_cpu_online_atomic().

 [...]
 diff --git a/kernel/lglock.c b/kernel/lglock.c
 index 6535a66..52e9b2c 100644
 --- a/kernel/lglock.c
 +++ b/kernel/lglock.c
 @@ -87,3 +87,71 @@ void lg_global_unlock(struct lglock *lg)
  preempt_enable();
  }
  EXPORT_SYMBOL(lg_global_unlock);
 +
 +#define FALLBACK_BASE	(1UL << 30)
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 +	struct lglock *lg = &lgrw->lglock;
 +
 +	preempt_disable();
 +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
 +		rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 +		if (unlikely(!arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
 +			read_lock(&lgrw->fallback_rwlock);
 +			__this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE);
 +			return;
 +		}
 +	}
 +
 +	__this_cpu_inc(*lgrw->reader_refcnt);
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
 +
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 +	switch (__this_cpu_read(*lgrw->reader_refcnt)) {
 +	case 1:
 +		__this_cpu_write(*lgrw->reader_refcnt, 0);
 +		lg_local_unlock(&lgrw->lglock);
 +		return;
 
 This should be a break, instead of a return, right?
 Otherwise, there will be a preempt imbalance...


lockdep and preempt are handled in lg_local_unlock(&lgrw->lglock);

Thanks,
Lai

 
 +	case FALLBACK_BASE:
 +		__this_cpu_write(*lgrw->reader_refcnt, 0);
 +		read_unlock(&lgrw->fallback_rwlock);
 +		rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
 +		break;
 +	default:
 +		__this_cpu_dec(*lgrw->reader_refcnt);
 +		break;
 +	}
 +
 +	preempt_enable();
 +}
 
 
 Regards,
 Srivatsa S. Bhat
 
 --



Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-03-05 Thread Lai Jiangshan
On 02/03/13 03:47, Srivatsa S. Bhat wrote:
 On 03/01/2013 11:20 PM, Lai Jiangshan wrote:
 On 28/02/13 05:19, Srivatsa S. Bhat wrote:
 On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
 On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:

 Hi Lai,

 I'm really not convinced that piggy-backing on lglocks would help
 us in any way. But still, let me try to address some of the points
 you raised...

 On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 [...]

 Unfortunately, I see quite a few issues with the code above. IIUC, the
 writer and the reader both increment the same counters. So how will 
 the
 unlock() code in the reader path know when to unlock which of the 
 locks?

 The same as your code, the reader(which nested in write C.S.) just dec
 the counters.

 And that works fine in my case because the writer and the reader update
 _two_ _different_ counters.

 I can't find any magic in your code, they are the same counter.

 /*
  * It is desirable to allow the writer to acquire the 
 percpu-rwlock
  * for read (if necessary), without deadlocking or getting 
 complaints
  * from lockdep. To achieve that, just increment the 
 reader_refcnt of
  * this CPU - that way, any attempt by the writer to acquire the
  * percpu-rwlock for read, will get treated as a case of nested 
 percpu
  * reader, which is safe, from a locking perspective.
  */
 this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);


 Whoa! Hold on, were you really referring to _this_ increment when you said
 that, in your patch you would increment the refcnt at the writer? Then I 
 guess
 there is a major disconnect in our conversations. (I had assumed that you 
 were
 referring to the update of writer_signal, and were just trying to have a 
 single
 refcnt instead of reader_refcnt and writer_signal).

 https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d

 Sorry the name fallback_reader_refcnt misled you.

 [...]

 All I was considered is nested reader is seldom, so I always
 fallback to rwlock when nested.
 If you like, I can add 6 lines of code, the overhead is
 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()


 I'm assuming that calculation is no longer valid, considering that
 we just discussed how the per-cpu refcnt that you were using is quite
 unnecessary and can be removed.

 IIUC, the overhead with your code, as per above discussion would be:
 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).

 https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16

 Again, I'm so sorry the name fallback_reader_refcnt misled you.


 At this juncture I really have to admit that I don't understand your
 intentions at all. What are you really trying to prove? Without giving
 a single good reason why my code is inferior, why are you even bringing
 up the discussion about a complete rewrite of the synchronization code?
 http://article.gmane.org/gmane.linux.kernel.cross-arch/17103
 http://article.gmane.org/gmane.linux.power-management.general/31345

 I'm beginning to add 2 + 2 together based on the kinds of questions you
 have been asking...

 You posted a patch in this thread and started a discussion around it without
 even establishing a strong reason to do so. Now you point me to your git
 tree where your patches have even more traces of ideas being borrowed from
 my patchset (apart from my own ideas/code, there are traces of others' ideas
 being borrowed too - for example, it was Oleg who originally proposed the
 idea of splitting up the counter into 2 parts and I'm seeing that it is
 slowly crawling into your code with no sign of appropriate credits).
 http://article.gmane.org/gmane.linux.network/260288

 And in reply to my mail pointing out the performance implications of the
 global read_lock at the reader side in your code, you said you'll come up
 with a comparison between that and my patchset.
 http://article.gmane.org/gmane.linux.network/260288
 The issue has been well-documented in my patch description of patch 4.
 http://article.gmane.org/gmane.linux.kernel/1443258

 Are you really trying to pit bits and pieces of my own ideas/versions
 against one another and claiming them as your own?

 You projected the work involved in handling the locking issues pertaining
 to CPU_DYING notifiers etc as a TODO, despite the fact that I had explicitly
 noted in my cover letter that I had audited

Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:06, Oleg Nesterov wrote:
 On 03/02, Michel Lespinasse wrote:

 My version would be slower if it needs to take the
 slow path in a reentrant way, but I'm not sure it matters either :)
 
 I'd say, this doesn't matter at all, simply because this can only happen
 if we race with the active writer.
 

It can also happen when interrupted. (still very rarely)

arch_spin_trylock()
    ---> interrupted:
         __this_cpu_read() returns 0.
         arch_spin_trylock() fails
         slowpath, any nested will be slowpath too.
         ...
         ..._read_unlock()
    <--- interrupt returns
__this_cpu_inc()



I saw that get_online_cpu_atomic() is called very frequently.
The above happens rarely on one CPU, but how often does it happen
across the whole system if we have 4096 CPUs?
(Maybe I worry too much. I tend to remove FALLBACK_BASE now; we should
add it only after we have proved we need it, and that is not proved yet.)

Thanks,
Lai




[PATCH V2] lglock: add read-preference local-global rwlock

2013-03-02 Thread Lai Jiangshan
From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan la...@cn.fujitsu.com
Date: Sat, 2 Mar 2013 20:33:14 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

The current lglock is not read-preference, so it can't be used in some cases
where a read-preference rwlock is needed, for example get_cpu_online_atomic().

Although we could use an rwlock for the cases that need read-preference,
that leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It gets worse
as the number of CPUs grows; it does not scale.

So we turn to lglock. lglock is a read-write lock based on percpu locks,
but it is not read-preference due to its underlying percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

     CPU 0                               CPU 1
     ------                              ------
1.   spin_lock(&random_lock);            read_lock(&my_rwlock of CPU 1);
2.   read_lock(&my_rwlock of CPU 0);     spin_lock(&random_lock);

Writer:
     CPU 2:
     ------
     for_each_online_cpu(cpu)
        write_lock(&my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds                      Waiting for
 random_lock)  CPU 0 --------------------> CPU 2   (holds my_rwlock of CPU 0
                                                    for write)
                 ^                            |
                 |                            |
          Waiting|                            | Waiting
            for  |                            |   for
                 |                            V
                  <---------- CPU 1 <--------

                  (holds my_rwlock of
                   CPU 1 for read)


So obviously this straight-forward way of implementing percpu rwlocks is
deadlock-prone. So we can't implement read-preference local-global rwlock
like this.


The implementation in this patch reuses the current lglock as the frontend to
achieve local read locking, reuses a global fallback rwlock as the backend to
achieve read-preference, and uses a percpu reader counter to indicate 1) the
depth of the nested reader locks and 2) whether the outermost lock is the
percpu lock or the fallback rwlock.

The algorithm is simple. At the read site:
If it is a nested reader, just increase the counter.
If it is the outermost reader,
1) try to lock this cpu's lock of the frontend lglock
   (reader count += 1 on success);
2) if the above step fails, read_lock(fallback_rwlock)
   (reader count += FALLBACK_BASE + 1).

Write site:
Do the lg_global_lock() of the frontend lglock,
and then write_lock(fallback_rwlock).
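
A hedged sketch of the write side described above (simplified; the function
names are illustrative only and lockdep annotations are omitted):

	void lgrw_global_write_lock_sketch(struct lgrwlock *lgrw)
	{
		/* Step 1: take every per-cpu lock of the frontend lglock, so
		 * no new reader can win the per-cpu trylock fast path. */
		lg_global_lock(&lgrw->lglock);

		/* Step 2: take the fallback rwlock for write; this waits for
		 * the readers that lost the trylock and took the fallback path. */
		write_lock(&lgrw->fallback_rwlock);
	}

	void lgrw_global_write_unlock_sketch(struct lgrwlock *lgrw)
	{
		/* Release in reverse order. */
		write_unlock(&lgrw->fallback_rwlock);
		lg_global_unlock(&lgrw->lglock);
	}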


Proof:
1) reader-writer exclusion:
The write site must acquire all percpu locks and the fallback_rwlock.
The outermost read site must acquire one of these locks.

2) read-preference:
Before the write-site locking completes, the read site at least
wins at read_lock(fallback_rwlock), because rwlock is read-preference.

3) read-site functions are irqsafe (reentrance-safe)
   (read-site functions are not protected by disabled irqs, but they are irqsafe)
If a read-site function is interrupted at any point and the read site is
reentered, the reentered read site will not be misled by the first read site
when the reader counter > 0: in that case the frontend (this cpu's lock of the
lglock) or the backend (fallback rwlock) is currently held, so it is safe to
act as a nested reader.
If the reader counter == 0, the reentered reader considers itself the
outermost read site, and it always succeeds once the write side releases the
lock (even if the interrupted read site has already taken the cpu lock of the
lglock or the fallback_rwlock).
A reentered read site only calls arch_spin_trylock(), read_lock() and
__this_cpu_op(); arch_spin_trylock() and read_lock() are already
reentrance-safe. Although __this_cpu_op() is not reentrance-safe, the value
of the counter is restored after the interrupt finishes, so the read-site
functions are still reentrance-safe.


Performance:
We only focus on the performance of the read site. This read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock();
it has only one heavy memory operation, so it is expected to be fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by Srivatsa S. Bhat. (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

               nested=1 (no nested)   nested=2    nested=4
opt-rwlock            517181          1009200     2010027
lgrwlock              452897           700026     1201415
percpu-rwlock        1192955          1451343     1951757

The value is the time(nano-second) of 1 times of the operations
{
read-lock
[nested read

Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-02 Thread Lai Jiangshan
On 02/03/13 02:28, Oleg Nesterov wrote:
 Lai, I didn't read this discussion except the code posted by Michel.
 I'll try to read this patch carefully later, but I'd like to ask
 a couple of questions.
 
 This version looks more complex than Michel's, why? Just curious, I
 am trying to understand what I missed. See
 http://marc.info/?l=linux-kernelm=136196350213593

Michel changed my old draft version a little; his version is good enough for me.
My new version tries to add slightly better nesting support while only
adding a single __this_cpu_op() in _read_[un]lock().

 
 And I can't understand FALLBACK_BASE...
 
 OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.
 
 CPU_1 does _read_lock(), and ...
 
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 +	struct lglock *lg = &lgrw->lglock;
 +
 +	preempt_disable();
 +	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 +	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
 +		if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
 
 _trylock() fails,
 
 +			read_lock(&lgrw->fallback_rwlock);
 +			__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
 
 so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.
 
 CPU_0 does lg_global_unlock(&lgrw->lglock) and finishes _write_unlock().
 
 Interrupt handler on CPU_1 does _read_lock(), notices ->reader_refcnt != 0
 and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
 
 Then irq does _read_unlock(), and
 
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 +	switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
 +	case 0:
 +		lg_local_unlock(&lgrw->lglock);
 +		return;
 +	case FALLBACK_BASE:
 +		__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
 +		read_unlock(&lgrw->fallback_rwlock);
 
 hits this case?
 
 Doesn't look right, but most probably I missed something.

You are right, I just realized that I had split code which should be atomic.

I hope this patch (V2) can get more reviews.

Much of my locking knowledge, from the very beginning, was learned from Paul.
Paul, would you also review it?

Thanks,
Lai

 
 Oleg.
 
 --



[PATCH] lglock: add read-preference local-global rwlock

2013-03-01 Thread Lai Jiangshan
From c63f2be9a4cf7106a521dda169a0e14f8e4f7e3b Mon Sep 17 00:00:00 2001
From: Lai Jiangshan la...@cn.fujitsu.com
Date: Mon, 25 Feb 2013 23:14:27 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

The current lglock is not read-preference, so it can't be used in some cases
where a read-preference rwlock is needed, for example get_cpu_online_atomic().

Although we could use an rwlock for the cases that need read-preference,
that leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It gets worse
as the number of CPUs grows; it does not scale.

So we turn to lglock. lglock is a read-write lock based on percpu locks,
but it is not read-preference due to its underlying percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

     CPU 0                               CPU 1
     ------                              ------
1.   spin_lock(&random_lock);            read_lock(&my_rwlock of CPU 1);
2.   read_lock(&my_rwlock of CPU 0);     spin_lock(&random_lock);

Writer:
     CPU 2:
     ------
     for_each_online_cpu(cpu)
        write_lock(&my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds                      Waiting for
 random_lock)  CPU 0 --------------------> CPU 2   (holds my_rwlock of CPU 0
                                                    for write)
                 ^                            |
                 |                            |
          Waiting|                            | Waiting
            for  |                            |   for
                 |                            V
                  <---------- CPU 1 <--------

                  (holds my_rwlock of
                   CPU 1 for read)


So obviously this straight-forward way of implementing percpu rwlocks is
deadlock-prone. So we can't implement read-preference local-global rwlock
like this.


The implementation in this patch reuses the current lglock as the frontend to
achieve local read locking, reuses a global fallback rwlock as the backend to
achieve read-preference, and uses a percpu reader counter to indicate 1) the
depth of the nested reader locks and 2) whether the outermost lock is the
percpu lock or the fallback rwlock.

The algorithm is simple. At the read site:
If it is a nested reader, just increase the counter.
If it is the outermost reader,
1) try to lock this cpu's lock of the frontend lglock
   (reader count += 1 on success);
2) if the above step fails, read_lock(fallback_rwlock)
   (reader count += FALLBACK_BASE + 1).

Write site:
Do the lg_global_lock() of the frontend lglock,
and then write_lock(fallback_rwlock).


Proof:
1) reader-writer exclusion:
Once the two steps of the write site have finished, there are no readers;
and vice versa.

2) read-preference:
Before the write-site locking completes, the read site at least
wins at read_lock(fallback_rwlock), because rwlock is read-preference.
Read-preference also implies nestability.

3) read-site functions are irqsafe (reentrance-safe)
If a read-site function is interrupted at any point and the read site is
reentered, the reentered read site will not be misled by the first read site
when the reader counter > 0: in that case the frontend (this cpu's lock of the
lglock) or the backend (fallback rwlock) is currently held, so it is safe to
act as a nested reader.
If the reader counter == 0, the reentered reader considers itself the
outermost read site, and it always succeeds once the write side releases the
lock (even if the interrupted read site has already taken the cpu lock of the
lglock or the fallback_rwlock).
A reentered read site only calls arch_spin_trylock(), read_lock() and
__this_cpu_op(); arch_spin_trylock() and read_lock() are already
reentrance-safe. Although __this_cpu_op() is not reentrance-safe, the value
of the counter is restored after the interrupt finishes, so the read-site
functions are still reentrance-safe.


Performance:
We only focus on the performance of the read site. This read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock();
it has only one heavy memory operation, so it is expected to be fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by Srivatsa S. Bhat. (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

               nested=1 (no nested)   nested=2    nested=4
opt-rwlock            517181          1009200     2010027
lgrwlock              452897           700026     1201415
percpu-rwlock        1192955          1451343     1951757

The value is the time(nano-second) of 1 times of the operations:
{
read-lock
[nested read-lock]...
[nested read-unlock]...
read-unlock
}
If the way of this test is wrong

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-03-01 Thread Lai Jiangshan
On 28/02/13 05:19, Srivatsa S. Bhat wrote:
 On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
 On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:

 Hi Lai,

 I'm really not convinced that piggy-backing on lglocks would help
 us in any way. But still, let me try to address some of the points
 you raised...

 On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 [...]

 Unfortunately, I see quite a few issues with the code above. IIUC, the
 writer and the reader both increment the same counters. So how will the
 unlock() code in the reader path know when to unlock which of the locks?

 The same as your code, the reader(which nested in write C.S.) just dec
 the counters.

 And that works fine in my case because the writer and the reader update
 _two_ _different_ counters.

 I can't find any magic in your code, they are the same counter.

 /*
  * It is desirable to allow the writer to acquire the percpu-rwlock
  * for read (if necessary), without deadlocking or getting 
 complaints
  * from lockdep. To achieve that, just increment the reader_refcnt 
 of
  * this CPU - that way, any attempt by the writer to acquire the
  * percpu-rwlock for read, will get treated as a case of nested 
 percpu
  * reader, which is safe, from a locking perspective.
  */
  this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);


 Whoa! Hold on, were you really referring to _this_ increment when you said
 that, in your patch you would increment the refcnt at the writer? Then I 
 guess
 there is a major disconnect in our conversations. (I had assumed that you 
 were
 referring to the update of writer_signal, and were just trying to have a 
 single
 refcnt instead of reader_refcnt and writer_signal).

 https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d

 Sorry the name fallback_reader_refcnt misled you.

 [...]
 
 All I was considered is nested reader is seldom, so I always
 fallback to rwlock when nested.
 If you like, I can add 6 lines of code, the overhead is
 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()


 I'm assuming that calculation is no longer valid, considering that
 we just discussed how the per-cpu refcnt that you were using is quite
 unnecessary and can be removed.

 IIUC, the overhead with your code, as per above discussion would be:
 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).

 https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16

 Again, I'm so sorry the name fallback_reader_refcnt misled you.

 
 At this juncture I really have to admit that I don't understand your
 intentions at all. What are you really trying to prove? Without giving
 a single good reason why my code is inferior, why are you even bringing
 up the discussion about a complete rewrite of the synchronization code?
 http://article.gmane.org/gmane.linux.kernel.cross-arch/17103
 http://article.gmane.org/gmane.linux.power-management.general/31345
 
 I'm beginning to add 2 + 2 together based on the kinds of questions you
 have been asking...
 
 You posted a patch in this thread and started a discussion around it without
 even establishing a strong reason to do so. Now you point me to your git
 tree where your patches have even more traces of ideas being borrowed from
 my patchset (apart from my own ideas/code, there are traces of others' ideas
 being borrowed too - for example, it was Oleg who originally proposed the
 idea of splitting up the counter into 2 parts and I'm seeing that it is
 slowly crawling into your code with no sign of appropriate credits).
 http://article.gmane.org/gmane.linux.network/260288
 
 And in reply to my mail pointing out the performance implications of the
 global read_lock at the reader side in your code, you said you'll come up
 with a comparison between that and my patchset.
 http://article.gmane.org/gmane.linux.network/260288
 The issue has been well-documented in my patch description of patch 4.
 http://article.gmane.org/gmane.linux.kernel/1443258
 
 Are you really trying to pit bits and pieces of my own ideas/versions
 against one another and claiming them as your own?
 
 You projected the work involved in handling the locking issues pertaining
 to CPU_DYING notifiers etc as a TODO, despite the fact that I had explicitly
 noted in my cover letter that I had audited and taken care of all of them.
 http://article.gmane.org/gmane.linux.documentation/9727

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-26 Thread Lai Jiangshan
On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 [...]
 I wrote an untested draft here.

 Thanks,
 Lai

 PS: Some HA tools(I'm writing one) which takes checkpoints of
 virtual-machines frequently, I guess this patchset can speedup the
 tools.

 From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
 From: Lai Jiangshan la...@cn.fujitsu.com
 Date: Mon, 25 Feb 2013 23:14:27 +0800
 Subject: [PATCH] lglock: add read-preference local-global rwlock

 locality via lglock(trylock)
 read-preference read-write-lock via fallback rwlock_t

 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
  include/linux/lglock.h |   31 +++
  kernel/lglock.c        |   45 +
  2 files changed, 76 insertions(+), 0 deletions(-)

 diff --git a/include/linux/lglock.h b/include/linux/lglock.h
 index 0d24e93..30fe887 100644
 --- a/include/linux/lglock.h
 +++ b/include/linux/lglock.h
 @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
  void lg_global_lock(struct lglock *lg);
  void lg_global_unlock(struct lglock *lg);

 +struct lgrwlock {
 +	unsigned long __percpu *fallback_reader_refcnt;
 +	struct lglock lglock;
 +	rwlock_t fallback_rwlock;
 +};
 +
 +#define DEFINE_LGRWLOCK(name)						\
 +	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
 +	= __ARCH_SPIN_LOCK_UNLOCKED;					\
 +	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);		\
 +	struct lgrwlock name = {					\
 +		.fallback_reader_refcnt = &name ## _refcnt,		\
 +		.lglock = { .lock = &name ## _lock } }
 +
 +#define DEFINE_STATIC_LGRWLOCK(name)					\
 +	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
 +	= __ARCH_SPIN_LOCK_UNLOCKED;					\
 +	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);		\
 +	static struct lgrwlock name = {					\
 +		.fallback_reader_refcnt = &name ## _refcnt,		\
 +		.lglock = { .lock = &name ## _lock } }
 +
 +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
 +{
 +	lg_lock_init(&lgrw->lglock, name);
 +}
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
  #endif
 diff --git a/kernel/lglock.c b/kernel/lglock.c
 index 6535a66..463543a 100644
 --- a/kernel/lglock.c
 +++ b/kernel/lglock.c
 @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
   preempt_enable();
  }
  EXPORT_SYMBOL(lg_global_unlock);
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 +	struct lglock *lg = &lgrw->lglock;
 +
 +	preempt_disable();
 +	if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 +		if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
 +			rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 +			return;
 +		}
 +		read_lock(&lgrw->fallback_rwlock);
 +	}
 +
 +	__this_cpu_inc(*lgrw->fallback_reader_refcnt);
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
 +
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 +	if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 +		lg_local_unlock(&lgrw->lglock);
 +		return;
 +	}
 +
 +	if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
 +		read_unlock(&lgrw->fallback_rwlock);
 +
 +	preempt_enable();
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
 +

 If I read the code above correctly, all you are doing is implementing a
 recursive reader-side primitive (ie., allowing the reader to call these
 functions recursively, without resulting in a self-deadlock).

 But the thing is, making the reader-side recursive is the least of our
 problems! Our main challenge is to make the locking extremely flexible
 and also safe-guard it against circular-locking-dependencies and deadlocks.
 Please take a look at the changelog of patch 1 - it explains the situation
 with an example.


 My lock meets your requirements (I read patches 1-6 before I sent it). On the
 read site, the lglock's lock is taken via trylock, so the lglock doesn't
 contribute to deadlocks; we can consider it nonexistent when searching for
 deadlocks involving it. And the global fallback rwlock doesn't lead to
 deadlocks because it is read

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-26 Thread Lai Jiangshan
On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 A question: How did you find out the such usages of
 preempt_disable() and convert them? did all are converted?


 Well, I scanned through the source tree for usages which implicitly
 disabled CPU offline and converted them over.

How do you scan? Could you show how you scan the source tree?
I can follow your instructions to double-check.

 Its not limited to uses
 of preempt_disable() alone - even spin_locks, rwlocks, local_irq_disable()
 etc also help disable CPU offline. So I tried to dig out all such uses
 and converted them. However, since the merge window is open, a lot of
 new code is flowing into the tree. So I'll have to rescan the tree to
 see if there are any more places to convert.

I remember some code has an assumption like this:
preempt_disable() (or something else)
// the code assumes that cpu_online_map can't change here
preempt_enable()

It is very hard to find all such assumptions and fix them.
(I notice your code mainly fixes code around send_())



 And I think the lock is too complex and reinvent the wheel, why don't
 you reuse the lglock?

 lglocks? No way! ;-) See below...

 I wrote an untested draft here.

 Thanks,
 Lai

 PS: Some HA tools(I'm writing one) which takes checkpoints of
 virtual-machines frequently, I guess this patchset can speedup the
 tools.

 From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
 From: Lai Jiangshan la...@cn.fujitsu.com
 Date: Mon, 25 Feb 2013 23:14:27 +0800
 Subject: [PATCH] lglock: add read-preference local-global rwlock

 locality via lglock(trylock)
 read-preference read-write-lock via fallback rwlock_t

 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
  include/linux/lglock.h |   31 +++
  kernel/lglock.c|   45 +
  2 files changed, 76 insertions(+), 0 deletions(-)

 diff --git a/include/linux/lglock.h b/include/linux/lglock.h
 index 0d24e93..30fe887 100644
 --- a/include/linux/lglock.h
 +++ b/include/linux/lglock.h
 @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
  void lg_global_lock(struct lglock *lg);
  void lg_global_unlock(struct lglock *lg);

 +struct lgrwlock {
 +	unsigned long __percpu *fallback_reader_refcnt;
 +	struct lglock lglock;
 +	rwlock_t fallback_rwlock;
 +};
 +
 +#define DEFINE_LGRWLOCK(name)						\
 +	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
 +	= __ARCH_SPIN_LOCK_UNLOCKED;					\
 +	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);		\
 +	struct lgrwlock name = {					\
 +		.fallback_reader_refcnt = &name ## _refcnt,		\
 +		.lglock = { .lock = &name ## _lock } }
 +
 +#define DEFINE_STATIC_LGRWLOCK(name)					\
 +	static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
 +	= __ARCH_SPIN_LOCK_UNLOCKED;					\
 +	static DEFINE_PER_CPU(unsigned long, name ## _refcnt);		\
 +	static struct lgrwlock name = {					\
 +		.fallback_reader_refcnt = &name ## _refcnt,		\
 +		.lglock = { .lock = &name ## _lock } }
 +
 +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
 +{
 +	lg_lock_init(&lgrw->lglock, name);
 +}
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
  #endif
 diff --git a/kernel/lglock.c b/kernel/lglock.c
 index 6535a66..463543a 100644
 --- a/kernel/lglock.c
 +++ b/kernel/lglock.c
 @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
   preempt_enable();
  }
  EXPORT_SYMBOL(lg_global_unlock);
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 +	struct lglock *lg = &lgrw->lglock;
 +
 +	preempt_disable();
 +	if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 +		if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
 +			rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 +			return;
 +		}
 +		read_lock(&lgrw->fallback_rwlock);
 +	}
 +
 +	__this_cpu_inc(*lgrw->fallback_reader_refcnt);
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
 +
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 +	if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 +		lg_local_unlock(&lgrw->lglock);
 +		return

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-26 Thread Lai Jiangshan
On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:
 Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
 lock-ordering related problems (unlike per-cpu locks). However, global
 rwlocks lead to unnecessary cache-line bouncing even when there are no
 writers present, which can slow down the system needlessly.

per-CPU rwlocks (yours and mine) are exactly the same as rwlock_t from
the point of view of lock dependencies (except that a reader critical
section can be nested inside a writer critical section),
so they can deadlock in this order:

spin_lock(some_lock);   percpu_write_lock_irqsave()
case CPU_DYING
percpu_read_lock_irqsafe(); ---deadlock---spin_lock(some_lock);

Lockdep can find such dependencies, but we must try our best to find
them before merging the patchset into mainline. We can review all the
code in cpu_disable() and the CPU_DYING notifiers and fix these kinds of
lock dependencies, but that is not easy; it may be a long-term project.

==
And if there is any CPU_DYING code that takes no locks and does some
work (because it knows it is called via stop_machine()), we need to
add the locking code back, if such code exists. (I don't know whether
such code exists or not.)


 Per-cpu counters can help solve the cache-line bouncing problem. So we
 actually use the best of both: per-cpu counters (no-waiting) at the reader
 side in the fast-path, and global rwlocks in the slowpath.

 [ Fastpath = no writer is active; Slowpath = a writer is active ]

 IOW, the readers just increment/decrement their per-cpu refcounts (disabling
 interrupts during the updates, if necessary) when no writer is active.
 When a writer becomes active, he signals all readers to switch to global
 rwlocks for the duration of his activity. The readers switch over when it
 is safe for them (ie., when they are about to start a fresh, non-nested
 read-side critical section) and start using (holding) the global rwlock for
 read in their subsequent critical sections.

 The writer waits for every existing reader to switch, and then acquires the
 global rwlock for write and enters his critical section. Later, the writer
 signals all readers that he is done, and that they can go back to using their
 per-cpu refcounts again.

 Note that the lock-safety (despite the per-cpu scheme) comes from the fact
 that the readers can *choose* _when_ to switch to rwlocks upon the writer's
 signal. And the readers don't wait on anybody based on the per-cpu counters.
 The only true synchronization that involves waiting at the reader-side in this
 scheme, is the one arising from the global rwlock, which is safe from circular
 locking dependency issues.

 Reader-writer locks and per-cpu counters are recursive, so they can be
 used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
 recursive. Also, this design of switching the synchronization scheme ensures
 that you can safely nest and use these locks in a very flexible manner.

 I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
 suggestions and ideas, which inspired and influenced many of the decisions in
 this as well as previous designs. Thanks a lot Michael and Xiao!

 Cc: David Howells dhowe...@redhat.com
 Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
 ---

  lib/percpu-rwlock.c |  139 ++-
  1 file changed, 137 insertions(+), 2 deletions(-)

 diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
 index f938096..edefdea 100644
 --- a/lib/percpu-rwlock.c
 +++ b/lib/percpu-rwlock.c
 @@ -27,6 +27,24 @@
  #include <linux/percpu-rwlock.h>
  #include <linux/errno.h>
 
 +#include <asm/processor.h>
 +
 +
 +#define reader_yet_to_switch(pcpu_rwlock, cpu)				\
 +	(ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, cpu)->reader_refcnt))
 +
 +#define reader_percpu_nesting_depth(pcpu_rwlock)			\
 +	(__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
 +
 +#define reader_uses_percpu_refcnt(pcpu_rwlock)				\
 +	reader_percpu_nesting_depth(pcpu_rwlock)
 +
 +#define reader_nested_percpu(pcpu_rwlock)				\
 +	(reader_percpu_nesting_depth(pcpu_rwlock) > 1)
 +
 +#define writer_active(pcpu_rwlock)					\
 +	(__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))
 +

  int __percpu_init_rwlock(struct percpu_rwlock *pcpu_rwlock,
  const char *name, struct lock_class_key *rwlock_key)
 @@ -55,21 +73,138 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)

  void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
  {
 -	read_lock(&pcpu_rwlock->global_rwlock);
 +   preempt_disable();
 +
 +   /*
 +* Let the writer know that a reader is active, even before we choose
 +* our reader-side 

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-26 Thread Lai Jiangshan
On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:

 Hi Lai,

 I'm really not convinced that piggy-backing on lglocks would help
 us in any way. But still, let me try to address some of the points
 you raised...

 On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 [...]

 Unfortunately, I see quite a few issues with the code above. IIUC, the
 writer and the reader both increment the same counters. So how will the
 unlock() code in the reader path know when to unlock which of the locks?

 The same as your code, the reader(which nested in write C.S.) just dec
 the counters.

 And that works fine in my case because the writer and the reader update
 _two_ _different_ counters.

I can't find any magic in your code, they are the same counter.

/*
 * It is desirable to allow the writer to acquire the percpu-rwlock
 * for read (if necessary), without deadlocking or getting complaints
 * from lockdep. To achieve that, just increment the reader_refcnt of
 * this CPU - that way, any attempt by the writer to acquire the
 * percpu-rwlock for read, will get treated as a case of nested percpu
 * reader, which is safe, from a locking perspective.
 */
this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);


 If both of them update the same counter, there
 will be a semantic clash - an increment of the counter can either mean that
 a new writer became active, or it can also indicate a nested reader. A 
 decrement
 can also similarly have 2 meanings. And thus it will be difficult to decide
 the right action to take, based on the value of the counter.


 (The counter-dropping-to-zero logic is not safe, since it can be updated
 due to different reasons). And now that I look at it again, in the absence
 of the writer, the reader is allowed to be recursive at the heavy cost of
 taking the global rwlock for read, every 2nd time you nest (because the
 spinlock is non-recursive).

 (I did not understand your comments of this part)
 nested reader is considered seldom.

 No, nested readers can be _quite_ frequent. Because, potentially all users
 of preempt_disable() are readers - and its well-known how frequently we
 nest preempt_disable(). As a simple example, any atomic reader who calls
 smp_call_function() will become a nested reader, because smp_call_function()
 itself is a reader. So reader nesting is expected to be quite frequent.

 But if N(>=2) nested readers happen,
 the overhead is:
 1 spin_try_lock() + 1 read_lock() + (N-1) __this_cpu_inc()


 In my patch, its just this_cpu_inc(). Note that these are _very_ hot paths.
 So every bit of optimization that you can add is worthwhile.

 And your read_lock() is a _global_ lock - thus, it can lead to a lot of
 cache-line bouncing. That's *exactly* why I have used per-cpu refcounts in
 my synchronization scheme, to avoid taking the global rwlock as much as 
 possible.

 Another important point to note is that, the overhead we are talking about
 here, exists even when _not_ performing hotplug. And its the replacement to
 the super-fast preempt_disable(). So its extremely important to consciously
 minimize this overhead - else we'll end up slowing down the system 
 significantly.


All I considered is that nested readers are seldom, so I always
fall back to the rwlock when nested.
If you like, I can add 6 lines of code; the overhead is then
1 spin_try_lock() (fast path) + N __this_cpu_inc().

The overhead of your code is
2 smp_mb() + N __this_cpu_inc().

I don't see much difference.

 Also, this lg_rwlock implementation uses 3
 different data-structures - a per-cpu spinlock, a global rwlock and
 a per-cpu refcnt, and its not immediately apparent why you need those many
 or even those many varieties.

 data-structures is the same as yours.
 fallback_reader_refcnt -- reader_refcnt

 This has semantic problems, as noted above.

 per-cpu spinlock -- write_signal

 Acquire/release of (spin) lock is costlier than inc/dec of a counter, IIUC.

 fallback_rwlock  --- global_rwlock

 Also I see that this doesn't handle the
 case of interrupt-handlers also being readers.

 handled. nested reader will see the ref or take the fallback_rwlock


Sorry, I meant: a _reentrant_ read_lock() will see the ref or take the fallback_rwlock


 I'm not referring to simple nested readers here, but interrupt handlers who
 can act as readers. For starters, the arch_spin_trylock() is not safe when
 interrupt handlers can also run the same code, right? You'll need to save
 and restore interrupts at critical points in the code. Also, the __foo()
 variants used to read

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-26 Thread Lai Jiangshan
On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:

 Hi Lai,

 I'm really not convinced that piggy-backing on lglocks would help
 us in any way. But still, let me try to address some of the points
 you raised...

 On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 [...]

 Unfortunately, I see quite a few issues with the code above. IIUC, the
 writer and the reader both increment the same counters. So how will the
 unlock() code in the reader path know when to unlock which of the locks?

 The same as your code, the reader(which nested in write C.S.) just dec
 the counters.

 And that works fine in my case because the writer and the reader update
 _two_ _different_ counters.

 I can't find any magic in your code, they are the same counter.

 /*
  * It is desirable to allow the writer to acquire the percpu-rwlock
  * for read (if necessary), without deadlocking or getting complaints
  * from lockdep. To achieve that, just increment the reader_refcnt of
  * this CPU - that way, any attempt by the writer to acquire the
  * percpu-rwlock for read, will get treated as a case of nested 
 percpu
  * reader, which is safe, from a locking perspective.
  */
  this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);


 Whoa! Hold on, were you really referring to _this_ increment when you said
 that, in your patch you would increment the refcnt at the writer? Then I guess
 there is a major disconnect in our conversations. (I had assumed that you were
 referring to the update of writer_signal, and were just trying to have a 
 single
 refcnt instead of reader_refcnt and writer_signal).

https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d

Sorry the name fallback_reader_refcnt misled you.


 So, please let me clarify things a bit here. Forget about the above increment
 of reader_refcnt at the writer side. Its almost utterly insignificant for our
 current discussion. We can simply replace it with a check as shown below, at
 the reader side:

 void percpu_read_lock_irqsafe()
 {
 if (current == active_writer)
 return;

 /* Rest of the code */
 }

 Now, assuming that, in your patch, you were trying to use the per-cpu refcnt
 to allow the writer to safely take the reader path, you can simply get rid of
 that percpu-refcnt, as demonstrated above.

 So that would reduce your code to the following (after simplification):

 lg_rwlock_local_read_lock()
 {
 if (current == active_writer)
 return;
 if (arch_spin_trylock(per-cpu-spinlock))
 return;
 read_lock(global-rwlock);
 }

 Now, let us assume that hotplug is not happening, meaning, nobody is running
 the writer side code. Now let's see what happens at the reader side in your
 patch. As I mentioned earlier, the readers are _very_ frequent and can be in
 very hot paths. And they also happen to be nested quite often.

 So, a non-nested reader acquires the per-cpu spinlock. Every subsequent nested
 reader on that CPU has to acquire the global rwlock for read. Right there you
 have 2 significant performance issues -
 1. Acquiring the (spin) lock is costly
 2. Acquiring the global rwlock causes cacheline bouncing, which hurts
performance.

 And why do we care so much about performance here? Because, the existing
 kernel does an efficient preempt_disable() here - which is an optimized
 per-cpu counter increment. Replacing that with such heavy primitives on the
 reader side can be very bad.

 Now, how does my patchset tackle this? My scheme just requires an increment
 of a per-cpu refcnt (reader_refcnt) and memory barrier. Which is acceptable
 from a performance perspective, because IMHO it's not horrendously worse than
 a preempt_disable().
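
For comparison, the fast path being described boils down to something like the sketch
below. This is not the actual patch: the rw_state/reader_refcnt names are borrowed from
the snippet quoted earlier in this mail, and the writer-present slow path (falling back
to the global rwlock) is deliberately omitted.

	static void percpu_read_lock_fastpath_sketch(struct percpu_rwlock *pcpu_rwlock)
	{
		preempt_disable();

		/* Mark this CPU as having an active reader... */
		this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);

		/* ...and order that update against the subsequent check for a writer. */
		smp_mb();

		/* If a writer has signalled itself, the real code takes the global rwlock. */
	}

So in the common no-writer case the extra cost over preempt_disable() is essentially one
per-cpu increment plus the full barrier.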


 If both of them update the same counter, there
 will be a semantic clash - an increment of the counter can either mean that
 a new writer became active, or it can also indicate a nested reader. A decrement
 can also similarly have 2 meanings. And thus it will be difficult to decide
 the right action to take, based on the value of the counter.


 (The counter-dropping-to-zero logic is not safe, since it can be updated
 due to different reasons). And now that I look at it again, in the absence
 of the writer, the reader is allowed to be recursive at the heavy cost of
 taking the global rwlock for read, every 2nd time you nest (because

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-25 Thread Lai Jiangshan
Hi, Srivatsa,

The target of the whole patchset is nice for me.
A question: how did you find such usages of
preempt_disable() and convert them? Were all of them converted?

And I think the lock is too complex and reinvents the wheel; why don't
you reuse the lglock?
I wrote an untested draft here.

Thanks,
Lai

PS: Some HA tools (I'm writing one) take checkpoints of
virtual machines frequently; I guess this patchset can speed up
such tools.

From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan la...@cn.fujitsu.com
Date: Mon, 25 Feb 2013 23:14:27 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

locality via lglock(trylock)
read-preference read-write-lock via fallback rwlock_t

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
 include/linux/lglock.h |   31 +++
 kernel/lglock.c|   45 +
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e93..30fe887 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);

+struct lgrwlock {
+   unsigned long __percpu *fallback_reader_refcnt;
+   struct lglock lglock;
+   rwlock_t fallback_rwlock;
+};
+
+#define DEFINE_LGRWLOCK(name)  \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;\
+   static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
+   struct lgrwlock name = {\
+   .fallback_reader_refcnt = &name ## _refcnt,  \
+   .lglock = { .lock = &name ## _lock } }
+
+#define DEFINE_STATIC_LGRWLOCK(name)   \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;\
+   static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
+   static struct lgrwlock name = { \
+   .fallback_reader_refcnt = &name ## _refcnt,  \
+   .lglock = { .lock = &name ## _lock } }
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+   lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 6535a66..463543a 100644
--- a/kernel/lglock.c
+++ b/kernel/lglock.c
@@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
preempt_enable();
 }
 EXPORT_SYMBOL(lg_global_unlock);
+
+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+   struct lglock *lg = &lgrw->lglock;
+
+   preempt_disable();
+   if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
+           if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
+                   rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+                   return;
+           }
+           read_lock(&lgrw->fallback_rwlock);
+   }
+
+   __this_cpu_inc(*lgrw->fallback_reader_refcnt);
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+   if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
+           lg_local_unlock(&lgrw->lglock);
+           return;
+   }
+
+   if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
+           read_unlock(&lgrw->fallback_rwlock);
+
+   preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+   lg_global_lock(&lgrw->lglock);
+   write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+   write_unlock(&lgrw->fallback_rwlock);
+   lg_global_unlock(&lgrw->lglock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);
-- 
1.7.7.6
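
For context, a hedged usage sketch of the API proposed above (this example is not part of
the patch; the lock name and the callers are made up for illustration):

	DEFINE_LGRWLOCK(hotplug_lgrw);

	static void reader_side_example(void)
	{
		lg_rwlock_local_read_lock(&hotplug_lgrw);	/* per-cpu fast path */
		/* ... read data that must stay stable, e.g. the set of online CPUs ... */
		lg_rwlock_local_read_unlock(&hotplug_lgrw);
	}

	static void writer_side_example(void)
	{
		lg_rwlock_global_write_lock(&hotplug_lgrw);	/* excludes readers on all CPUs */
		/* ... perform the rare global update ... */
		lg_rwlock_global_write_unlock(&hotplug_lgrw);
	}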


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-25 Thread Lai Jiangshan
On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 A question: How did you find out the such usages of
 preempt_disable() and convert them? did all are converted?


 Well, I scanned through the source tree for usages which implicitly
 disabled CPU offline and converted them over. It's not limited to uses
 of preempt_disable() alone - even spin_locks, rwlocks, local_irq_disable()
 etc also help disable CPU offline. So I tried to dig out all such uses
 and converted them. However, since the merge window is open, a lot of
 new code is flowing into the tree. So I'll have to rescan the tree to
 see if there are any more places to convert.

 And I think the lock is too complex and reinvent the wheel, why don't
 you reuse the lglock?

 lglocks? No way! ;-) See below...

 I wrote an untested draft here.

 Thanks,
 Lai

 PS: Some HA tools(I'm writing one) which takes checkpoints of
 virtual-machines frequently, I guess this patchset can speedup the
 tools.

 From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
 From: Lai Jiangshan la...@cn.fujitsu.com
 Date: Mon, 25 Feb 2013 23:14:27 +0800
 Subject: [PATCH] lglock: add read-preference local-global rwlock

 locality via lglock(trylock)
 read-preference read-write-lock via fallback rwlock_t

 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
  include/linux/lglock.h |   31 +++
  kernel/lglock.c|   45 +
  2 files changed, 76 insertions(+), 0 deletions(-)

 diff --git a/include/linux/lglock.h b/include/linux/lglock.h
 index 0d24e93..30fe887 100644
 --- a/include/linux/lglock.h
 +++ b/include/linux/lglock.h
 @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
  void lg_global_lock(struct lglock *lg);
  void lg_global_unlock(struct lglock *lg);

 +struct lgrwlock {
 + unsigned long __percpu *fallback_reader_refcnt;
 + struct lglock lglock;
 + rwlock_t fallback_rwlock;
 +};
 +
 +#define DEFINE_LGRWLOCK(name)   \
 + static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
 + = __ARCH_SPIN_LOCK_UNLOCKED;\
 + static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
 + struct lgrwlock name = {\
 + .fallback_reader_refcnt = &name ## _refcnt, \
 + .lglock = { .lock = &name ## _lock } }
 +
 +#define DEFINE_STATIC_LGRWLOCK(name) \
 + static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
 + = __ARCH_SPIN_LOCK_UNLOCKED;\
 + static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
 + static struct lgrwlock name = { \
 + .fallback_reader_refcnt = &name ## _refcnt, \
 + .lglock = { .lock = &name ## _lock } }
 +
 +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
 +{
 + lg_lock_init(&lgrw->lglock, name);
 +}
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
  #endif
 diff --git a/kernel/lglock.c b/kernel/lglock.c
 index 6535a66..463543a 100644
 --- a/kernel/lglock.c
 +++ b/kernel/lglock.c
 @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
   preempt_enable();
  }
  EXPORT_SYMBOL(lg_global_unlock);
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 + struct lglock *lg = &lgrw->lglock;
 +
 + preempt_disable();
 + if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 + if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
 + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 + return;
 + }
 + read_lock(&lgrw->fallback_rwlock);
 + }
 +
 + __this_cpu_inc(*lgrw->fallback_reader_refcnt);
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
 +
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 + if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 + lg_local_unlock(&lgrw->lglock);
 + return;
 + }
 +
 + if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
 + read_unlock(&lgrw->fallback_rwlock);
 +
 + preempt_enable();
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
 +

 If I read the code above correctly, all you are doing is implementing a
 recursive reader-side primitive (ie., allowing the reader to call these
 functions recursively, without resulting in a self-deadlock).

 But the thing is, making the reader-side

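To make the recursion behaviour concrete, here is a hypothetical nesting sequence on one
CPU, inferred from the draft code quoted above (untested; the comments describe what each
call would do given that code):

	static void nested_reader_example(struct lgrwlock *lgrw)
	{
		lg_rwlock_local_read_lock(lgrw);   /* refcnt == 0: arch_spin_trylock() succeeds,
		                                    * this CPU's spinlock is now held            */
		lg_rwlock_local_read_lock(lgrw);   /* trylock on the already-held per-cpu lock
		                                    * fails: read_lock(&fallback_rwlock), refcnt = 1 */
		lg_rwlock_local_read_unlock(lgrw); /* refcnt 1 -> 0: read_unlock(&fallback_rwlock)   */
		lg_rwlock_local_read_unlock(lgrw); /* refcnt == 0: releases this CPU's spinlock      */
	}
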
Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-25 Thread Lai Jiangshan
On Tue, Feb 26, 2013 at 8:17 AM, Lai Jiangshan eag0...@gmail.com wrote:
 On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
 srivatsa.b...@linux.vnet.ibm.com wrote:
 Hi Lai,

 On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
 Hi, Srivatsa,

 The target of the whole patchset is nice for me.

 Cool! Thanks :-)

 A question: How did you find out the such usages of
 preempt_disable() and convert them? did all are converted?


 Well, I scanned through the source tree for usages which implicitly
 disabled CPU offline and converted them over. It's not limited to uses
 of preempt_disable() alone - even spin_locks, rwlocks, local_irq_disable()
 etc also help disable CPU offline. So I tried to dig out all such uses
 and converted them. However, since the merge window is open, a lot of
 new code is flowing into the tree. So I'll have to rescan the tree to
 see if there are any more places to convert.

 And I think the lock is too complex and reinvent the wheel, why don't
 you reuse the lglock?

 lglocks? No way! ;-) See below...

 I wrote an untested draft here.

 Thanks,
 Lai

 PS: Some HA tools(I'm writing one) which takes checkpoints of
 virtual-machines frequently, I guess this patchset can speedup the
 tools.

 From 01db542693a1b7fc6f9ece45d57cb529d9be5b66 Mon Sep 17 00:00:00 2001
 From: Lai Jiangshan la...@cn.fujitsu.com
 Date: Mon, 25 Feb 2013 23:14:27 +0800
 Subject: [PATCH] lglock: add read-preference local-global rwlock

 locality via lglock(trylock)
 read-preference read-write-lock via fallback rwlock_t

 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
  include/linux/lglock.h |   31 +++
  kernel/lglock.c|   45 +
  2 files changed, 76 insertions(+), 0 deletions(-)

 diff --git a/include/linux/lglock.h b/include/linux/lglock.h
 index 0d24e93..30fe887 100644
 --- a/include/linux/lglock.h
 +++ b/include/linux/lglock.h
 @@ -67,4 +67,35 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
  void lg_global_lock(struct lglock *lg);
  void lg_global_unlock(struct lglock *lg);

 +struct lgrwlock {
 + unsigned long __percpu *fallback_reader_refcnt;
 + struct lglock lglock;
 + rwlock_t fallback_rwlock;
 +};
 +
 +#define DEFINE_LGRWLOCK(name)   \
 + static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
 + = __ARCH_SPIN_LOCK_UNLOCKED;\
 + static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
 + struct lgrwlock name = {\
 + .fallback_reader_refcnt = &name ## _refcnt, \
 + .lglock = { .lock = &name ## _lock } }
 +
 +#define DEFINE_STATIC_LGRWLOCK(name) \
 + static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
 + = __ARCH_SPIN_LOCK_UNLOCKED;\
 + static DEFINE_PER_CPU(unsigned long, name ## _refcnt);  \
 + static struct lgrwlock name = { \
 + .fallback_reader_refcnt = &name ## _refcnt, \
 + .lglock = { .lock = &name ## _lock } }
 +
 +static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
 +{
 + lg_lock_init(&lgrw->lglock, name);
 +}
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_lock(struct lgrwlock *lgrw);
 +void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw);
  #endif
 diff --git a/kernel/lglock.c b/kernel/lglock.c
 index 6535a66..463543a 100644
 --- a/kernel/lglock.c
 +++ b/kernel/lglock.c
 @@ -87,3 +87,48 @@ void lg_global_unlock(struct lglock *lg)
   preempt_enable();
  }
  EXPORT_SYMBOL(lg_global_unlock);
 +
 +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
 +{
 + struct lglock *lg = &lgrw->lglock;
 +
 + preempt_disable();
 + if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 + if (likely(arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
 + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
 + return;
 + }
 + read_lock(&lgrw->fallback_rwlock);
 + }
 +
 + __this_cpu_inc(*lgrw->fallback_reader_refcnt);
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
 +
 +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
 +{
 + if (likely(!__this_cpu_read(*lgrw->fallback_reader_refcnt))) {
 + lg_local_unlock(&lgrw->lglock);
 + return;
 + }
 +
 + if (!__this_cpu_dec_return(*lgrw->fallback_reader_refcnt))
 + read_unlock(&lgrw->fallback_rwlock);
 +
 + preempt_enable();
 +}
 +EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
 +

 If I read the code above correctly, all you are doing is implementing a
 recursive reader-side primitive (ie., allowing the reader to call these
 functions recursively, without