Re: [RFC 01/11] sched: introduce sys_cpumask in tsk to adapt asymmetric system

2014-11-17 Thread Liu ping fan
On Wed, Nov 12, 2014 at 5:22 PM, Srikar Dronamraju
sri...@linux.vnet.ibm.com wrote:
 * kernelf...@gmail.com kernelf...@gmail.com [2014-10-16 15:29:50]:

 On some systems such as powerpc, some tasks (vcpu threads) can only run on
 a dedicated cpu, since we adopt an asymmetric scheme to monitor the
 whole physical cpu (powerKVM only allows the primary hwthread to
 set up the runtime env for the secondaries when entering the guest).

 Nowadays, powerKVM runs with all the secondary hwthreads offline to ensure
 that the vcpu threads only run on the primary thread. But we plan to keep all
 cpus online when running powerKVM to give more power when switching back
 to the host, so introduce a sys_allowed cpumask to reflect the cpuset on
 which the vcpu thread can run.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/linux/init_task.h |  1 +
  include/linux/sched.h |  6 ++
  kernel/sched/core.c   | 10 --
  3 files changed, 15 insertions(+), 2 deletions(-)

 diff --git a/include/linux/init_task.h b/include/linux/init_task.h
 index 2bb4c4f3..c56f69e 100644
 --- a/include/linux/init_task.h
 +++ b/include/linux/init_task.h
 @@ -172,6 +172,7 @@ extern struct task_group root_task_group;
   .normal_prio= MAX_PRIO-20,  \
   .policy = SCHED_NORMAL, \
   .cpus_allowed   = CPU_MASK_ALL, \
 + .sys_allowed = CPU_MASK_ALL,\

 Do we really need another mask, can't we just use cpus_allowed itself?

I think it is not easy to fold the two requests, the chip's inherent
constraint and the user's configuration, into one mask.
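
To make the composition concrete, a minimal sketch of how the two
requests would combine under this patch (illustrative only; vcpu_task
and primary_mask are hypothetical names):

    /* chip-level constraint, recorded once for the vcpu task
     * (hypothetical caller; not part of this patch) */
    set_cpus_sys_allowed(vcpu_task, primary_mask);

    /* a later sched_setaffinity() from userspace funnels into
     * do_set_cpus_allowed(), where the effective mask becomes
     * user_mask & sys_allowed, so neither request clobbers the other */
    do_set_cpus_allowed(vcpu_task, user_mask);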

   .nr_cpus_allowed= NR_CPUS,  \
   .mm = NULL, \
   .active_mm  = init_mm, \
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 5c2c885..ce429f3 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1260,7 +1260,10 @@ struct task_struct {

   unsigned int policy;
   int nr_cpus_allowed;
 + /* AND of the user-supplied mask and sys_allowed */
   cpumask_t cpus_allowed;
 + /* due to asymmetric hwthreads, some tasks can only run on certain cpus */
 + cpumask_t sys_allowed;

  #ifdef CONFIG_PREEMPT_RCU
   int rcu_read_lock_nesting;
 @@ -2030,6 +2033,9 @@ static inline void tsk_restore_flags(struct 
 task_struct *task,
  }

  #ifdef CONFIG_SMP
 +extern void set_cpus_sys_allowed(struct task_struct *p,
 + const struct cpumask *new_mask);
 +
  extern void do_set_cpus_allowed(struct task_struct *p,
  const struct cpumask *new_mask);

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index ec1a286..2cd1ae3 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4596,13 +4596,19 @@ void init_idle(struct task_struct *idle, int cpu)
  }

  #ifdef CONFIG_SMP
 +void set_cpus_sys_allowed(struct task_struct *p,
 + const struct cpumask *new_mask)
 +{
 + cpumask_copy(&p->sys_allowed, new_mask);
 +}
 +

 This function doesn't seem to be used anywhere... Not sure why it is
 introduced.

I didn't split the patches well :(  It is used later in the series.

Thx,
Fan

 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
  {
   if (p->sched_class && p->sched_class->set_cpus_allowed)
   p->sched_class->set_cpus_allowed(p, new_mask);

 - cpumask_copy(&p->cpus_allowed, new_mask);
 - p->nr_cpus_allowed = cpumask_weight(new_mask);
 + cpumask_and(&p->cpus_allowed, &p->sys_allowed, new_mask);
 + p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
  }

  /*
 --
 1.8.3.1



 --
 Thanks and Regards
 Srikar Dronamraju



Re: [RFC 03/11] powerpc: kvm: add interface to control kvm function on a core

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 12:04 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 Hi Liu,

 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 When kvm is enabled on a core, we migrate all external irqs to the primary
 thread, since currently the kvm irq logic is handled by the primary
 hwthread.

 Todo: this patch lacks re-enabling of irq balancing when kvm is disabled on
 the core.

 Why is a sysfs file introduced to trigger irq migration? Why is it not
 done during kvm module insert ? And similarly spread interrupts when the
 module is removed? Isn't this a saner way ?

Consider the scenario: with coreA and coreB, we want to enable KVM on coreA
while keeping coreB unchanged.
In fact, I am trying to achieve something asymmetric on the platform. Do
you think that is a justification?
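
(For reference, the knob added below would be driven roughly as
echo 1 > /sys/devices/system/cpu/kvm_enable, assuming the
device_create_file() call on cpu_subsys.dev_root places the attribute
under /sys/devices/system/cpu; per store_kvm_enable(), this migrates
irqs away from core 1's secondaries while leaving the other cores
untouched.)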

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/sysfs.c| 39 
 ++
  arch/powerpc/sysdev/xics/xics-common.c | 12 +++
  2 files changed, 51 insertions(+)

 diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
 index 67fd2fd..a2595dd 100644
 --- a/arch/powerpc/kernel/sysfs.c
 +++ b/arch/powerpc/kernel/sysfs.c
 @@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
   if (cpu_has_feature(CPU_FTR_DSCR))
   err = device_create_file(cpu_subsys.dev_root, 
 dev_attr_dscr_default);
  }
 +
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 +#define NR_CORES (CONFIG_NR_CPUS/threads_per_core)
 +static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly;
 +
 +static ssize_t show_kvm_enable(struct device *dev,
 + struct device_attribute *attr, char *buf)
 +{
 +}
 +
 +static ssize_t __used store_kvm_enable(struct device *dev,
 + struct device_attribute *attr, const char *buf,
 + size_t count)
 +{
 + struct cpumask stop_cpus;
 + unsigned long core, thr;
 +
 + sscanf(buf, "%lx", &core);
 + if (core >= NR_CORES)
 + return -1;
 + if (!test_bit(core, kvm_on_core))
 + for (thr = 1; thr < threads_per_core; thr++)
 + if (cpu_online(thr * threads_per_core + thr))
 + cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);

 What is the above logic trying to do? Did you mean
 cpu_online(threads_per_core * core + thr) ?

Yeah. My mistake, it should be cpumask_set_cpu(core * threads_per_core +
thr, &stop_cpus).
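
(Concretely, with threads_per_core = 8 and core = 1, the corrected
expression core * threads_per_core + thr covers the secondaries, cpus
9..15, whereas the original thr * threads_per_core + thr would walk
cpus 9, 18, 27, ... regardless of which core was requested.)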

 +
 + stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
 + set_bit(core, kvm_on_core);
 + return count;
 +}
 +
 +static DEVICE_ATTR(kvm_enable, 0600,
 + show_kvm_enable, store_kvm_enable);
 +
 +static void sysfs_create_kvm_enable(void)
 +{
 + device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable);
 +}
 +#endif
 +
  #endif /* CONFIG_PPC64 */

  #ifdef HAS_PPC_PMC_PA6T
 diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
 b/arch/powerpc/sysdev/xics/xics-common.c
 index fe0cca4..68b33d8 100644
 --- a/arch/powerpc/sysdev/xics/xics-common.c
 +++ b/arch/powerpc/sysdev/xics/xics-common.c
 @@ -258,6 +258,18 @@ unlock:
   raw_spin_unlock_irqrestore(&desc->lock, flags);
   }
  }
 +
 +int xics_migrate_irqs_away_secondary(void *data)
 +{
 + int cpu = smp_processor_id();
 + if (cpu % threads_per_core != 0) {
 + WARN(condition, format...);
 + return 0;
 + }
 + /* In fact, it would be better if we could migrate the primary's irqs away too */
 + ();

 Isn't the aim of the patch to migrate irqs away from the secondary onto
 the primary? But from above it looks like we are returning when we find
 out that we are secondary threads, isn't it?

Yes, will fix in next version.

 + return 0;
 +}
  #endif /* CONFIG_HOTPLUG_CPU */

 Note that xics_migrate_irqs_away() is defined under CONFIG_CPU_HOTPLUG.
 But we will need this option on PowerKVM even when hotplug is not
 configured in.

Yes, will fix the dependency in next version

Thx,
Fan

 Regards
 Preeti U Murthy
  #ifdef CONFIG_SMP




Re: [RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 12:45 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 (This patch is a place holder.)

 If only one vcpu thread is ready (the other vcpu threads can
 wait for it to execute), the primary thread can enter tickless mode,

 We do not configure NOHZ_FULL to y by default. Hence no thread would
 enter tickless mode.

But NOHZ_FULL can be chosen by the user, and we should survive it :)

 which causes the primary to keep running, so the secondaries have no
 opportunity to exit to the host, even if they have other tasks on them.

 The secondary threads can still get scheduling ticks. The decrementer of
 the secondary threads is still active. So as long as secondary threads
 are busy, scheduling ticks will fire and try to schedule a new task on them.

No. In my original design, after enabling KVM on a core, the HDEC on the
secondaries is disabled; otherwise host exits would be too frequent.
Any suggestion?

Thx,
Fan

 Regards
 Preeti U Murthy

 Introduce a kthread (anti_tickless) on the primary, so when there is only
 one vcpu thread on the primary, the secondary can resort to anti_tickless
 to keep the primary out of tickless mode.
 (I thought that the anti_tickless thread could go to NAP, so we can let the
 secondary run).

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/sysfs.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
 index a2595dd..f0b110e 100644
 --- a/arch/powerpc/kernel/sysfs.c
 +++ b/arch/powerpc/kernel/sysfs.c
 @@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device 
 *dev,
   if (!test_bit(core, kvm_on_core))
   for (thr = 1; thr < threads_per_core; thr++)
   if (cpu_online(thr * threads_per_core + thr))
 - cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
 + cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);

   stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
 + /* fixme, create a kthread on primary hwthread to handle tickless mode */
 + //kthread_create_on_cpu(prevent_tickless, NULL, core * threads_per_core, "ppckvm_prevent_tickless");
   set_bit(core, kvm_on_core);
   return count;
  }




Re: [RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 1:32 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 Hi Liu,

 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 Nowadays, powerKVM runs with the secondary hwthreads offline. Although
 we can make all secondary hwthreads online later, we still preserve
 this behavior for a dedicated KVM env. Achieve this by setting
 paca->online to false.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/paca.h |  3 +++
  arch/powerpc/kernel/asm-offsets.c   |  3 +++
  arch/powerpc/kernel/smp.c   |  3 +++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 
  4 files changed, 21 insertions(+)

 diff --git a/arch/powerpc/include/asm/paca.h 
 b/arch/powerpc/include/asm/paca.h
 index a5139ea..67c2500 100644
 --- a/arch/powerpc/include/asm/paca.h
 +++ b/arch/powerpc/include/asm/paca.h
 @@ -84,6 +84,9 @@ struct paca_struct {
   u8 cpu_start;   /* At startup, processor spins until */
   /* this becomes non-zero. */
   u8 kexec_state; /* set when kexec down has irqs off */
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + u8 online;
 +#endif
  #ifdef CONFIG_PPC_STD_MMU_64
   struct slb_shadow *slb_shadow_ptr;
   struct dtl_entry *dispatch_log;
 diff --git a/arch/powerpc/kernel/asm-offsets.c 
 b/arch/powerpc/kernel/asm-offsets.c
 index 9d7dede..0faa8fe 100644
 --- a/arch/powerpc/kernel/asm-offsets.c
 +++ b/arch/powerpc/kernel/asm-offsets.c
 @@ -182,6 +182,9 @@ int main(void)
   DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
   DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
   DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + DEFINE(PACAONLINE, offsetof(struct paca_struct, online));
 +#endif
   DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
   DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
   DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
 index a0738af..4c3843e 100644
 --- a/arch/powerpc/kernel/smp.c
 +++ b/arch/powerpc/kernel/smp.c
 @@ -736,6 +736,9 @@ void start_secondary(void *unused)

   cpu_startup_entry(CPUHP_ONLINE);

 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + get_paca()->online = true;
 +#endif
   BUG();
  }

 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index f0c4db7..d5594b0 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -322,6 +322,13 @@ kvm_no_guest:
   li  r0, KVM_HWTHREAD_IN_NAP
   stb r0, HSTATE_HWTHREAD_STATE(r13)
  kvm_do_nap:
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + /* check whether the cpu is needed by the host or not */
 + lbz r2, PACAONLINE(r13)
 + li  r3, 0
 + cmpw    r2, r3
 + bne kvm_secondary_exit_trampoline
 +#endif
   /* Clear the runlatch bit before napping */
   mfspr   r2, SPRN_CTRLF
   clrrdi  r2, r2, 1
 @@ -340,6 +347,11 @@ kvm_do_nap:
   nap
   b   .

 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 +kvm_secondary_exit_trampoline:
 + b   .

 Uh? When we have no vcpu to run, we loop here instead of doing a nap?
 What are we achieving?

 If I understand the intention of the patch well, we are looking to
 provide a knob whereby the host can indicate if it needs the secondaries
 at all.

Yes, you catch it :)

 Today the host does boot with all threads online. There are some init
 scripts which take the secondaries down. So today the host does not have
 a say in preventing this, compile time or runtime. So let's see how we
 can switch between the two behaviors if we don't have the init script,
 which looks like a saner thing to do.

 We should set the paca->online flag to false by default. If
 KVM_PPC_ENABLE_SECONDARY is configured, we need to set this flag to
 true. So at compile time, we resolve the flag.

 While booting, we look at the flag and decide whether to get the
 secondaries online. So we get the current behavior if we have not
 configured KVM_PPC_ENABLE_SECONDARY. Will this achieve the purpose of
 this patch?

At boot time, KVM cannot run yet, so we can achieve the change of the
flag via soft cpu hotplug on/off.
I think this is a more flexible way.
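(By soft cpu hotplug I mean the standard sysfs interface, e.g.
echo 0 > /sys/devices/system/cpu/cpu9/online to take a secondary
hwthread down, and echo 1 to bring it back.)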

Thx,
Fan

 Regards
 Preeti U Murthy



Re: [RFC 09/11] powerpc: kvm: handle time base on secondary hwthread

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 2:40 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 (This is a place holder patch.)
 We need to store the host's time base on the secondary hwthread.
 Later when switching back, we need to reprogram it with the elapsed
 time.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++
  1 file changed, 6 insertions(+)

 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index 89ea16c..a817ba6 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)

   /* fixme: store other register such as msr */

 + /* fixme: store the tb, and set it as MAX, so we cease the tick on secondary */

 This can lead to serious consequences. First of all, even if we set the
 decrementer(not tb) to MAX, it is bound to fire after 4s. That is the
 maximum time till which you can keep off the decrementer from firing.
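
 (The 4s bound follows from the hardware: the decrementer is a 32-bit
 register counting down at the 512 MHz timebase frequency, so even when
 loaded with its maximum value it underflows after roughly
 2^31 / 512000000 ~= 4.2 seconds.)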

 In the hotplug path, the offline cpus nap and their decrementers are
 programmed to fire at MAX too. But the difference is that we clear the
 LPCR_PECE1 wakeup bit which prevents cpus from waking up on a
 decrementer interrupt.

 We cannot afford to do this here though because there are other tasks on
 the secondary threads' runqueue. They need to be scheduled in.
 There are also timers besides the tick_sched one, which can be queued on
 these secondary threads. These patches have not taken care to migrate
 timers or tasks before entering guest as far as I observed. Hence we
 cannot just turn off time base like this and expect to handle the above
 mentioned events the next time the primary thread decides to exit to the
 host.

Yes, that is the hard nut in this series. My plan is to compensate the
hrtimer when the secondary exits.
But as for the scheduler on the secondary, if it is stopped, what is the side-effect?

Thx,
Fan

 Regards
 Preeti U Murthy
 +
   /* prevent us to enter kernel */
   li  r0, 1
   stb r0, HSTATE_HWTHREAD_REQ(r13)
 @@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)

  /* enter with vmode */
  kvmppc_secondary_stopper_exit:
 + /* fixme: restore the tb with the orig val plus the elapsed time,
 + * so we can fire the hrtimer as soon as possible
 + */
 +
 + /* fixme, restore the stack which we stored on the lpaca */

   ld  r0, 112+PPC_LR_STKOFF(r1)




Re: [RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 2:44 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 01:00 AM, kernelf...@gmail.com wrote:
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/Kconfig | 4 
  1 file changed, 4 insertions(+)

 diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
 index 602eb51..de38566 100644
 --- a/arch/powerpc/kvm/Kconfig
 +++ b/arch/powerpc/kvm/Kconfig
 @@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV

 If unsure, say N.

 +config KVMPPC_ENABLE_SECONDARY
 + tristate "KVM support for running on secondary hwthreads in host"
 + depends on KVM_BOOK3S_64_HV

 This patch is required on top of all the rest :) The top patches won't
 compile without this one. Every patch in the patchset should be able to
 compile successfully without the aid of the patches that come after it.

I think there is a conflict here. If we do so, then we need to make an
effort to prevent the independent patches from taking effect before the
whole patchset is applied.

Thx,
Fan

 Regards
 Preeti U Murthy



Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-29 Thread Liu ping fan
On Tue, Jul 29, 2014 at 2:57 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote:
 I hope I am right. Take the following sequence as an example:

 if (hptep[0] & HPTE_V_VALID) {
 /* HPTE was previously valid, so we need to invalidate it */
 unlock_rmap(rmap);
 hptep[0] |= HPTE_V_ABSENT;
 kvmppc_invalidate_hpte(kvm, hptep, index);
 /* don't lose previous R and C bits */
 r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 } else {
 kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 }
   <- if we try_to_unmap on the
 pfn at this point, then @r contains an invalid pfn
 hptep[1] = r;
 eieio();
 hptep[0] = hpte[0];
 asm volatile("ptesync" : : : "memory");

 If that was the case we would have the same race in kvmppc_do_h_enter().

 I think the fact that the HPTE is locked will prevent the race, i.e.,
 HPTE_V_HVLOCK is set until hptep[0] is written to.

 If I look at the unmap case, my understanding is that it uses
 kvm_unmap_rmapp() which will also lock the HPTE (try_lock_hpte)
 and so shouldn't have a race vs the above code.

Yes, you are right :)
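
For the record, the locking pattern described above looks roughly like
this (a sketch for illustration, assuming the try_lock_hpte() helper;
not the verbatim kernel code):

    /* take the per-HPTE lock bit; kvm_unmap_rmapp() acquires the same
     * bit, so the two updaters cannot interleave */
    while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
            cpu_relax();

    hptep[1] = r;           /* second doubleword first */
    eieio();                /* order the two stores */
    /* the final store publishes the entry; since the stored value has
     * HPTE_V_HVLOCK clear, it also drops the lock */
    hptep[0] = hpte[0];
    asm volatile("ptesync" : : : "memory");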

Thx,
Fan


 Or do you see a race I don't?

 Cheers,
 Ben.

 Thx.
 Fan

 On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
  On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
  In the current code, the setup of the hpte risks racing with
  mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
  Resolve this issue by synchronizing the two actions via KVMPPC_RMAP_LOCK_BIT.
 
  Please describe the race you think you see. I'm quite sure both Paul and
  I went over that code and somewhat convinced ourselves that it was ok
  but it's possible that we were both wrong :-)
 
  Cheers,
  Ben.
 
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
  ---
   arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
   1 file changed, 10 insertions(+), 5 deletions(-)
 
  diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
  b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  index 8056107..e6dcff4 100644
  --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
  +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run 
  *run, struct kvm_vcpu *vcpu,
 
  if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
  - unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
  r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
  +
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
  + unlock_rmap(rmap);
} else {
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
  - hptep[1] = r;
  - eieio();
  - hptep[0] = hpte[0];
 - asm volatile("ptesync" : : : "memory");
preempt_enable();
  if (page && hpte_is_writable(r))
SetPageDirty(page);
 
 




[PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-28 Thread Liu Ping Fan
In the current code, the setup of the hpte risks racing with
mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
Resolve this issue by synchronizing the two actions via KVMPPC_RMAP_LOCK_BIT.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..e6dcff4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
-   unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
+
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
+   unlock_rmap(rmap);
} else {
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
-   hptep[1] = r;
-   eieio();
-   hptep[0] = hpte[0];
-   asm volatile("ptesync" : : : "memory");
preempt_enable();
if (page && hpte_is_writable(r))
SetPageDirty(page);
-- 
1.8.1.4



Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-28 Thread Liu ping fan
I hope I am right. Take the following sequence as an example:

if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
} else {
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
  <- if we try_to_unmap on the
pfn at this point, then @r contains an invalid pfn
hptep[1] = r;
eieio();
hptep[0] = hpte[0];
asm volatile("ptesync" : : : "memory");

Thx.
Fan

On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
 In the current code, the setup of the hpte risks racing with
 mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
 Resolve this issue by synchronizing the two actions via KVMPPC_RMAP_LOCK_BIT.

 Please describe the race you think you see. I'm quite sure both Paul and
 I went over that code and somewhat convinced ourselves that it was ok
 but it's possible that we were both wrong :-)

 Cheers,
 Ben.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..e6dcff4 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
 struct kvm_vcpu *vcpu,

   if (hptep[0] & HPTE_V_VALID) {
   /* HPTE was previously valid, so we need to invalidate it */
 - unlock_rmap(rmap);
   hptep[0] |= HPTE_V_ABSENT;
   kvmppc_invalidate_hpte(kvm, hptep, index);
   /* don't lose previous R and C bits */
   r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 +
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
 + unlock_rmap(rmap);
   } else {
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
   }

 - hptep[1] = r;
 - eieio();
 - hptep[0] = hpte[0];
 - asm volatile("ptesync" : : : "memory");
   preempt_enable();
   if (page && hpte_is_writable(r))
   SetPageDirty(page);




[PATCH v4] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-15 Thread Liu Ping Fan
Numa fault is a method which helps to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and the pages.

A pte with _PAGE_NUMA helps to implement numa faults. It means the MMU
is not allowed to access the page directly, so a page fault is triggered
and the numa fault handler gets the opportunity to run the checker.

As for MMU access, we need special handling for powernv's guests.
When we mark a pte with _PAGE_NUMA, we already call the mmu_notifier to
invalidate it in the guest's htab, but when we try to re-insert it,
we first try to map it in real mode. Only after this fails do we fall back
to virt mode and, most importantly, run the numa fault handler in virt
mode. This patch guards the real-mode path to ensure that if a pte is
marked with _PAGE_NUMA, it will NOT be mapped in real mode; instead, it will
be mapped in virt mode and have the opportunity to be checked for placement.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v4:
  more detailed description
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..8fcc363 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte) && !pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4


Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-14 Thread liu ping fan
On Mon, Apr 14, 2014 at 2:43 PM, Alexander Graf ag...@suse.de wrote:

 On 13.04.14 04:27, Liu ping fan wrote:

 On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote:

 On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com
 wrote:

  When we mark a pte with _PAGE_NUMA we already call
  mmu_notifier_invalidate_range_start
  and mmu_notifier_invalidate_range_end, which will mark the existing guest
  hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting
  new guest hpte entries.

 What happens when we don't? Why do we need the check? Why isn't it done
 implicitly? What happens when we treat a NUMA marked page as non-present?
 Why does it work out for us?

 Assume you have no idea what PAGE_NUMA is, but try to figure out what
 this patch does and whether you need to cherry-pick it into your downstream
  kernel. The description as it stands is still not very helpful for that. It doesn't
 even explain what really changes with this patch applied.

  Yeah. What about appending the following description? Can it make
  the context clear?
  The guest should not set up an hpte for a page whose pte is marked with
  _PAGE_NUMA, so that on the host, the numa-fault mechanism can take effect
  to check whether the page is placed correctly or not.


 Try to come up with a text that answers the following questions in order:

I divide them into 3 groups and answer them in 3 sections. It seems
that covers the whole story :)
Please take a look.

   - What does _PAGE_NUMA mean?
Group 1 - section 2

   - How does page migration with _PAGE_NUMA work?
   - Why should we not map pages when _PAGE_NUMA is set?
Group 2 - section 1
(Note: for the 1st question in this group, I am not sure about the
details, except that we can fix numa balancing by moving the task or
moving the page. So I comment that migration should be involved to cut
down the distance between the cpu and the pages.)

   - Which part of what needs to be done did the previous _PAGE_NUMA patch
 address?
   - What's the situation without this patch?
   - Which scenario does this patch fix?

Group 3 - section 3


Numa fault is a method which helps to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and the pages.

A pte with _PAGE_NUMA helps to implement numa faults. It means the MMU
is not allowed to access the page directly, so a page fault is triggered
and the numa fault handler gets the opportunity to run the checker.

As for MMU access, we need special handling for powernv's guests.
When we mark a pte with _PAGE_NUMA, we already call the mmu_notifier to
invalidate it in the guest's htab, but when we try to re-insert it,
we first try to fix it up in real mode. Only after this fails do we fall back
to virt mode and, most importantly, run the numa fault handler in virt
mode. This patch guards the real-mode path to ensure that if a pte is
marked with _PAGE_NUMA, it will NOT be fixed up in real mode; instead, it will
be fixed up in virt mode and have the opportunity to be checked for placement.


Thx,
Fan


 Once you have a text that answers those, you should have a good patch
 description :).

 Alex


Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-12 Thread Liu ping fan
On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote:

 On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote:

  When we mark a pte with _PAGE_NUMA we already call
  mmu_notifier_invalidate_range_start
  and mmu_notifier_invalidate_range_end, which will mark the existing guest hpte
  entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
  guest hpte entries.

 What happens when we don't? Why do we need the check? Why isn't it done 
 implicitly? What happens when we treat a NUMA marked page as non-present? Why 
 does it work out for us?

 Assume you have no idea what PAGE_NUMA is, but try to figure out what this 
 patch does and whether you need to cherry-pick it into your downstream 
 kernel. The description as it stands is still not very helpful for that. It doesn't
 even explain what really changes with this patch applied.

Yeah. What about appending the following description? Can it make
the context clear?
The guest should not set up an hpte for a page whose pte is marked with
_PAGE_NUMA, so that on the host, the numa-fault mechanism can take effect
to check whether the page is placed correctly or not.


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
 v3:
  rebased onto Alex's tree, branch kvm-ppc-next
  substitute the commit log with the clearer description from Aneesh's reply
  (thanks, Aneesh)
 ---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
 b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 index 1d6c56a..1117525 100644
 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 @@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
 flags,
   pte_size = psize;
   pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
 - if (pte_present(pte)) {
 + if (pte_present(pte)&&!pte_numa(pte)) {

 Spaces missing

Will fix,

Thx,
Fan


[PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-11 Thread Liu Ping Fan
When we mark a pte with _PAGE_NUMA we already call
mmu_notifier_invalidate_range_start
and mmu_notifier_invalidate_range_end, which will mark the existing guest hpte
entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
guest hpte entries.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v3:
  rebased onto Alex's tree, branch kvm-ppc-next
  substitute the commit log with the clearer description from Aneesh's reply
  (thanks, Aneesh)
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..1117525 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte)&&!pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4



Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-04-09 Thread Liu ping fan
On Mon, Apr 7, 2014 at 4:36 PM, Alexander Graf ag...@suse.de wrote:

 On 07.04.14 09:42, Aneesh Kumar K.V wrote:

 Alexander Graf ag...@suse.com writes:

 On 03.04.14 04:36, Liu ping fan wrote:

  Hi Alex, could you help pick up this patch, since the v3.14 kernel can
  enable numa faults for powerpc?

 What bad happens without this patch? We map a page even though it was
 declared to get NUMA migrated? What happens next?

 Nothing much, we won't be properly accounting the numa access in the
 host.  What we want to achieve is to convert a guest access of the page to
 a host fault so that we can do proper numa access accounting in the
 host. This would enable us to migrate the page to the correct numa
 node.


 Ok, so no breakages, just less performance. I wouldn't consider it stable
 material then :).

Sorry for the late reply; I am half out of office during this period.
I am puzzled by your reply. Without this patch, the guest can
NOT sense the numa changes and ask the host to put the pages in the right
place. So the pages used by the guest will always be misplaced.
The numa-fault method is motivated by a real requirement to improve
performance, so we should also consider the performance drop of the guest.
Right?

Regards,
Fan


 Alex



Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-04-02 Thread Liu ping fan
Hi Alex, could you help pick up this patch, since the v3.14 kernel can
enable numa faults for powerpc?

Thx,
Fan

On Mon, Jan 27, 2014 at 5:11 PM, Alexander Graf ag...@suse.de wrote:

 On 21.01.2014, at 10:42, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com 
 wrote:

 Liu Ping Fan kernelf...@gmail.com writes:

  To make sure that on the host, the pages marked with _PAGE_NUMA result in a
  fault when the guest accesses them, we should force the check when the guest
  uses a hypercall to set up an hpte.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

  When we mark a pte with _PAGE_NUMA we already call
  mmu_notifier_invalidate_range_start and
  mmu_notifier_invalidate_range_end, which will mark the existing guest hpte
  entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
  guest hpte entries. This patch does that.

  So what happens next? We insert a page into the HTAB without HPTE_V_VALID
  set, so the guest will fail to use it. If the guest does an H_READ on it, it
  will suddenly turn to V_VALID though?

 I might need a crash course in the use of HPTE_V_ABSENT.


 Alex



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-02-25 Thread Liu ping fan
);
}
}

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        handle_error("clock_gettime");

    /* Create a CLOCK_REALTIME absolute timer with initial
       expiration and interval as specified in command line */

    new_value.it_value.tv_sec = now.tv_sec + elapsed;
    new_value.it_value.tv_nsec = now.tv_nsec;
    new_value.it_interval.tv_sec = 0;
    new_value.it_interval.tv_nsec = 0;

    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
        handle_error("timerfd_settime");

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    /* -1: infinite wait */
    poll(&pfd, 1, -1);

    /* ask children to stop and get back cnt */
    *(pmap + SHM_CMD_OFF) = CMD_STOP;

    wait(NULL);
    dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
    printf("%s", dst_info);
    printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));

    munmap(pmap, PAGE_SIZE);
    shm_unlink(SHM_FNAME);
}




On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf ag...@suse.de wrote:

 On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

  But for this series, I think that I have a good justification -- the
  well-known fact that switching context between guest and host is costly.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

  Sorry for the unclear message. After introducing _PAGE_NUMA,
  kvmppc_do_h_enter() cannot fill up the hpte for the guest. Instead, it
  has to rely on the host's kvmppc_book3s_hv_page_fault() to call
  do_numa_page() to do the numa fault check. This incurs overhead
  when exiting from rmode to vmode. My idea is that in
  kvmppc_do_h_enter(), we do a quick check: if the page is placed correctly,
  there is no need to exit to vmode (i.e. we save the htab and slb switching).

  If my supposition is correct, I will CC k...@vger.kernel.org from the next
  version.

 This translates to me as "This is an RFC"?

 Yes, I am not quite sure about it. I have no bare metal to verify it.
 So I hope that, at least in theory, it is correct.

 Paul, could you please give this some thought and maybe benchmark it?


 Alex



[PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-01-21 Thread Liu Ping Fan
To make sure that on the host, the pages marked with _PAGE_NUMA result in a fault
when the guest accesses them, we should force the check when the guest uses a
hypercall to set up an hpte.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v2:
  It should be the reply to [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect.
  And I improved the changelog according to Aneesh's suggestion.
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
/* Look up the Linux PTE for the backing page */
pte_size = psize;
pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte) && !pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-21 Thread Liu ping fan
On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

  But for this series, I think that I have a good justification -- the
  well-known fact that switching context between guest and host is costly.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to 
 achieve what you're trying and convince your readers that it's a good 
 idea to do it the way you do it.

  Sorry for the unclear message. After introducing _PAGE_NUMA,
  kvmppc_do_h_enter() cannot fill up the hpte for the guest. Instead, it
  has to rely on the host's kvmppc_book3s_hv_page_fault() to call
  do_numa_page() to do the numa fault check. This incurs overhead
  when exiting from rmode to vmode. My idea is that in
  kvmppc_do_h_enter(), we do a quick check: if the page is placed correctly,
  there is no need to exit to vmode (i.e. we save the htab and slb switching).

 Can you explain more? Are we looking at hcalls from the guest and
 the hypervisor handling them in real mode? If so, why would the guest
 issue an hcall on a pte entry that has PAGE_NUMA set? Or is this about
 the hypervisor handling a missing hpte because the host swapped this page
 out? In that case, how do we end up in h_enter? IIUC for that case we
 should get to kvmppc_hpte_hv_fault.

 After setting _PAGE_NUMA, we should flush out all hptes both in the host's
 htab and the guest's. So when the guest tries to access memory, the host finds
 that there is no hpte ready for the guest in the guest's htab, and the host
 should raise a dsi to the guest.

 Now the guest receives that fault, removes the PAGE_NUMA bit and does an
 hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
 cleared the PAGE_NUMA bit.

That is how the guest ends up in h_enter.
 And you can see that in the current code, we also try this quick path first.
 Only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault.

 hmm? hpte_hv_fault is the hypervisor handling the fault.

After our discussion on irc, I think we should also do the fast check in
kvmppc_hpte_hv_fault() for the HPTE_V_ABSENT case, and let H_ENTER take
care of the remaining case, i.e. no hpte when pte_mknuma. Right?

Thanks and regards,
Fan
 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-21 Thread Liu ping fan
On Tue, Jan 21, 2014 at 5:07 PM, Liu ping fan kernelf...@gmail.com wrote:
 On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

  But for this series, I think that I have a good justification -- the
  well-known fact that switching context between guest and host is costly.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to 
 achieve what you're trying and convince your readers that it's a good 
 idea to do it the way you do it.

  Sorry for the unclear message. After introducing _PAGE_NUMA,
  kvmppc_do_h_enter() cannot fill up the hpte for the guest. Instead, it
  has to rely on the host's kvmppc_book3s_hv_page_fault() to call
  do_numa_page() to do the numa fault check. This incurs overhead
  when exiting from rmode to vmode. My idea is that in
  kvmppc_do_h_enter(), we do a quick check: if the page is placed correctly,
  there is no need to exit to vmode (i.e. we save the htab and slb switching).

  Can you explain more? Are we looking at hcalls from the guest and
  the hypervisor handling them in real mode? If so, why would the guest
  issue an hcall on a pte entry that has PAGE_NUMA set? Or is this about
  the hypervisor handling a missing hpte because the host swapped this page
  out? In that case, how do we end up in h_enter? IIUC for that case we
  should get to kvmppc_hpte_hv_fault.

  After setting _PAGE_NUMA, we should flush out all hptes both in the host's
  htab and the guest's. So when the guest tries to access memory, the host finds
  that there is no hpte ready for the guest in the guest's htab, and the host
  should raise a dsi to the guest.

  Now the guest receives that fault, removes the PAGE_NUMA bit and does an
  hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
  cleared the PAGE_NUMA bit.

 That is how the guest ends up in h_enter.
  And you can see that in the current code, we also try this quick path first.
  Only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault.

  hmm? hpte_hv_fault is the hypervisor handling the fault.

 After our discussion on irc, I think we should also do the fast check in
 kvmppc_hpte_hv_fault() for the HPTE_V_ABSENT case, and let H_ENTER take
 care of the remaining case, i.e. no hpte when pte_mknuma.
 Right?

Or we can delay the quick fix past H_ENTER and let the host fault
again, so the fix is done in kvmppc_hpte_hv_fault().

 Thanks and regards,
 Fan
 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-20 Thread Liu ping fan
On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm:
 Numa faults support for ppc64".

 For this series, I apply the same idea as in the previous thread "[PATCH
 0/3] optimize for powerpc _PAGE_NUMA"
 (for which I am still trying to get a machine to show numbers).

 But for this series, I think I have a good justification -- the
 well-known fact that switching context between guest and host is
 costly.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

 Sorry for the unclear message. After introducing _PAGE_NUMA,
 kvmppc_do_h_enter() cannot fill in the HPTE for the guest. Instead, it
 has to rely on the host's kvmppc_book3s_hv_page_fault() calling
 do_numa_page() to do the NUMA fault check. This incurs overhead when
 exiting from real mode to virtual mode. My idea is that in
 kvmppc_do_h_enter() we do a quick check: if the page is correctly
 placed, there is no need to exit to virtual mode (saving the HTAB and
 SLB switching).

 Can you explain more? Are we looking at hcalls from the guest and the
 hypervisor handling them in real mode? If so, why would the guest issue
 an hcall on a pte entry that has _PAGE_NUMA set? Or is this about the
 hypervisor handling a missing hpte because the host swapped this page
 out? In that case, how do we end up in h_enter? IIUC for that case we
 should get to kvmppc_hpte_hv_fault.

After setting _PAGE_NUMA, we flush out all HPTEs, both in the host's
htab and in the guest's. So when the guest tries to access memory, the
host finds that there is no hpte ready for the guest in the guest's
htab, and the host raises a DSI to the guest. This is what makes the
guest end up in h_enter. And you can see that in the current code we
also try this quick path first; only if it fails do we resort to the
slow path -- kvmppc_hpte_hv_fault.

Thanks and regards,
Fan


 If my supposition is correct, I will CC k...@vger.kernel.org from the
 next version.

 This translates to me as "This is an RFC"?

 Yes, I am not quite sure about it. I have no bare metal to verify it
 on, so I hope that at least in theory it is correct.


 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-14 Thread Liu ping fan
On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm:
 Numa faults support for ppc64".

 For this series, I apply the same idea as in the previous thread "[PATCH
 0/3] optimize for powerpc _PAGE_NUMA"
 (for which I am still trying to get a machine to show numbers).

 But for this series, I think I have a good justification -- the
 well-known fact that switching context between guest and host is
 costly.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

Sorry for the unclear message. After introducing _PAGE_NUMA,
kvmppc_do_h_enter() cannot fill in the HPTE for the guest. Instead, it
has to rely on the host's kvmppc_book3s_hv_page_fault() calling
do_numa_page() to do the NUMA fault check. This incurs overhead when
exiting from real mode to virtual mode. My idea is that in
kvmppc_do_h_enter() we do a quick check: if the page is correctly
placed, there is no need to exit to virtual mode (saving the HTAB and
SLB switching).

 If my supposition is correct, I will CC k...@vger.kernel.org from the next version.

 This translates to me as "This is an RFC"?

Yes, I am not quite sure about it. I have no bare metal to verify it on,
so I hope that at least in theory it is correct.

Thanks and regards,
Ping Fan

 Alex



[PATCH 0/4] powernv: kvm: numa fault improvement

2013-12-11 Thread Liu Ping Fan
This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm: Numa
faults support for ppc64".

For this series, I apply the same idea as in the previous thread "[PATCH 0/3]
optimize for powerpc _PAGE_NUMA"
(for which I am still trying to get a machine to show numbers).

But for this series, I think I have a good justification -- the well-known
fact that switching context between guest and host is costly.

If my supposition is correct, I will CC k...@vger.kernel.org from the next
version.


Liu Ping Fan (4):
  mm: export numa_migrate_prep()
  powernv: kvm: make _PAGE_NUMA take effect
  powernv: kvm: extend input param for lookup_linux_pte
  powernv: kvm: make the handling of _PAGE_NUMA faster for guest

 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 38 ++---
 include/linux/mm.h  |  2 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

-- 
1.8.1.4



[PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte

2013-12-11 Thread Liu Ping Fan
This will be helpful for the next patch.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Can it be merged with the next patch?
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index af8602d..ae46052 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -135,7 +135,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 }
 
 static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-                             int writing, unsigned long *pte_sizep)
+                             int writing, unsigned long *pte_sizep,
+                             pte_t **ptepp)
 {
        pte_t *ptep;
        unsigned long ps = *pte_sizep;
@@ -144,6 +145,8 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
        ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
        if (!ptep)
                return __pte(0);
+       if (ptepp != NULL)
+               *ptepp = ptep;
        if (hugepage_shift)
                *pte_sizep = 1ul << hugepage_shift;
        else
@@ -231,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
        /* Look up the Linux PTE for the backing page */
        pte_size = psize;
-       pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
+       pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
        if (pte_present(pte) && !pte_numa(pte)) {
                if (writing && !pte_write(pte))
                        /* make the actual HPTE be read-only */
@@ -671,7 +674,8 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
        memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
        if (memslot) {
                hva = __gfn_to_hva_memslot(memslot, gfn);
-               pte = lookup_linux_pte(pgdir, hva, 1, &psize);
+               pte = lookup_linux_pte(pgdir, hva, 1, &psize,
+                                      NULL);
                if (pte_present(pte) && !pte_write(pte))
                        r = hpte_make_readonly(r);
        }
-- 
1.8.1.4



[PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect

2013-12-11 Thread Liu Ping Fan
To make _PAGE_NUMA take effect, we should force the check when the
guest uses a hypercall to set up an HPTE.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
/* Look up the Linux PTE for the backing page */
pte_size = psize;
        pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-       if (pte_present(pte)) {
+       if (pte_present(pte) && !pte_numa(pte)) {
                if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4



[PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest

2013-12-11 Thread Liu Ping Fan
The periodic check of _PAGE_NUMA can happen on a correctly placed page.
For this case, when the guest tries to set up an HPTE in real mode, we
try to resolve the NUMA fault in real mode too, since switching between
guest context and host context costs too much.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index ae46052..a06b199 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -179,6 +179,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
        unsigned int writing;
        unsigned long mmu_seq;
        unsigned long rcbits;
+       struct mm_struct *mm = kvm->mm;
+       struct vm_area_struct *vma;
+       int page_nid, target_nid;
+       struct page *test_page;
+       pte_t *ptep;
 
        psize = hpte_page_size(pteh, ptel);
        if (!psize)
@@ -234,8 +239,26 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
        /* Look up the Linux PTE for the backing page */
        pte_size = psize;
-       pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
-       if (pte_present(pte) && !pte_numa(pte)) {
+       pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, &ptep);
+       if (pte_present(pte)) {
+               if (pte_numa(pte)) {
+                       /* If fail, let gup handle it */
+                       if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+                               goto pte_check;
+
+                       vma = find_vma(mm, hva);
+                       up_read(&mm->mmap_sem);
+                       test_page = pte_page(pte);
+                       page_nid = page_to_nid(test_page);
+                       target_nid = numa_migrate_prep(test_page, vma,
+                                                      hva, page_nid);
+                       put_page(test_page);
+                       if (unlikely(target_nid != -1)) {
+                               /* If fail, let gup handle it */
+                               goto pte_check;
+                       }
+               }
+
                if (writing && !pte_write(pte))
                        /* make the actual HPTE be read-only */
                        ptel = hpte_make_readonly(ptel);
@@ -244,6 +267,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
                }
        }
 
+pte_check:
        if (pte_size < psize)
                return H_PARAMETER;
        if (pa && pte_size > psize)
@@ -339,6 +363,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
                pteh &= ~HPTE_V_VALID;
                unlock_rmap(rmap);
        } else {
+               if (pte_numa(pte) && pa) {
+                       pte = pte_mknonnuma(pte);
+                       *ptep = pte;
+               }
                kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index,
                                        realmode);
                /* Only set R/C in real HPTE if already set in *rmap */
-- 
1.8.1.4



[PATCH 1/4] mm: export numa_migrate_prep()

2013-12-11 Thread Liu Ping Fan
powerpc will use it in the fast path.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab0e22..420fb77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long
 extern int mprotect_fixup(struct vm_area_struct *vma,
  struct vm_area_struct **pprev, unsigned long start,
  unsigned long end, unsigned long newflags);
+extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+   unsigned long addr, int page_nid);
 
 /*
  * doesn't attempt to fault and will return short.
-- 
1.8.1.4



Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-19 Thread Liu ping fan
On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf ag...@suse.de wrote:

 On 19.11.2013, at 07:12, Liu Ping Fan kernelf...@gmail.com wrote:

 Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
 real mode, it can trigger a deadlock.

 Suppose the following scenario:

 Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with
 a group of vcpus.

 If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
 out, and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will
 be stuck in real mode for a long time.

 What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

 Oops! Deadlock happens.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Any particular reason for the resend? The patch is already applied, no?

Oh, it seems that I misunderstood your meaning. You said "Actually, I've
changed my mind and moved the patch to the for-3.13 branch instead.
Please make sure to CC kvm@vger on all patches you submit though." So I
thought it necessary to resend with a CC to kvm@vger.

Regards,
Pingfan


Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-19 Thread Liu ping fan
On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf ag...@suse.de wrote:

 On 19.11.2013, at 07:12, Liu Ping Fan kernelf...@gmail.com wrote:

 Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
 real mode, it can trigger a deadlock.

 Suppose the following scenario:

 Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with
 a group of vcpus.

 If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
 out, and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will
 be stuck in real mode for a long time.

 What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

 Oops! Deadlock happens.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Any particular reason for the resend? The patch is already applied, no?

Oh, it seems that I misunderstood your meaning. You said "Actually, I've
changed my mind and moved the patch to the for-3.13 branch instead.
Please make sure to CC kvm@vger on all patches you submit though." So I
thought it necessary to resend with a CC to kvm@vger.

Regards,
Pingfan


[PATCH v3] powerpc: kvm: optimize sc 1 as fast return

2013-11-18 Thread Liu Ping Fan
In some scenarios, e.g. OpenStack CI, a PR guest can trigger sc 1
frequently. This patch optimizes that path by directly delivering
BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the
HV guest without a heavy exit, i.e. no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v3: add some document
---
 arch/powerpc/kvm/book3s_hv.c| 10 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..1addb1a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,10 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-               if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-                       kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
+   /* hypercall with MSR_PR has already been handled in rmode,
+* and never reaches here.
+*/
+
                run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
                for (i = 0; i < 9; ++i)
                        run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..eea7ca7 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -534,6 +534,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 5: mtspr   SPRN_SRR0, r6
mtspr   SPRN_SRR1, r7
 
+/*
+ * Required state:
+ * R4 = vcpu
+ * R10: value for HSRR0
+ * R11: value for HSRR1
+ * R13 = PACA
+ */
 fast_guest_return:
li  r0,0
stb r0,VCPU_CEDED(r4)   /* cancel cede */
@@ -1388,7 +1395,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1417,15 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+       li      r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   mr  r4,r9
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4
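
The li/rotldi pair above synthesizes MSR_SF | MSR_ME without a full 64-bit
immediate load. A quick userspace check of the trick (a sketch, assuming the
usual Linux bit numbering of MSR_SF = bit 63 and MSR_ME = bit 12; this
program is not part of the patch):

        #include <stdint.h>
        #include <stdio.h>

        #define MSR_SF (1ULL << 63)     /* 64-bit mode */
        #define MSR_ME (1ULL << 12)     /* machine check enable */

        static uint64_t rotl64(uint64_t v, unsigned int n)
        {
                return (v << n) | (v >> (64 - n));
        }

        int main(void)
        {
                uint64_t r11 = (MSR_ME << 1) | 1;  /* li r11,(MSR_ME << 1) | 1 */

                r11 = rotl64(r11, 63);             /* rotldi r11,r11,63 */
                printf("r11 = %#llx, equals MSR_SF|MSR_ME? %d\n",
                       (unsigned long long)r11,
                       r11 == (MSR_SF | MSR_ME));  /* prints 1 */
                return 0;
        }

The low bit rotates into bit 63 (MSR_SF) and MSR_ME << 1 rotates back down to
MSR_ME, so the guest re-enters with exactly those two bits set.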



[PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-18 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
real mode, it can trigger a deadlock.

Suppose the following scenario:

Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with
a group of vcpus.

If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
out, and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will
be stuck in real mode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
                slb_v = vcpu->kvm->arch.vrma_slb_v;
        }
 
+       preempt_disable();
        /* Find the HPTE in the hash table */
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
                                         HPTE_V_VALID | HPTE_V_ABSENT);
-       if (index < 0)
+       if (index < 0) {
+               preempt_enable();
                return -ENOENT;
+       }
        hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
        v = hptep[0] & ~HPTE_V_HVLOCK;
        gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        /* Unlock the HPTE */
        asm volatile("lwsync" : : : "memory");
        hptep[0] = v;
+       preempt_enable();
 
        gpte->eaddr = eaddr;
        gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
        20,     /* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
                              unsigned long valid)
 {
-- 
1.8.1.4
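
The scenario in the commit message is an ABBA ordering problem made fatal by
the fact that a real-mode spinner is never preempted. A userspace analogue
(hypothetical, C11 atomics plus pthreads, simplified to one thread per lock
order) that reproduces the shape of the hang:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <unistd.h>

        static atomic_flag lock_x = ATOMIC_FLAG_INIT;   /* bitlock X */
        static atomic_flag lock_y = ATOMIC_FLAG_INIT;   /* bitlock Y */

        static void bit_lock(atomic_flag *l)
        {
                while (atomic_flag_test_and_set(l))
                        ;       /* spin, as a real-mode lock loop does */
        }

        static void *takes_x_then_y(void *arg)  /* like the cpuM side */
        {
                bit_lock(&lock_x);
                sleep(1);       /* models the holder being switched out */
                bit_lock(&lock_y);
                return NULL;
        }

        static void *takes_y_then_x(void *arg)  /* like the cpuN side */
        {
                bit_lock(&lock_y);
                sleep(1);
                bit_lock(&lock_x);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&b, NULL, takes_x_then_y, NULL);
                pthread_create(&a, NULL, takes_y_then_x, NULL);
                sleep(3);       /* both threads are still spinning by now */
                puts("ABBA deadlock, as the commit message predicts");
                return 0;       /* exiting main kills the spinners */
        }

Disabling preemption around the lock/unlock window removes the "holder gets
switched out" step, which is exactly what the patch does on the virtual-mode
side.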



[PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-18 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
real mode, it can trigger a deadlock.

Suppose the following scenario:

Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with
a group of vcpus.

If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
out, and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will
be stuck in real mode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
                slb_v = vcpu->kvm->arch.vrma_slb_v;
        }
 
+       preempt_disable();
        /* Find the HPTE in the hash table */
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
                                         HPTE_V_VALID | HPTE_V_ABSENT);
-       if (index < 0)
+       if (index < 0) {
+               preempt_enable();
                return -ENOENT;
+       }
        hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
        v = hptep[0] & ~HPTE_V_HVLOCK;
        gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        /* Unlock the HPTE */
        asm volatile("lwsync" : : : "memory");
        hptep[0] = v;
+       preempt_enable();
 
        gpte->eaddr = eaddr;
        gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
        20,     /* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
                              unsigned long valid)
 {
-- 
1.8.1.4



[PATCH v3] powerpc: kvm: optimize sc 1 as fast return

2013-11-18 Thread Liu Ping Fan
In some scenarios, e.g. OpenStack CI, a PR guest can trigger sc 1
frequently. This patch optimizes that path by directly delivering
BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the
HV guest without a heavy exit, i.e. no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v3: add some documentation
---
 arch/powerpc/kvm/book3s_hv.c| 10 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..1addb1a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,10 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-               if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-                       kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
+   /* hypercall with MSR_PR has already been handled in rmode,
+* and never reaches here.
+*/
+
                run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
                for (i = 0; i < 9; ++i)
                        run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..eea7ca7 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -534,6 +534,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 5: mtspr   SPRN_SRR0, r6
mtspr   SPRN_SRR1, r7
 
+/*
+ * Required state:
+ * R4 = vcpu
+ * R10: value for HSRR0
+ * R11: value for HSRR1
+ * R13 = PACA
+ */
 fast_guest_return:
li  r0,0
stb r0,VCPU_CEDED(r4)   /* cancel cede */
@@ -1388,7 +1395,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1417,15 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+       li      r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   mr  r4,r9
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



[PATCH v3] powerpc: kvm: optimize sc 1 as fast return

2013-11-17 Thread Liu Ping Fan
In some scenarios, e.g. OpenStack CI, a PR guest can trigger sc 1
frequently. This patch optimizes that path by directly delivering
BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the
HV guest without a heavy exit, i.e. no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +++-
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-               if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-                       kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
                run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
                for (i = 0; i < 9; ++i)
                        run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..0d1e2c2 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,15 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+       li      r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   mr  r4,r9
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



[PATCH v2] powerpc: kvm: optimize sc 1 as fast return

2013-11-15 Thread Liu Ping Fan
In some scenarios, e.g. OpenStack CI, a PR guest can trigger sc 1
frequently. This patch optimizes that path by directly delivering
BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the
HV guest without a heavy exit, i.e. no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-               if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-                       kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
                run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
                for (i = 0; i < 9; ++i)
                        run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..a463f08 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,14 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+       li      r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



[PATCH v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-15 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
real mode, it can trigger a deadlock.

Suppose the following scenario:

Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with
a group of vcpus.

If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
out, and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will
be stuck in real mode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v4: remove the over-engineered part and keep it simple, also add some notes.
  
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
                slb_v = vcpu->kvm->arch.vrma_slb_v;
        }
 
+       preempt_disable();
        /* Find the HPTE in the hash table */
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
                                         HPTE_V_VALID | HPTE_V_ABSENT);
-       if (index < 0)
+       if (index < 0) {
+               preempt_enable();
                return -ENOENT;
+       }
        hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
        v = hptep[0] & ~HPTE_V_HVLOCK;
        gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        /* Unlock the HPTE */
        asm volatile("lwsync" : : : "memory");
        hptep[0] = v;
+       preempt_enable();
 
        gpte->eaddr = eaddr;
        gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
        20,     /* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
                              unsigned long valid)
 {
-- 
1.8.1.4



Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-08 Thread Liu ping fan
On Fri, Nov 8, 2013 at 12:19 PM, Liu ping fan kernelf...@gmail.com wrote:
 On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote:

 On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:

 syscall is a very common operation inside a guest, and this patch
 optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so
 the hypervisor can return to the guest without a heavy exit, i.e. no
 need to swap the TLB, HTAB, etc.

 The syscall exit you touch here only happens when you do an "sc 0" with
 MSR_PR set inside the guest. The only case you realistically see this
 is when you run PR KVM inside of an HV KVM guest.

 Maybe I misunderstood the ISA spec, but referring to 6.5.14 "System
 Call Interrupt", there is no description of MSR_PR when sc triggers a
 syscall interrupt. So I think a guest application's sc 0 will also fall
 to the kernel that owns hypervisor mode. Am I right?

Some further comment: I think the essence of the problem is whether we
switch the RMA from guest to HV when an interrupt is raised. DSI/ISI
will be redirected to HDSI with an RMA switch. But what about SYSCALL,
DEC, external interrupts, etc.?

 I don't think we should optimize for that case. Instead, we should rather 
 try to not bounce to the 1st hypervisor in the first place in that scenario 
 :).

 Sorry, but I just want to make the idiom clear: "0" means the kernel
 runs with NV, and "1st" means the kernel runs on HV-KVM and provides
 PR-KVM to the upper layer? Right?

 When you say "try to not bounce to the 1st hypervisor", what is the
 exact meaning and how can we achieve this? I am quite new to powerpc,
 and hope I can get a clearer picture of it :)


Thanks
Pingfan


Re: [PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene

2013-11-07 Thread Liu ping fan
On Thu, Nov 7, 2013 at 5:54 PM, Alexander Graf ag...@suse.de wrote:

 On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:

 Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
 real mode, it can trigger a deadlock.

 Suppose the following scenario:

 Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with a group of vcpus.

 If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out,
 and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will be stuck in
 real mode for a long time.

 What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

 Oops! Deadlock happens.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Very nice catch :).

 I think it makes a lot of sense to document the fact that 
 kvmppc_hv_find_lock_hpte() should be called with preemption disabled in a 
 comment above the function, so that next time when someone potentially calls 
 it, he knows that he needs to put preempt_disable() around it.

Ok, I will document them in v3

 Thanks a lot for finding this pretty subtle issue. May I ask how you got 
 there? Did you actually see systems deadlock because of this?

Intuition :). Then I began trying to model a scenario that causes the
deadlock, and fortunately I found a case to verify my suspicion.

Regards,
Pingfan

 Alex

 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 043eec8..dbc1478 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        }
 
        /* Find the HPTE in the hash table */
 +      preempt_disable();
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
                                         HPTE_V_VALID | HPTE_V_ABSENT);
 -      if (index < 0)
 +      if (index < 0) {
 +              preempt_enable();
                return -ENOENT;
 +      }
        hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
        v = hptep[0] & ~HPTE_V_HVLOCK;
        gr = kvm->arch.revmap[index].guest_rpte;
 @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        /* Unlock the HPTE */
        asm volatile("lwsync" : : : "memory");
        hptep[0] = v;
 +      preempt_enable();
 
        gpte->eaddr = eaddr;
        gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
 --
 1.8.1.4
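
For reference, HPTE_V_HVLOCK is a bit spinlock kept inside the HPTE word
itself. A generic userspace model of that idiom (a sketch -- the lock-bit
position and payload here are made up, only the technique matches):

        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>

        #define LOCK_BIT 1ULL   /* assumption: lock flag lives in the word */

        static _Atomic uint64_t hpte_v = 0x1234ULL << 1;  /* payload */

        static int try_lock(_Atomic uint64_t *w)
        {
                /* atomically set the bit; we got the lock if it was clear */
                return !(atomic_fetch_or(w, LOCK_BIT) & LOCK_BIT);
        }

        static void unlock(_Atomic uint64_t *w, uint64_t newval)
        {
                /* a plain store of the new value clears the lock bit */
                atomic_store(w, newval & ~LOCK_BIT);
        }

        int main(void)
        {
                uint64_t v;

                while (!try_lock(&hpte_v))
                        ;       /* spin, like try_lock_hpte() */
                v = atomic_load(&hpte_v) & ~LOCK_BIT;   /* read sans lock bit */
                printf("payload %#llx\n", (unsigned long long)v);
                unlock(&hpte_v, v);
                return 0;
        }

Because the lock is just a bit in the data word, anyone reading the HPTE must
mask it off, which is why the snippets above keep computing
hptep[0] & ~HPTE_V_HVLOCK.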




Re: [PATCH v2 2/2] powerpc/kvm: remove redundant assignment

2013-11-07 Thread Liu ping fan
On Thu, Nov 7, 2013 at 6:06 PM, Alexander Graf ag...@suse.de wrote:

 On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:

 ret is assigned twice with the same value, so remove the latter one.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 Acked-by: Paul Mackerras pau...@samba.org

 I suppose my last request for a patch description was slightly too 
 abbreviated :). Sorry about that.

 Imagine you are a Linux-stable maintainer. You have about 5000 patches in 
 front of you and you want to figure out whether a patch should get backported 
 into a stable tree or not.

 It's very easy to read through the patch description.
 It's reasonably easy to do a git show on the patch.
 It's very hard to look at the actual surrounding code that was changed.

 If I open a text editor on the file, I immediately see what you're saying:

        ret = RESUME_GUEST;
        preempt_disable();
        while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
                cpu_relax();
        if ((hptep[0] & ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] ||
            rev->guest_rpte != hpte[2])
                /* HPTE has been changed under us; let the guest retry */
                goto out_unlock;
        hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;

        rmap = &memslot->arch.rmap[gfn - memslot->base_gfn];
        lock_rmap(rmap);

        /* Check if we might have been invalidated; let the guest retry if so */
        ret = RESUME_GUEST;

 However, that scope is not given in the actual patch itself. If you look at 
 the diff below, you have no idea whether the patch is fixing a bug or just 
 removes duplication and doesn't actually have any effect. In fact, the 
 compiled assembly should be the same with this patch and without. But you 
 can't tell from the diff below.

 So what I would like to see in the patch description is something that makes 
 it easy to understand what's going on without the need to check out the 
 source file. Something like

 We redundantly set ret to RESUME_GUEST twice without changing it in between. 
 Only do it once:

        ret = RESUME_GUEST;
        preempt_disable();
        while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
                cpu_relax();
        if ((hptep[0] & ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] ||
            rev->guest_rpte != hpte[2])
                /* HPTE has been changed under us; let the guest retry */
                goto out_unlock;
        hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;

        rmap = &memslot->arch.rmap[gfn - memslot->base_gfn];
        lock_rmap(rmap);

        /* Check if we might have been invalidated; let the guest retry if so */
        ret = RESUME_GUEST;


 If I look at that patch description, it immediately tells me "Ah, no
 need to worry, it's not a critical bug I need to backport." If you have
 a better idea how to express that, I'm more than happy to take that
 too. Otherwise just let me know whether you like the description above
 and I'll modify it to the one that includes the code snippet when
 applying the patch.

Oh, yes. Thank you very much.
Now I have a better understanding of the heavy work maintainers do :)
I will keep this in mind.

Best regards,
Pingfan


[PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu Ping Fan
syscall is a very common operation inside a guest, and this patch
optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so
the hypervisor can return to the guest without a heavy exit, i.e. no
need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Compiled, but lacking bare metal, I have not tested it yet.
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 -
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-               if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-                       kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
                run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
                for (i = 0; i < 9; ++i)
                        run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..9f626c3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_0_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,16 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_0_fast_return:
+   ld  r10,VCPU_PC(r9)
+   ld  r11,VCPU_MSR(r9)
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+   LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
+   and r11,r11,r3
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu ping fan
On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote:

 On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:

 syscall is a very common operation inside a guest, and this patch
 optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so
 the hypervisor can return to the guest without a heavy exit, i.e. no
 need to swap the TLB, HTAB, etc.

 The syscall exit you touch here only happens when you do an "sc 0" with
 MSR_PR set inside the guest. The only case you realistically see this
 is when you run PR KVM inside of an HV KVM guest.

Maybe I misunderstood the ISA spec, but referring to 6.5.14 "System
Call Interrupt", there is no description of MSR_PR when sc triggers a
syscall interrupt. So I think a guest application's sc 0 will also fall
to the kernel that owns hypervisor mode. Am I right?

 I don't think we should optimize for that case. Instead, we should rather try 
 to not bounce to the 1st hypervisor in the first place in that scenario :).

Sorry, but I just want to make the idiom clear: "0" means the kernel
runs with NV, and "1st" means the kernel runs on HV-KVM and provides
PR-KVM to the upper layer? Right?

When you say "try to not bounce to the 1st hypervisor", what is the
exact meaning and how can we achieve this? I am quite new to powerpc,
and hope I can get a clearer picture of it :)

Thanks

Pingfan

 Alex


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
 Compiled, but lacking bare metal, I have not tested it yet.
 ---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 -
 2 files changed, 12 insertions(+), 7 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
 index 62a2b5a..73dc852 100644
 --- a/arch/powerpc/kvm/book3s_hv.c
 +++ b/arch/powerpc/kvm/book3s_hv.c
 @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
   /* hcall - punt to userspace */
   int i;

 -      if (vcpu->arch.shregs.msr & MSR_PR) {
 - /* sc 1 from userspace - reflect to guest syscall */
 -              kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
 - r = RESUME_GUEST;
 - break;
 - }
        run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
        for (i = 0; i < 9; ++i)
                run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index c71103b..9f626c3 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
   ld  r3,VCPU_GPR(R3)(r9)
   andi.   r0,r11,MSR_PR
 - bne guest_exit_cont
 + /* sc 1 from userspace - reflect to guest syscall */
 + bne sc_0_fast_return
   clrrdi  r3,r3,2
   cmpldi  r3,hcall_real_table_end - hcall_real_table
   bge guest_exit_cont
 @@ -1409,6 +1410,16 @@ hcall_try_real_mode:
   ld  r11,VCPU_MSR(r4)
   b   fast_guest_return

 +sc_0_fast_return:
 + ld  r10,VCPU_PC(r9)
 + ld  r11,VCPU_MSR(r9)
 + mtspr   SPRN_SRR0,r10
 + mtspr   SPRN_SRR1,r11
 + li  r10, BOOK3S_INTERRUPT_SYSCALL
 + LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
 + and r11,r11,r3
 + b   fast_guest_return
 +
   /* We've attempted a real mode hcall, but it's punted it back
* to userspace.  We need to restore some clobbered volatiles
* before resuming the pass-it-to-qemu path */
 --
 1.8.1.4




Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu ping fan
On Fri, Nov 8, 2013 at 12:11 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Fri, 2013-11-08 at 15:05 +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2013-11-08 at 04:10 +0100, Alexander Graf wrote:
  On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:
 
  syscall is a very common operation inside a guest, and this patch
  optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so
  the hypervisor can return to the guest without a heavy exit, i.e. no
  need to swap the TLB, HTAB, etc.
 
  The syscall exit you touch here only happens when you do an "sc 0"
  with MSR_PR set inside the guest. The only case you realistically see
  this is when you run PR KVM inside of an HV KVM guest.
 
  I don't think we should optimize for that case. Instead, we should
  rather try to not bounce to the 1st hypervisor in the first place in
  that scenario :).

 Well, so unfortunately openstack CI uses PR inside HV pretty heavily,
 and it *might* be worthwhile optimizing that path if the patch is
 simple enough... I'd make that Paul's call.

 Note that this is a statement of value for the idea ... not the
 implementation ;-) From a quick look with Paulus, the patch is quite
 broken. I'll let Paul comment in detail.

Thank you very much,

Regards,
Pingfan


[PATCH v3 2/2] powerpc: kvm: fix rare but potential deadlock scene

2013-11-07 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
real mode, it can trigger a deadlock.

Suppose the following scenario:

Two physical CPUs, cpuM and cpuN; two VM instances, A and B, each with a group
of vcpus.

If on cpuM vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out,
and on cpuN vcpu_A_2 tries to lock X in real mode, then cpuN will be stuck in
real mode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in real mode
  vcpu_A_2 tries to lock X on cpuN in real mode

Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  5 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 20 
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index a818932..3d710ba 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -129,9 +129,9 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
 extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
                        struct kvm_vcpu *vcpu, unsigned long addr,
                        unsigned long status);
-extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val);
+extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val, bool vmode);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
-                       unsigned long slb_v, unsigned long valid);
+                       unsigned long slb_v, unsigned long valid, bool vmode);
 
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 97685e7..12d9635 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -475,13 +475,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 
        /* Find the HPTE in the hash table */
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
-                                        HPTE_V_VALID | HPTE_V_ABSENT);
+                                        HPTE_V_VALID | HPTE_V_ABSENT,
+                                        true);
        if (index < 0)
                return -ENOENT;
        hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
        v = hptep[0];
        gr = kvm->arch.revmap[index].guest_rpte;
-       kvmppc_hv_unlock_hpte(hptep, &v);
+       kvmppc_hv_unlock_hpte(hptep, &v, true);
 
        gpte->eaddr = eaddr;
        gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 0ff9e91..18a9425 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,16 +749,22 @@ static int slb_base_page_shift[4] = {
        20,     /* 1M, unsupported */
 };
 
-void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val)
+void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val,
+               bool vmode)
 {
        *hpte_val = *hpte_val & ~HPTE_V_HVLOCK;
        asm volatile("lwsync" : : : "memory");
        *hptep = *hpte_val;
+       if (unlikely(vmode))
+               preempt_enable();
 }
 EXPORT_SYMBOL(kvmppc_hv_unlock_hpte);
 
+/* If called from virtmode and success to lock, then the context will be set
+ * as preemption disabled
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
-                             unsigned long valid)
+                             unsigned long valid, bool vmode)
 {
        unsigned int i;
        unsigned int pshift;
@@ -796,6 +802,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
        avpn &= ~0x7fUL;
        val |= avpn;
 
+       if (unlikely(vmode))
+               preempt_disable();
+
        for (;;) {
                hpte = (unsigned long *)(kvm->arch.hpt_virt + (hash << 7));
 
@@ -833,6 +842,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
                val |= HPTE_V_SECONDARY;
                hash = hash ^ kvm->arch.hpt_mask;
        }
+
+       if (unlikely(vmode))
+               preempt_enable();
        return -1;
 }
 EXPORT_SYMBOL(kvmppc_hv_find_lock_hpte);
@@ -864,7 +876,7 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr,
        if (status & DSISR_NOHPTE)
                valid |= HPTE_V_ABSENT;
 
-       index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid);
+       index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid, false);
        if (index < 0) {
                if (status & DSISR_NOHPTE)
                        return status

[PATCH v3 0/2] powerpc kvm: fix deadlock scene

2013-11-07 Thread Liu Ping Fan
v2->v3:
  introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte()
  and hide the preemption detail inside this pair from the callers

Liu Ping Fan (2):
  powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte
  powerpc: kvm: fix rare but potential deadlock scene

 arch/powerpc/include/asm/kvm_book3s.h |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 10 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 29 -
 3 files changed, 30 insertions(+), 12 deletions(-)

-- 
1.8.1.4
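
The caller-visible effect of the pairing (a sketch distilled from the
[PATCH v3 2/2] diff above; "unlock_hpte()" stands in for the open-coded
unlock sequence and is not a real kernel symbol):

        /* v2: every virtual-mode caller managed both halves itself */
        preempt_disable();
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, valid);
        /* ... use the locked HPTE ... */
        unlock_hpte();          /* open-coded store clearing HPTE_V_HVLOCK */
        preempt_enable();

        /* v3: the lock/unlock pair hides the preemption detail */
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, valid, true);
        /* ... use the locked HPTE ... */
        kvmppc_hv_unlock_hpte(hptep, &v, true);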



[PATCH v2 2/2] powerpc/kvm: remove redundant assignment

2013-11-06 Thread Liu Ping Fan
ret is assigned twice with the same value, so remove the latter one.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Acked-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index dbc1478..9b97b42 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -733,7 +733,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
lock_rmap(rmap);
 
/* Check if we might have been invalidated; let the guest retry if so */
-   ret = RESUME_GUEST;
        if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
unlock_rmap(rmap);
goto out_unlock;
-- 
1.8.1.4



Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene

2013-11-05 Thread Liu ping fan
On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote:
 On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote:
 Since kvmppc_hv_find_lock_hpte() is called from both virtual mode and
 real mode, it can trigger a deadlock.

 Good catch, we should have preemption disabled whenever we have an
 HPTE locked.

 @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
        }
 
        /* Find the HPTE in the hash table */
 +      preempt_disable();
        index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
                                         HPTE_V_VALID | HPTE_V_ABSENT);
 +      preempt_enable();

 Which means we need to add the preempt_enable after unlocking the
 HPTE, not here.

Yes. Sorry, but I am not sure whether we can call
preempt_disable/enable() in real mode. I think that since thread_info
is allocated at a linear address, we can use preempt_disable/enable()
inside kvmppc_hv_find_lock_hpte(), right?

Thanks and regards,
Pingfan

 Regards,
 Paul.
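
For the record on the real-mode question: at the time, preempt_disable()
expanded to roughly the following (a sketch, not a verbatim quote of the
kernel macro) -- just a counter bump in thread_info, which on ppc64 sits
at the bottom of the kernel stack in the linear mapping and is therefore
reachable even with relocation off:

        #define preempt_disable()                               \
        do {                                                    \
                current_thread_info()->preempt_count++;         \
                barrier();                                      \
        } while (0)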


[PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene

2013-11-04 Thread Liu Ping Fan
kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode,
so it can trigger a deadlock.

Suppose the following scenario:

Two physical cpus M and N, and two VM instances A and B, each with a
group of vcpus.

If vcpu_A_1 on cpuM holds bitlock X (HPTE_V_HVLOCK) and is then
switched out, and vcpu_A_2 on cpuN tries to lock X in realmode, then
cpuN will be stuck in realmode for a long time.

Things get even worse if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode.
  vcpu_A_2 tries to lock X on cpuN in realmode.

Oops! A deadlock happens.
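
(For reference: the bitlock is taken by a spin loop of roughly the
following shape -- try_lock_hpte() is the existing helper, the loop is
paraphrased -- and a realmode spinner cannot be preempted, so it
simply burns its cpu until the lock holder runs again:

	while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
		cpu_relax();	/* spins forever if the holder is scheduled out */
)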

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 043eec8..28160ac 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
}
 
/* Find the HPTE in the hash table */
+   preempt_disable();
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
+   preempt_enable();
	if (index < 0)
return -ENOENT;
	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-- 
1.8.1.4



[PATCH 3/3] powerpc/kvm: remove redundant assignment

2013-11-04 Thread Liu Ping Fan
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 28160ac..7682837 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -731,7 +731,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
lock_rmap(rmap);
 
/* Check if we might have been invalidated; let the guest retry if so */
-   ret = RESUME_GUEST;
	if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
unlock_rmap(rmap);
goto out_unlock;
-- 
1.8.1.4



[PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread

2013-11-04 Thread Liu Ping Fan
After the primary vcpu changes vcore_state to VCORE_RUNNING, there is
very little chance for a secondary vcpu to be scheduled in. So if we
swap the order of setting vcore_state to VCORE_RUNNING and disabling
preemption, we lose little, but we simplify the entering logic, based
on the fact that while the primary vcpu runs, the secondary vcpu
cannot be scheduled.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..38b1fc0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1222,8 +1222,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_create_dtl_entry(vcpu, vc);
}
 
-	vc->vcore_state = VCORE_RUNNING;
 	preempt_disable();
+	vc->vcore_state = VCORE_RUNNING;
 	spin_unlock(&vc->lock);
 
kvm_guest_enter();
@@ -1351,12 +1351,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 * this thread straight away and have it join in.
 */
if (!signal_pending(current)) {
-		if (vc->vcore_state == VCORE_RUNNING &&
-		    VCORE_EXIT_COUNT(vc) == 0) {
-			vcpu->arch.ptid = vc->n_runnable - 1;
-			kvmppc_create_dtl_entry(vcpu, vc);
-			kvmppc_start_thread(vcpu);
-		} else if (vc->vcore_state == VCORE_SLEEPING) {
+		if (vc->vcore_state == VCORE_SLEEPING) {
 			wake_up(&vc->wq);
}
 
-- 
1.8.1.4



[PATCH] KVM: IRQFD: equip irqfd and resamplefd with polarity

2013-09-11 Thread Liu Ping Fan
Nowadays, irqfd can emulate trigger mode, but it cannot emulate
trigger polarity, while in some cases an ioapic ioredtbl[x] expects
active-low. So equip irqfd with this ability. Correspondingly,
resamplefd will have the same polarity as the irqfd.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
This helps to step around making the interrupt component re-entrant
in qemu
---
 include/linux/kvm_host.h |  1 +
 include/uapi/linux/kvm.h |  2 ++
 virt/kvm/eventfd.c   | 32 ++--
 3 files changed, 25 insertions(+), 10 deletions(-)
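
For illustration, userspace would request a low-active line roughly
like this (hypothetical usage, assuming this patch were applied;
vm_fd, gsi and the two eventfds are created/chosen by the caller):

	struct kvm_irqfd irqfd = {
		.fd         = eventfd_fd,
		.gsi        = gsi,
		.flags      = KVM_IRQFD_FLAG_RESAMPLE | KVM_IRQFD_FLAG_POLARITY,
		.resamplefd = resample_fd,
	};

	/* signalling eventfd_fd now drives the line to its active (low)
	 * level; the resample ack returns it to the inactive level */
	ioctl(vm_fd, KVM_IRQFD, &irqfd);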

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a63d83e..0b8c3b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -678,6 +678,7 @@ bool kvm_is_mmio_pfn(pfn_t pfn);
 struct kvm_irq_ack_notifier {
struct hlist_node link;
unsigned gsi;
+   int polarity; /* 0 high active */
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
 };
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..bba3a1b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -740,6 +740,8 @@ struct kvm_xen_hvm_config {
  * emulation.  See Documentation/virtual/kvm/api.txt.
  */
 #define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
+/* 0: high-active */
+#define KVM_IRQFD_FLAG_POLARITY (1 << 2)
 
 struct kvm_irqfd {
__u32 fd;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 1550637..865c656 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -77,6 +77,7 @@ struct _irqfd {
struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
/* Used for level IRQ fast-path */
int gsi;
+   int polarity; /* 0 high active */
struct work_struct inject;
/* The resampler used by this irqfd (resampler-only) */
struct _irqfd_resampler *resampler;
@@ -100,13 +101,13 @@ irqfd_inject(struct work_struct *work)
	struct kvm *kvm = irqfd->kvm;
 
	if (!irqfd->resampler) {
-		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1,
-			    false);
-		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0,
-			    false);
+		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi,
+			    !irqfd->polarity, false);
+		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi,
+			    !!irqfd->polarity, false);
	} else
		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
-			    irqfd->gsi, 1, false);
+			    irqfd->gsi, !irqfd->polarity, false);
 }
 
 /*
@@ -123,7 +124,8 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
resampler = container_of(kian, struct _irqfd_resampler, notifier);
 
	kvm_set_irq(resampler->kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
-		    resampler->notifier.gsi, 0, false);
+		    resampler->notifier.gsi,
+		    !!resampler->notifier.polarity, false);
 
rcu_read_lock();
 
@@ -148,7 +150,7 @@ irqfd_resampler_shutdown(struct _irqfd *irqfd)
	list_del(&resampler->link);
	kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
	kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
-		    resampler->notifier.gsi, 0, false);
+		    resampler->notifier.gsi, !!irqfd->polarity, false);
kfree(resampler);
}
 
@@ -302,6 +304,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 
	irqfd->kvm = kvm;
	irqfd->gsi = args->gsi;
+	irqfd->polarity = !!(args->flags & KVM_IRQFD_FLAG_POLARITY);
	INIT_LIST_HEAD(&irqfd->list);
	INIT_WORK(&irqfd->inject, irqfd_inject);
	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
@@ -337,8 +340,15 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
		list_for_each_entry(resampler,
				    &kvm->irqfds.resampler_list, link) {
			if (resampler->notifier.gsi == irqfd->gsi) {
-				irqfd->resampler = resampler;
-				break;
+				if (likely(resampler->notifier.polarity ==
+				    irqfd->polarity)) {
+					irqfd->resampler = resampler;
+					break;
+				} else {
+					ret = -EBUSY;
+					mutex_unlock(&kvm->irqfds.resampler_lock);
+					goto fail;
+				}
			}
		}
 
@@ -353,6 +363,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
		resampler->kvm = kvm;
		INIT_LIST_HEAD(&resampler->list);
resampler


Re: What's the usage model (purpose) of interrupt remapping in IOMMU?

2013-08-13 Thread Liu ping fan
On Wed, Nov 2, 2011 at 11:31 PM, Alex Williamson
alex.william...@redhat.com wrote:
 On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote:
 Hi,

 In case of direct io, without the interrupt remapping in IOMMU (intel
 VT-d or AMD IOMMU), hypervisor needs to inject interrupt for guest
 when the guest is scheduled to specific CPU. At the beginning I
 thought with IOMMU's interrupt remapping, the hardware can directly
 forward the interrupt to guest without trapping into hypervisor when
 the interrupt happens, but after reading the Intel VT-d's manual, I
 found the interrupt mapping feature just add another mapping which
 allows software to control (mainly) the destination and vector, and we
 still need hypervisor to inject the interrupt when the guest is
 scheduled as only after the guest is scheduled, the target CPU can be
 known. If my understanding is correct, seems the interrupt remapping
 does not bring any performance improvement. So what's the benefit of
 IOMMU's interrupt remapping? Can someone explain the usage model of
 interrupt remapping in IOMMU?

 Interrupt remapping provides isolation and compatibility, not

The guest cannot directly program MSI-X on the pci device, so MSI-X is
still under the control of the host. Why do we need the extra control
introduced by the iommu?

Thanks,
Pingfan
 performance.  The hypervisor being able to direct interrupts to a target
 CPU also allows it the ability to filter interrupts and prevent the
 device from signaling spurious or malicious interrupts.  This is
 particularly important with message signaled interrupts since any device
 capable of DMA is able to inject random MSIs into the host.  The
 compatibility side is a feature of Intel platforms supporting x2apic.
 The interrupt remapper provides a translation layer to allow xapic aware
 hardware, such as ioapics, to function when the processors are switched
 to x2apic mode.  Thanks,

 Alex



Re: What's the usage model (purpose) of interrupt remapping in IOMMU?

2013-08-13 Thread Liu ping fan
On Wed, Aug 14, 2013 at 10:50 AM, Alex Williamson
alex.william...@redhat.com wrote:
 On Wed, 2013-08-14 at 10:37 +0800, Liu ping fan wrote:
 On Wed, Nov 2, 2011 at 11:31 PM, Alex Williamson
 alex.william...@redhat.com wrote:
  On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote:
  Hi,
 
  In case of direct io, without the interrupt remapping in IOMMU (intel
  VT-d or AMD IOMMU), hypervisor needs to inject interrupt for guest
  when the guest is scheduled to specific CPU. At the beginning I
  thought with IOMMU's interrupt remapping, the hardware can directly
  forward the interrupt to guest without trapping into hypervisor when
  the interrupt happens, but after reading the Intel VT-d's manual, I
  found the interrupt mapping feature just add another mapping which
  allows software to control (mainly) the destination and vector, and we
  still need hypervisor to inject the interrupt when the guest is
  scheduled as only after the guest is scheduled, the target CPU can be
  known. If my understanding is correct, seems the interrupt remapping
  does not bring any performance improvement. So what's the benefit of
  IOMMU's interrupt remapping? Can someone explain the usage model of
  interrupt remapping in IOMMU?
 
  Interrupt remapping provides isolation and compatibility, not

 The guest can not directly program the msi-x on pci device, so msix is
 still under the control of host. Why do we need extra control
 introduced by iommu ?

 An MSI interrupt is just a DMA write with a specific address and
 payload.  Any device capable of DMA can theoretically inject an MSI
 interrupt using other means besides the MSI/MSI-X configuration areas.

Thank you for the clear explanation.

Thanks and regards,
Pingfan


Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction

2012-10-25 Thread liu ping fan
On Thu, Oct 25, 2012 at 1:16 AM, Vasilis Liaskovitis
vasilis.liaskovi...@profitbricks.com wrote:
 Hi,

 On Wed, Oct 24, 2012 at 12:15:17PM +0200, Stefan Hajnoczi wrote:
 On Wed, Oct 24, 2012 at 10:06 AM, liu ping fan qemul...@gmail.com wrote:
  On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com 
  wrote:
  On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote:
  +static void dimm_populate(DimmDevice *s)
  +{
  +    DeviceState *dev = (DeviceState *)s;
  +    MemoryRegion *new = NULL;
  +
  +    new = g_malloc(sizeof(MemoryRegion));
  +    memory_region_init_ram(new, dev->id, s->size);
  +    vmstate_register_ram_global(new);
  +    memory_region_add_subregion(get_system_memory(), s->start, new);
  +    s->mr = new;
  +}
  +
  +static void dimm_depopulate(DimmDevice *s)
  +{
  +    assert(s);
  +    vmstate_unregister_ram(s->mr, NULL);
  +    memory_region_del_subregion(get_system_memory(), s->mr);
  +    memory_region_destroy(s->mr);
  +    s->mr = NULL;
  +}
 
  How is dimm hot unplug protected against callers who currently have RAM
  mapped (from cpu_physical_memory_map())?
 
  Emulated devices call cpu_physical_memory_map() directly or indirectly
  through DMA emulation code.  The RAM pointer may be held for arbitrary
  lengths of time, across main loop iterations, etc.
 
  It's not clear to me that it is safe to unplug a DIMM that has network
  or disk I/O buffers, for example.  We also need to be robust against
  malicious guests who abuse the hotplug lifecycle.  QEMU should never be
  left with dangling pointers.
 
  Not sure about the block layer. But I think those thread are already
  out of big lock, so there should be a MemoryListener to catch the
  RAM-unplug event, and if needed, bdrv_flush.

 do we want bdrv_flush, or some kind of cancel request e.g. bdrv_aio_cancel?

My original meaning was to flush out the dangling pointer.

 Here is the detailed scenario:

 1. Emulated device does cpu_physical_memory_map() and gets a pointer
 to guest RAM.
 2. Return to vcpu or iothread, continue processing...
 3. Hot unplug of RAM causes the guest RAM to disappear.
 4. Pending I/O completes and overwrites memory from dangling guest RAM 
 pointer.

 Any I/O device that does zero-copy I/O in QEMU faces this problem:
  * The block layer is affected.
  * The net layer is unaffected because it doesn't do zero-copy tx/rx
 across returns to the main loop (#2 above).
  * Not sure about other devices classes (e.g. USB).

 How should the MemoryListener callback work?  For block I/O it may not
 be possible to cancel pending I/O asynchronously - if you try to
 cancel then your thread may block until the I/O completes.

For the current code, I think we block on the listener to wait for the
flush to complete. But after the mr->ops ref/unref patchset is
accepted, we can release the ref of the RAM device once we are done
with it (a very raw idea that still needs improvement).
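
Something of this shape is what I have in mind (a very rough sketch;
the callback name is made up, and whether bdrv_flush_all() is the
right way to drain is exactly the open question):

    static void ram_unplug_region_del(MemoryListener *listener,
                                      MemoryRegionSection *section)
    {
        if (memory_region_is_ram(section->mr)) {
            /* drain in-flight zero-copy I/O before the RAM goes away */
            bdrv_flush_all();
        }
    }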

 e.g. paio_cancel does this?
 is there already an API to asynchronously cancel all in flight operations in a
 BlockDriverState? Afaict block_job_cancel refers to streaming jobs only and
 doesn't help here.

 Can we make the RAM unplug initiate async I/O cancellations, prevent further 
 I/Os,
 and only free the memory in a callback, after all DMA I/O to the associated 
 memory
 region has been cancelled or completed?

 Also iiuc the MemoryListener should be registered from users of
 cpu_physical_memory_map e.g. hw/virtio.c

Yes.
 By the way dimm_depopulate only frees the qemu memory on an ACPI _EJ request, 
 which
 means that a well-behaved guest will have already offlined the memory and is 
 not
 using it anymore. If the guest still uses the memory e.g. for a DMA buffer, 
 the
 logical memory offlining will fail and the _EJ/qemu memory freeing will never
 happen.

Yes.
 But in theory a malicious acpi guest driver could trigger _EJ requests to do 
 step
 3 above.

 Or perhaps the backing block driver can finish an I/O request for a zero-copy
 block device that the guest doesn't care for anymore? I 'll think about this a
 bit more.

The guest is one of the users of the dimm device, and the block layer is another.

Regards,
pingfan
 Synchronous cancel behavior is not workable since it can lead to poor
 latency or hangs in the guest.

 ok

 thanks,

 - Vasilis



Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction

2012-10-24 Thread liu ping fan
On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote:
 +static void dimm_populate(DimmDevice *s)
 +{
 +    DeviceState *dev = (DeviceState *)s;
 +    MemoryRegion *new = NULL;
 +
 +    new = g_malloc(sizeof(MemoryRegion));
 +    memory_region_init_ram(new, dev->id, s->size);
 +    vmstate_register_ram_global(new);
 +    memory_region_add_subregion(get_system_memory(), s->start, new);
 +    s->mr = new;
 +}
 +
 +static void dimm_depopulate(DimmDevice *s)
 +{
 +    assert(s);
 +    vmstate_unregister_ram(s->mr, NULL);
 +    memory_region_del_subregion(get_system_memory(), s->mr);
 +    memory_region_destroy(s->mr);
 +    s->mr = NULL;
 +}

 How is dimm hot unplug protected against callers who currently have RAM
 mapped (from cpu_physical_memory_map())?

 Emulated devices call cpu_physical_memory_map() directly or indirectly
 through DMA emulation code.  The RAM pointer may be held for arbitrary
 lengths of time, across main loop iterations, etc.

 It's not clear to me that it is safe to unplug a DIMM that has network
 or disk I/O buffers, for example.  We also need to be robust against
 malicious guests who abuse the hotplug lifecycle.  QEMU should never be
 left with dangling pointers.

Not sure about the block layer. But I think those threads are already
out of the big lock, so there should be a MemoryListener to catch the
RAM-unplug event and, if needed, bdrv_flush.

Regards,
pingfan
 Stefan



Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread liu ping fan
On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.

 - protecting MemoryRegion::opaque during dispatch

 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).

Is there log for this topic? Link?

Thanks,
pingfan
 --
 error compiling committee.c: too many arguments to function


Re: KVM call agenda for Tuesday, September 4th

2012-09-04 Thread liu ping fan
On Tue, Sep 4, 2012 at 4:21 PM, Avi Kivity a...@redhat.com wrote:
 On 09/04/2012 11:17 AM, liu ping fan wrote:
 On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 09:44 AM, Juan Quintela wrote:

 Hi

 Please send in any agenda items you are interested in covering.

 - protecting MemoryRegion::opaque during dispatch

 I'm guessing Ping won't make it due to timezone problems.  Jan, if you
 will not participate, please remove the topic from the list (unless
 someone else wants to argue your side).

 Is there log for this topic? Link?

 The call has not happened yet (it's in 5 hours 40 minutes), but Jan
 can't participate, so it will likely be cancelled.  If you can
 participate, maybe we can have it anyway.

Sorry, I can not attend.

 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-11 Thread liu ping fan
On Sat, Aug 11, 2012 at 9:58 AM, liu ping fan qemul...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 5:41 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Flatview and radix view are all under the protection of pointer.
 And this make sure the change of them seem to be atomic!

 The mr accessed by radix-tree leaf or flatview will be reclaimed
 after the prev PhysMap not in use any longer


 IMO this cleverness should come much later.  Let's first take care of
 dropping the big qemu lock, then make swithcing memory maps more efficient.

 The initial paths could look like:

   lookup:
  take mem_map_lock
  lookup
  take ref
  drop mem_map_lock

   update:
  take mem_map_lock (in core_begin)
  do updates
  drop memo_map_lock

 Later we can replace mem_map_lock with either a rwlock or (real) rcu.



  #if !defined(CONFIG_USER_ONLY)

 -static void phys_map_node_reserve(unsigned nodes)
 +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
  {
 -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
 +if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) {
  typedef PhysPageEntry Node[L2_SIZE];
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
 -  phys_map_nodes_nb + nodes);
 -phys_map_nodes = g_renew(Node, phys_map_nodes,
 - phys_map_nodes_nb_alloc);
 +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2, 16);
 +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc,
 +  map->phys_map_nodes_nb + nodes);
 +map->phys_map_nodes = g_renew(Node, map->phys_map_nodes,
 + map->phys_map_nodes_nb_alloc);
  }
  }

 Please have a patch that just adds the map parameter to all these
 functions.  This makes the later patch, that adds the copy, easier to read.

 +
 +void cur_map_update(PhysMap *next)
 +{
 +qemu_mutex_lock(&cur_map_lock);
 +physmap_put(cur_map);
 +cur_map = next;
 +smp_mb();
 +qemu_mutex_unlock(&cur_map_lock);
 +}

 IMO this can be mem_map_lock.

 If we take my previous suggestion:

   lookup:
  take mem_map_lock
  lookup
  take ref
  drop mem_map_lock

   update:
  take mem_map_lock (in core_begin)
  do updates
  drop memo_map_lock

 And update it to


   update:
  prepare next_map (in core_begin)
  do updates
  take mem_map_lock (in core_commit)
  switch maps
  drop mem_map_lock
  free old map


 Note the lookup path copies the MemoryRegionSection instead of
 referencing it.  Thus we can destroy the old map without worrying; the
 only pointers will point to MemoryRegions, which will be protected by
 the refcounts on their Objects.

 Just find there may be a leak here. If mrs points to subpage, then the
 subpage_t  could be crashed by destroy.
 To avoid such situation, we can walk down the chain to pin us on the
 Object based mr, but then we must expose the address convert in
 subpage_read() right here. Right?

Oh, I just read the code logic, and I think walking down the chain is
enough. And subpage_read/write() is bypassed, so there is no need to
fold in the addr translation.

Regards,
pingfan

 Regards,
 pingfan

 This can be easily switched to rcu:

   update:
  prepare next_map (in core_begin)
  do updates
  switch maps - rcu_assign_pointer
  call_rcu(free old map) (or synchronize_rcu; free old maps)

 Again, this should be done after the simplictic patch that enables
 parallel lookup but keeps just one map.



 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-10 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:00 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 09/08/2012 09:28, liu ping fan wrote:
  VCPU thread                        I/O thread
  =====================================================================
  get MMIO request
  rcu_read_lock()
  walk memory map
                                     qdev_unmap()
                                     lock_devtree()
                                     ...
                                     unlock_devtree
                                     unref dev -> refcnt=0, free enqueued
  ref()
 No ref() for dev here, while we have ref to flatview+radix in my patches.
 I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
 inc when it is added into mem view -- that is
 memory_region_add_subregion - memory_region_get() {
 if(atomic_add_and_return()) dev-ref++  }.
 So not until reclaimer of mem view, the dev's ref is hold by mem view.
 In a short word, rcu protect mem view, while device is protected by refcnt.

 But the RCU critical section should not include the whole processing of
 MMIO, only the walk of the memory map.

Yes, you are right. And I think cur_map_get() can easily be broken
into the style: lock; ref++; phys_page_find(); unlock.
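
i.e. something of this shape (a sketch only; the names are taken
loosely from this series):

    PhysMap *map;
    MemoryRegionSection *section;

    qemu_mutex_lock(&cur_map_lock);
    map = cur_map;
    physmap_get(map);               /* ref++ keeps the snapshot alive */
    section = phys_page_find(map, addr >> TARGET_PAGE_BITS);
    qemu_mutex_unlock(&cur_map_lock);
    /* ... dispatch using section ... */
    physmap_put(map);               /* drop the ref when done */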

 And in general I think this is a bit too tricky... I understand not
 adding refcounting to all of bottom halves, timers, etc., but if you are
 using a device you should have explicit ref/unref pairs.

Actually, there are pairs -- when a dev enters the mem view its ref is
increased, and when it leaves, the ref is decreased.
But as Avi has pointed out, the mr->refcnt introduces complexity and
no gain, so I will discard this design.

Thanks and regards,
pingfan

 Paolo


Re: [PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-10 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:40 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 09/08/2012 09:28, liu ping fan wrote:
  +static void
  +pci_e1000_unmap(PCIDevice *p)
  +{
  +/* DO NOT FREE anything!until refcnt=0 */
  +/* isolate from memory view */
  +}
 
  At least you need to call the superclass method.
 
 Refer to  0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch,
 we have the following sequence
 qdev_unmap-pci_unmap_device-pci_e1000_unmap.  So pci_e1000_unmap
 need not to do anything.

 But then this patch is unnecessary, isn't it?

Yes, it should be removed in the next version.

Regards,
pingfan


 Paolo



Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-10 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:18 PM, Avi Kivity a...@redhat.com wrote:
 On 08/09/2012 10:49 AM, Paolo Bonzini wrote:
 Il 09/08/2012 09:33, liu ping fan ha scritto:
 Yes, it is to defer destructors.
 See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch
 When MemoryRegion is _del_subregion from mem in updater, it may be
 still in use by reader -- radix or flatview, so defer its destructors
 to the reclaimer --phys_map_release(PhysMap *map)

 How are you sure that the reader is already out of its critical section
 by the time the reclaimer runs?

 If we have rcu, it could be elegant to do this.

 Yeah, I think inventing primitives is dangerous and difficult to review;
 and it may be difficult to replace it with proper call_rcu.

 You should probably make a proof-of-concept using liburcu.  Then we can
 decide how to implement RCU in a way that is portable enough for QEMU's
 needs.
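
(For reference, a liburcu proof-of-concept would be shaped roughly
like this -- a sketch only, assuming the userspace-rcu API plus the
PhysMap helpers of this series:

    #include <urcu.h>

    /* reader (MMIO lookup path) */
    rcu_read_lock();
    PhysMap *map = rcu_dereference(cur_map);
    /* ... walk map->root / flatview ... */
    rcu_read_unlock();

    /* updater (topology rebuild, serialized by the updater lock) */
    PhysMap *old = cur_map;
    rcu_assign_pointer(cur_map, next_map);
    synchronize_rcu();              /* no reader can still see old */
    physmap_put(old);
)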

 IMO we should start with a simple mutex (which will cover only the
 lookup and map rebuild).  This should reduce the contention to basically
 nothing (still leaving a cache line bounce).  If a profile shows the
 cache line bounce hurting us, or perhaps contention in ultralarge
 guests, then we should switch to rcu.

Agreed. I think this will pin down the major issue -- mmio performance

Regards,
pingfan

 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-10 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:38 PM, Avi Kivity a...@redhat.com wrote:
 On 08/09/2012 10:27 AM, liu ping fan wrote:
 On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Using refcnt for mr, so we can separate mr's life cycle management
 from refered object.
    When mr->ref goes 0->1, inc the referred object.
    When mr->ref goes 1->0, dec the referred object.

 The refered object can be DeviceStae, another mr, or other opaque.

 Please explain the motivation more fully.

Actually, the aim is to manage the reference of an object used by the
mem view. A DeviceState can be referred to by different subsystems;
when it comes into the view of a subsystem, we hold the dev's ref. Any
indirect reference will just do mr->ref++, not touch the dev's. This
helps us avoid walking down the referred chain, like
alias mr -> ... -> DeviceState.
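
i.e. roughly (a sketch; mr->owner stands in for whatever DeviceState
or other object the region refers to, and the 0<->1 transitions are
the point):

    static void memory_region_get(MemoryRegion *mr)
    {
        if (atomic_add_and_return(&mr->ref, 1) == 1) {
            object_ref(mr->owner);      /* 0->1: pin the referred object */
        }
    }

    static void memory_region_put(MemoryRegion *mr)
    {
        if (atomic_add_and_return(&mr->ref, -1) == 0) {
            object_unref(mr->owner);    /* 1->0: release it */
        }
    }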

 That is a lot of complexity, for no gain.  Manipulating memory regions
 is a slow path, and can be done under the bit qemu lock without any
 complications.

OK. I will discard this design.

 In the previous discussion, you have suggest add dev-ref++ in
 core_region_add.  But I think, if we can move it to higher layer --
 memory_region_{add,del}_subregion, so we can avoid to duplicate do
 this in other xx_region_add.

 Why would other memory listeners be impacted?  They all operate under
 the big qemu lock.  If they start using devices outside the lock, then
 they need to take a reference.

Yes, if the unplug path is under the protection of the big lock.
And just one extra question: for the ram-unplug scenario, how do we protect against
  updater: ram-unplug -> qemu free() -> brk() invalidates this vaddr interval
  reader: vhost-thread copies data from the interval
I guess they use something like a lock/ref, but I cannot find such a
mechanism in vhost_set_memory() to protect this scenario against
vhost_worker()

Thanks and regards,
pingfan

 As a payment for this, we need to handle alias which can be avoid at
 core_region_add().  And mr's ref can help to avoid
  the down-walk.

 The payment is two systems of reference counts.

 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-10 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:41 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Flatview and radix view are all under the protection of pointer.
 And this make sure the change of them seem to be atomic!

 The mr accessed by radix-tree leaf or flatview will be reclaimed
 after the prev PhysMap not in use any longer


 IMO this cleverness should come much later.  Let's first take care of
 dropping the big qemu lock, then make swithcing memory maps more efficient.

 The initial paths could look like:

   lookup:
  take mem_map_lock
  lookup
  take ref
  drop mem_map_lock

   update:
  take mem_map_lock (in core_begin)
  do updates
  drop memo_map_lock

 Later we can replace mem_map_lock with either a rwlock or (real) rcu.



  #if !defined(CONFIG_USER_ONLY)

 -static void phys_map_node_reserve(unsigned nodes)
 +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
  {
 -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
 +if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) {
  typedef PhysPageEntry Node[L2_SIZE];
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
 -  phys_map_nodes_nb + nodes);
 -phys_map_nodes = g_renew(Node, phys_map_nodes,
 - phys_map_nodes_nb_alloc);
 +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2, 16);
 +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc,
 +  map->phys_map_nodes_nb + nodes);
 +map->phys_map_nodes = g_renew(Node, map->phys_map_nodes,
 + map->phys_map_nodes_nb_alloc);
  }
  }

 Please have a patch that just adds the map parameter to all these
 functions.  This makes the later patch, that adds the copy, easier to read.

 +
 +void cur_map_update(PhysMap *next)
 +{
 +qemu_mutex_lock(&cur_map_lock);
 +physmap_put(cur_map);
 +cur_map = next;
 +smp_mb();
 +qemu_mutex_unlock(&cur_map_lock);
 +}

 IMO this can be mem_map_lock.

 If we take my previous suggestion:

   lookup:
  take mem_map_lock
  lookup
  take ref
  drop mem_map_lock

   update:
  take mem_map_lock (in core_begin)
  do updates
  drop memo_map_lock

 And update it to


   update:
  prepare next_map (in core_begin)
  do updates
  take mem_map_lock (in core_commit)
  switch maps
  drop mem_map_lock
  free old map


 Note the lookup path copies the MemoryRegionSection instead of
 referencing it.  Thus we can destroy the old map without worrying; the
 only pointers will point to MemoryRegions, which will be protected by
 the refcounts on their Objects.

Just found there may be a leak here. If the mrs points to a subpage,
then the subpage_t could be crashed by the destroy.
To avoid such a situation, we can walk down the chain to pin ourselves
on the Object-based mr, but then we must expose the address conversion
in subpage_read() right here. Right?

Regards,
pingfan

 This can be easily switched to rcu:

   update:
  prepare next_map (in core_begin)
  do updates
  switch maps - rcu_assign_pointer
  call_rcu(free old map) (or synchronize_rcu; free old maps)

 Again, this should be done after the simplictic patch that enables
 parallel lookup but keeps just one map.



 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Using refcnt for mr, so we can separate mr's life cycle management
 from refered object.
    When mr->ref goes 0->1, inc the referred object.
    When mr->ref goes 1->0, dec the referred object.

 The refered object can be DeviceStae, another mr, or other opaque.

 Please explain the motivation more fully.

Actually, the aim is to manage the reference of an object used by the
mem view. A DeviceState can be referred to by different subsystems;
when it comes into the view of a subsystem, we hold the dev's ref. Any
indirect reference will just do mr->ref++, not touch the dev's. This
helps us avoid walking down the referred chain, like
alias mr -> ... -> DeviceState.

In the previous discussion, you suggested adding dev->ref++ in
core_region_add. But I think if we move it to a higher layer --
memory_region_{add,del}_subregion -- we can avoid duplicating this in
every other xx_region_add. As payment for this, we need to handle
aliases, which could be avoided at core_region_add(). And the mr's ref
can help to avoid the down-walk.

Regards,
pingfan
 Usually a MemoryRegion will be embedded within some DeviceState, or its
 lifecycle will be managed by the DeviceState.  So long as we keep the
 DeviceState alive all associated MemoryRegions should be alive as well.
  Why not do this directly?


 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:42 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com


 Please explain the motivation.  AFAICT, the big qemu lock is sufficient.

Oh, this is one of a series of locks for the removal of the big qemu
lock. Removing the big lock will take several steps, including
introducing a device-private lock. By then, when the device add path
in the iothread and the remove path in io-dispatch are out of the big
qemu lock, we will need this extra lock.

This series is too big, so I sent out the 1st phase for review.

Regards,
pingfan

 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:41 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 08/08/2012 08:25, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  cpus.c  |   12 
  main-loop.h |3 +++
  2 files changed, 15 insertions(+), 0 deletions(-)

 diff --git a/cpus.c b/cpus.c
 index b182b3d..a734b36 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
  }
  #endif /* _WIN32 */

 +QemuMutex qemu_device_tree_mutex;
  QemuMutex qemu_global_mutex;
  static QemuCond qemu_io_proceeded_cond;
  static bool iothread_requesting_mutex;
 @@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
 qemu_cond_init(&qemu_work_cond);
 qemu_cond_init(&qemu_io_proceeded_cond);
 qemu_mutex_init(&qemu_global_mutex);
+qemu_mutex_init(&qemu_device_tree_mutex);

 qemu_thread_get_self(&io_thread);
  }
 @@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
 qemu_mutex_unlock(&qemu_global_mutex);
 }

+void qemu_lock_devtree(void)
+{
+qemu_mutex_lock(&qemu_device_tree_mutex);
+}
+
+void qemu_unlock_devtree(void)
+{
+qemu_mutex_unlock(&qemu_device_tree_mutex);
 +}

 We don't need the wrappers.  They are there for the big lock just
 because TCG needs extra work for iothread_requesting_mutex.

Sorry, could you give more detail about TCG -- what is the extra work?

Thanks,
pingfan

 Paolo

  static int all_vcpus_paused(void)
  {
  CPUArchState *penv = first_cpu;
 diff --git a/main-loop.h b/main-loop.h
 index dce1cd9..17e959a 100644
 --- a/main-loop.h
 +++ b/main-loop.h
 @@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
   */
  void qemu_mutex_unlock_iothread(void);

 +void qemu_lock_devtree(void);
 +void qemu_unlock_devtree(void);
 +
  /* internal interfaces */

  void qemu_fd_register(int fd);





Re: [PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:56 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 08/08/2012 08:25, Liu Ping Fan wrote:

 +static void
 +pci_e1000_unmap(PCIDevice *p)
 +{
 +/* DO NOT FREE anything! until refcnt=0 */
 +/* isolate from memory view */
 +}

 At least you need to call the superclass method.

Refer to 0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch:
we have the following sequence,
qdev_unmap -> pci_unmap_device -> pci_e1000_unmap.  So pci_e1000_unmap
need not do anything.
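
(The chain is roughly the following shape -- illustrative only, using
the unmap hooks assumed by this series:

    void qdev_unmap(DeviceState *dev)
    {
        DeviceClass *dc = DEVICE_GET_CLASS(dev);

        if (dc->unmap) {
            dc->unmap(dev);     /* for PCI devices: pci_unmap_device */
        }
    }

    /* pci_unmap_device() pulls the BARs out of the memory view and
     * then calls the PCIDeviceClass hook (k->unmap = pci_e1000_unmap),
     * which is intentionally empty for e1000. */
)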

Regards,
pingfan

 Paolo

  static int
  pci_e1000_uninit(PCIDevice *dev)
  {
 @@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data)
  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

  k->init = pci_e1000_init;
 +k->unmap = pci_e1000_unmap;
  k->exit = pci_e1000_uninit;
  k->romfile = "pxe-e1000.rom";
  k->vendor_id = PCI_VENDOR_ID_INTEL;




Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:52 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 08/08/2012 08:25, Liu Ping Fan wrote:
 +void qdev_unplug_complete(DeviceState *dev, Error **errp)
 +{
 +/* isolate from mem view */
 +qdev_unmap(dev);
 +qemu_lock_devtree();
 +/* isolate from device tree */
 +qdev_unset_parent(dev);
 +qemu_unlock_devtree();
 +object_unref(OBJECT(dev));

 Rather than deferring the free, you should defer the unref.  Otherwise
 the following can happen when you have real RCU access to the memory
 map on the read-side:

 VCPU thread                        I/O thread
 =====================================================================
 get MMIO request
 rcu_read_lock()
 walk memory map
                                    qdev_unmap()
                                    lock_devtree()
                                    ...
                                    unlock_devtree
                                    unref dev -> refcnt=0, free enqueued
 ref()

No ref() for the dev here, while we hold a ref to flatview+radix in my
patches. I use rcu to protect the referred radix+flatview+mr. As to the
dev, its ref is increased when it is added into the mem view -- that is,
memory_region_add_subregion -> memory_region_get() {
if (atomic_add_and_return(...)) dev->ref++; }.
So until the mem view is reclaimed, the dev's ref is held by the mem view.
In short, rcu protects the mem view, while the device is protected by refcnt.

 rcu_read_unlock()
                                    free()
 dangling pointer!

 If you defer the unref, you have instead

 VCPU thread                        I/O thread
 =====================================================================
 get MMIO request
 rcu_read_lock()
 walk memory map
                                    qdev_unmap()
                                    lock_devtree()
                                    ...
                                    unlock_devtree
                                    unref is enqueued
 ref() -> refcnt = 2
 rcu_read_unlock()
                                    unref() -> refcnt = 1
 unref() -> refcnt = 0

 So this also makes patch 14 unnecessary.

 Paolo

 +}




Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:17 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Using mem_map_lock to protect among updaters. So we can get the intact
 snapshot of mem topology -- FlatView & radix-tree.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |3 +++
  memory.c |   22 ++
  memory.h |2 ++
  3 files changed, 27 insertions(+), 0 deletions(-)

 diff --git a/exec.c b/exec.c
 index 8244d54..0e29ef9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
 The bottom level has pointers to MemoryRegionSections.  */
  static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };

 +QemuMutex mem_map_lock;
 +
  static void io_mem_init(void);
  static void memory_map_init(void);

 @@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
  #if !defined(CONFIG_USER_ONLY)
  memory_map_init();
  io_mem_init();
 +qemu_mutex_init(&mem_map_lock);

 I'd move this and the mutex to memory.c since there are no other uses.
 The mutex could be static then.

But the init entry is in exec.c, not memory.c.

Regards,
pingfan

  #endif
  }

 diff --git a/memory.c b/memory.c
 index aab4a31..5986532 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
  assert(memory_region_transaction_depth);
  --memory_region_transaction_depth;
  if (!memory_region_transaction_depth && memory_region_update_pending) {
 +qemu_mutex_lock(&mem_map_lock);
  memory_region_update_topology(NULL);
 +qemu_mutex_unlock(&mem_map_lock);
  }
  }

 @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool 
 log, unsigned client)
  {
  uint8_t mask = 1 << client;

 +qemu_mutex_lock(&mem_map_lock);
  mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }

  bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
 @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
  void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
  {
  if (mr->readonly != readonly) {
 +qemu_mutex_lock(&mem_map_lock);
  mr->readonly = readonly;
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }
  }

 @@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable)
  {
  if (mr->readable != readable) {
  mr->readable = readable;
 +qemu_mutex_lock(&mem_map_lock);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }
  }

 @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
  };
  unsigned i;

 +qemu_mutex_lock(&mem_map_lock);
  for (i = 0; i < mr->ioeventfd_nb; ++i) {
  if (memory_region_ioeventfd_before(mrfd, mr->ioeventfds[i])) {
  break;
 @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
  sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i));
  mr->ioeventfds[i] = mrfd;
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }

  void memory_region_del_eventfd(MemoryRegion *mr,
 @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
  };
  unsigned i;

 +qemu_mutex_lock(&mem_map_lock);
  for (i = 0; i < mr->ioeventfd_nb; ++i) {
  if (memory_region_ioeventfd_equal(mrfd, mr->ioeventfds[i])) {
  break;
 @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
  mr->ioeventfds = g_realloc(mr->ioeventfds,
                             sizeof(*mr->ioeventfds)*mr->ioeventfd_nb + 1);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }

  static void memory_region_add_subregion_common(MemoryRegion *mr,
 @@ -1259,6 +1271,8 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
  assert(!subregion->parent);
  subregion->parent = mr;
  subregion->addr = offset;
 +
 +qemu_mutex_lock(&mem_map_lock);
  QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
  if (subregion->may_overlap || other->may_overlap) {
  continue;
 @@ -1289,6 +1303,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
  QTAILQ_INSERT_TAIL(&mr->subregions, subregion, subregions_link);
  done:
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock);
  }


 @@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
  {
  assert(subregion->parent == mr);
  subregion->parent = NULL;
 +
 +qemu_mutex_lock(&mem_map_lock);
  QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
  memory_region_update_topology(mr);
 +qemu_mutex_unlock(&mem_map_lock

Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:13 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Using mem_map_lock to protect among updaters. So we can get the intact
 snapshot of mem topology -- FlatView & radix-tree.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |3 +++
  memory.c |   22 ++
  memory.h |2 ++
  3 files changed, 27 insertions(+), 0 deletions(-)

 diff --git a/exec.c b/exec.c
 index 8244d54..0e29ef9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
 The bottom level has pointers to MemoryRegionSections.  */
  static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };

 +QemuMutex mem_map_lock;
 +
  static void io_mem_init(void);
  static void memory_map_init(void);

 @@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
  #if !defined(CONFIG_USER_ONLY)
  memory_map_init();
  io_mem_init();
 +qemu_mutex_init(&mem_map_lock);
  #endif
  }

 diff --git a/memory.c b/memory.c
 index aab4a31..5986532 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
  assert(memory_region_transaction_depth);
  --memory_region_transaction_depth;
  if (!memory_region_transaction_depth && memory_region_update_pending) {
 +qemu_mutex_lock(&mem_map_lock);
  memory_region_update_topology(NULL);
 +qemu_mutex_unlock(&mem_map_lock);
  }
  }

 Seems to me that nothing in memory.c can be susceptible to races.  It must
 already be called under the big qemu lock, and with the exception of
 mutators (memory_region_set_*), changes aren't directly visible.

Yes, what I want to do is prepare for unplug outside the protection of
the global lock. When io-dispatch and mmio-dispatch are both out of the
big lock, we will run into the following scenario:
In vcpu context A: qdev_unplug_complete() -> delete subregion;
In context B: write pci bar -> pci mapping update -> add subregion

 I think it's sufficient to take the mem_map_lock at the beginning of
 core_begin() and drop it at the end of core_commit().  That means all
 updates of volatile state, phys_map, are protected.
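
(That is, roughly -- a sketch assuming the MemoryListener begin/commit
callbacks:

    static void core_begin(MemoryListener *listener)
    {
        qemu_mutex_lock(&mem_map_lock);     /* all phys_map updates below */
    }

    static void core_commit(MemoryListener *listener)
    {
        qemu_mutex_unlock(&mem_map_lock);   /* updated map now visible */
    }
)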

The mem_map_lock is to protect both address_space_io and
address_space_memory. Without the protection of the big lock, races
will arise between the updaters (memory_region_{add,del}_subregion)
and the readers (generate_memory_topology() -> render_memory_region()).

If it is only taken in core_begin/commit, we will have to duplicate it
for each xx_begin/commit, right? And at the same time, mr->subregions
would be exposed under SMP without the big lock.

Thanks and regards,
pingfan



 --
 error compiling committee.c: too many arguments to function


Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:18 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 PhysMap contain the flatview and radix-tree view, they are snapshot
 of system topology and should be consistent. With PhysMap, we can
 swap the pointer when updating and achieve the atomic.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c   |8 
  memory.c |   33 -
  memory.h |   62 
 --
  3 files changed, 60 insertions(+), 43 deletions(-)

 diff --git a/exec.c b/exec.c
 index 0e29ef9..01b91b0 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -156,8 +156,6 @@ typedef struct PageDesc {
  #endif

  /* Size of the L2 (and L3, etc) page tables.  */

 Please copy this comment to the header file.

OK, thanks.
pingfan

 -#define L2_BITS 10
 -#define L2_SIZE (1 << L2_BITS)

  #define P_L2_LEVELS \
  (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
 @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
  static void *l1_map[V_L1_SIZE];

  #if !defined(CONFIG_USER_ONLY)
 -typedef struct PhysPageEntry PhysPageEntry;

  static MemoryRegionSection *phys_sections;
  static unsigned phys_sections_nb, phys_sections_nb_alloc;
 @@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty;
  static uint16_t phys_section_rom;
  static uint16_t phys_section_watch;

 -struct PhysPageEntry {
 -uint16_t is_leaf : 1;
 - /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
 -uint16_t ptr : 15;
 -};

  /* Simple allocator for PhysPageEntry nodes */
  static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
 diff --git a/memory.c b/memory.c
 index 2eaa2fc..c7f2cfd 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -31,17 +31,6 @@ static bool global_dirty_log = false;
  static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
  = QTAILQ_HEAD_INITIALIZER(memory_listeners);

 -typedef struct AddrRange AddrRange;
 -
 -/*
 - * Note using signed integers limits us to physical addresses at most
 - * 63 bits wide.  They are needed for negative offsetting in aliases
 - * (large MemoryRegion::alias_offset).
 - */
 -struct AddrRange {
 -Int128 start;
 -Int128 size;
 -};

  static AddrRange addrrange_make(Int128 start, Int128 size)
  {
 @@ -197,28 +186,6 @@ static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
   !memory_region_ioeventfd_before(b, a);
  }

 -typedef struct FlatRange FlatRange;
 -typedef struct FlatView FlatView;
 -
 -/* Range of memory in the global map.  Addresses are absolute. */
 -struct FlatRange {
 -MemoryRegion *mr;
 -target_phys_addr_t offset_in_region;
 -AddrRange addr;
 -uint8_t dirty_log_mask;
 -bool readable;
 -bool readonly;
 -};
 -
 -/* Flattened global view of current active memory hierarchy.  Kept in sorted
 - * order.
 - */
 -struct FlatView {
 -FlatRange *ranges;
 -unsigned nr;
 -unsigned nr_allocated;
 -};
 -
  typedef struct AddressSpace AddressSpace;
  typedef struct AddressSpaceOps AddressSpaceOps;

 diff --git a/memory.h b/memory.h
 index 740f018..357edd8 100644
 --- a/memory.h
 +++ b/memory.h
 @@ -29,12 +29,72 @@
  #include "qemu-thread.h"
  #include "qemu/reclaimer.h"

 +typedef struct AddrRange AddrRange;
 +typedef struct FlatRange FlatRange;
 +typedef struct FlatView FlatView;
 +typedef struct PhysPageEntry PhysPageEntry;
 +typedef struct PhysMap PhysMap;
 +typedef struct MemoryRegionSection MemoryRegionSection;
  typedef struct MemoryRegionOps MemoryRegionOps;
  typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
  typedef struct MemoryRegion MemoryRegion;
  typedef struct MemoryRegionPortio MemoryRegionPortio;
  typedef struct MemoryRegionMmio MemoryRegionMmio;

 +/*
 + * Note using signed integers limits us to physical addresses at most
 + * 63 bits wide.  They are needed for negative offsetting in aliases
 + * (large MemoryRegion::alias_offset).
 + */
 +struct AddrRange {
 +Int128 start;
 +Int128 size;
 +};
 +
 +/* Range of memory in the global map.  Addresses are absolute. */
 +struct FlatRange {
 +MemoryRegion *mr;
 +target_phys_addr_t offset_in_region;
 +AddrRange addr;
 +uint8_t dirty_log_mask;
 +bool readable;
 +bool readonly;
 +};
 +
 +/* Flattened global view of current active memory hierarchy.  Kept in sorted
 + * order.
 + */
 +struct FlatView {
 +FlatRange *ranges;
 +unsigned nr;
 +unsigned nr_allocated;
 +};
 +
 +struct PhysPageEntry {
 +uint16_t is_leaf:1;
 + /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
 +uint16_t ptr:15;
 +};
 +
 +#define L2_BITS 10
 +#define L2_SIZE (1 << L2_BITS)
 +/* This is a multi-level map on the physical address space.
 +   The bottom level has pointers to MemoryRegionSections.  */
 +struct PhysMap {
 +Atomic ref;
 +PhysPageEntry root;
 +PhysPageEntry (*phys_map_nodes)[L2_SIZE];

Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:23 AM, Blue Swirl blauwir...@gmail.com wrote:
 On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Flatview and radix view are all under the protection of pointer.
 And this make sure the change of them seem to be atomic!

 The mr accessed by radix-tree leaf or flatview will be reclaimed
 after the prev PhysMap not in use any longer

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  exec.c  |  303 
 +++---
  hw/vhost.c  |2 +-
  hw/xen_pt.c |2 +-
  kvm-all.c   |2 +-
  memory.c|   92 ++-
  memory.h|9 ++-
  vl.c|1 +
  xen-all.c   |2 +-
  8 files changed, 286 insertions(+), 127 deletions(-)

 diff --git a/exec.c b/exec.c
 index 01b91b0..97addb9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -24,6 +24,7 @@
  #include sys/mman.h
  #endif

 +#include qemu/atomic.h
  #include qemu-common.h
  #include cpu.h
  #include tcg.h
 @@ -35,6 +36,8 @@
  #include qemu-timer.h
  #include memory.h
  #include exec-memory.h
 +#include qemu-thread.h
 +#include qemu/reclaimer.h
  #if defined(CONFIG_USER_ONLY)
  #include qemu.h
  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
 @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE];

  #if !defined(CONFIG_USER_ONLY)

 -static MemoryRegionSection *phys_sections;
 -static unsigned phys_sections_nb, phys_sections_nb_alloc;
  static uint16_t phys_section_unassigned;
  static uint16_t phys_section_notdirty;
  static uint16_t phys_section_rom;
  static uint16_t phys_section_watch;

 -
 -/* Simple allocator for PhysPageEntry nodes */
 -static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
 -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
 -
  #define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1)

 -/* This is a multi-level map on the physical address space.
 -   The bottom level has pointers to MemoryRegionSections.  */
 -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
 -
 +static QemuMutex cur_map_lock;
 +static PhysMap *cur_map;
  QemuMutex mem_map_lock;
 +static PhysMap *next_map;

  static void io_mem_init(void);
  static void memory_map_init(void);
 @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index)

  #if !defined(CONFIG_USER_ONLY)

 -static void phys_map_node_reserve(unsigned nodes)
 +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
  {
 -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
 +if (map-phys_map_nodes_nb + nodes > map-phys_map_nodes_nb_alloc) {
  typedef PhysPageEntry Node[L2_SIZE];
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
 -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
 -  phys_map_nodes_nb + nodes);
 -phys_map_nodes = g_renew(Node, phys_map_nodes,
 - phys_map_nodes_nb_alloc);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2,
 +16);
 +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc,
 +  map-phys_map_nodes_nb + nodes);
 +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes,
 + map-phys_map_nodes_nb_alloc);
  }
  }

 -static uint16_t phys_map_node_alloc(void)
 +static uint16_t phys_map_node_alloc(PhysMap *map)
  {
  unsigned i;
  uint16_t ret;

 -ret = phys_map_nodes_nb++;
 +ret = map-phys_map_nodes_nb++;
  assert(ret != PHYS_MAP_NODE_NIL);
 -assert(ret != phys_map_nodes_nb_alloc);
 +assert(ret != map-phys_map_nodes_nb_alloc);
  for (i = 0; i < L2_SIZE; ++i) {
 -phys_map_nodes[ret][i].is_leaf = 0;
 -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
 +map-phys_map_nodes[ret][i].is_leaf = 0;
 +map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
  }
  return ret;
  }

 -static void phys_map_nodes_reset(void)
 -{
 -phys_map_nodes_nb = 0;
 -}
 -
 -
 -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t 
 *index,
 -target_phys_addr_t *nb, uint16_t leaf,
 +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp,
 +target_phys_addr_t *index,
 +target_phys_addr_t *nb,
 +uint16_t leaf,
  int level)
  {
  PhysPageEntry *p;
 @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, 
 target_phys_addr_t *index,
  target_phys_addr_t step = (target_phys_addr_t)1 << (level * L2_BITS);

  if (!lp-is_leaf  lp-ptr == PHYS_MAP_NODE_NIL) {
 -lp-ptr = phys_map_node_alloc();
 -p = phys_map_nodes[lp-ptr];
 +lp-ptr = phys_map_node_alloc(map);
 +p = map-phys_map_nodes[lp-ptr];

Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:15 PM, Avi Kivity a...@redhat.com wrote:
 On 08/08/2012 12:07 PM, Paolo Bonzini wrote:
 Il 08/08/2012 11:05, Avi Kivity ha scritto:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
  Collect unused object and release them at caller demand.
 
 Please explain the motivation for this patch.

 It's poor man RCU, I think?

 I thought that it was to defer destructors (finalizers) to a more
 suitable context.  But why is the unref context unsuitable?

Yes, it is to defer destructors.
See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch.
When a MemoryRegion is removed by _del_subregion from the memory view in
the updater, it may still be in use by a reader -- the radix tree or the
flatview -- so its destructor is deferred to the reclaimer,
phys_map_release(PhysMap *map).
If we had RCU, this could be done elegantly.

I think I should write this in the commit message here too, not wait
until the following patch.
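
For illustration only, a minimal standalone sketch of the defer-then-drain
idea in plain C. Deferred/defer_release/drain_deferred are made-up
stand-ins, not the series' types; the drain plays the role that
qemu_reclaimer() plays at the end of main_loop_wait():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void ReleaseHandler(void *opaque);

typedef struct Deferred {
    struct Deferred *next;
    void *opaque;
    ReleaseHandler *release;
} Deferred;

static Deferred *pending;   /* hypothetical global defer list */

static void defer_release(void *opaque, ReleaseHandler *release)
{
    Deferred *d = malloc(sizeof(*d));
    d->opaque = opaque;
    d->release = release;
    d->next = pending;
    pending = d;
}

/* run only at a quiescent point, i.e. when no reader can still hold
 * a pointer into the old object */
static void drain_deferred(void)
{
    while (pending) {
        Deferred *d = pending;
        pending = d->next;
        d->release(d->opaque);
        free(d);
    }
}

static void release_map(void *opaque)
{
    printf("releasing %s\n", (const char *)opaque);
    free(opaque);
}

int main(void)
{
    defer_release(strdup("old PhysMap"), release_map);
    /* ... readers drop their references to the old map here ... */
    drain_deferred();
    return 0;
}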

Regards, pingfan
 I don't see how it relates to RCU, where is the C and the U?

 Anyway the list eagerly awaits the explanation.

 --
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:35 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Collect unused object and release them at caller demand.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/qemu/reclaimer.h |   28 ++
  main-loop.c  |5 
  qemu-tool.c  |5 
  qom/Makefile.objs|2 +-
  qom/reclaimer.c  |   58 
 ++
  5 files changed, 97 insertions(+), 1 deletions(-)
  create mode 100644 include/qemu/reclaimer.h
  create mode 100644 qom/reclaimer.c

 diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
 new file mode 100644
 index 000..9307e93
 --- /dev/null
 +++ b/include/qemu/reclaimer.h
 @@ -0,0 +1,28 @@
 +/*
 + * QEMU reclaimer
 + *
 + * Copyright IBM, Corp. 2012
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + */
 +
 +#ifndef QEMU_RECLAIMER
 +#define QEMU_RECLAIMER
 +
 +typedef void ReleaseHandler(void *opaque);
 +typedef struct Chunk {
 +QLIST_ENTRY(Chunk) list;
 +void *opaque;
 +ReleaseHandler *release;
 +} Chunk;
 +
 +typedef struct ChunkHead {
 +struct Chunk *lh_first;
 +} ChunkHead;
 +
 +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
 *release);
 +void reclaimer_worker(ChunkHead *head);
 +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
 +void qemu_reclaimer(void);

 So enqueue is call_rcu and qemu_reclaimer marks a quiescent state +
 empties the pending call_rcu.

 But what's the difference between the two pairs of APIs?

I added the new pair for v2 to reclaim the resources of the mem view. And
yes, as you point out, I will delete the 2nd API, since it can easily be
substituted by the 1st one.
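
So the global-queue variant could just wrap the per-queue one, something
like this sketch (it assumes the QLIST_HEAD of reclaimer_list and
ChunkHead have the same layout, which they do by construction):

void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
{
    qemu_mutex_lock(&reclaimer_lock);
    reclaimer_enqueue((ChunkHead *)&reclaimer_list, opaque, release);
    qemu_mutex_unlock(&reclaimer_lock);
}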

 +#endif
 diff --git a/main-loop.c b/main-loop.c
 index eb3b6e6..be9d095 100644
 --- a/main-loop.c
 +++ b/main-loop.c
 @@ -26,6 +26,7 @@
  #include qemu-timer.h
  #include slirp/slirp.h
  #include main-loop.h
 +#include qemu/reclaimer.h

  #ifndef _WIN32

 @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
 them.  */
  qemu_bh_poll();

 +/* ref to device from iohandler/bh/timer do not obey the rules, so delay
 + * reclaiming until now.
 + */
 +qemu_reclaimer();
  return ret;
  }
 diff --git a/qemu-tool.c b/qemu-tool.c
 index 318c5fc..f5fe319 100644
 --- a/qemu-tool.c
 +++ b/qemu-tool.c
 @@ -21,6 +21,7 @@
  #include main-loop.h
  #include qemu_socket.h
  #include slirp/libslirp.h
 +#include qemu/reclaimer.h

  #include sys/time.h

 @@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
  {
  }

 +void qemu_reclaimer(void)
 +{
 +}
 +
  int use_icount;

  void qemu_clock_warp(QEMUClock *clock)
 diff --git a/qom/Makefile.objs b/qom/Makefile.objs
 index 5ef060a..a579261 100644
 --- a/qom/Makefile.objs
 +++ b/qom/Makefile.objs
 @@ -1,4 +1,4 @@
 -qom-obj-y = object.o container.o qom-qobject.o
 +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
  qom-obj-twice-y = cpu.o
  common-obj-y = $(qom-obj-twice-y)
  user-obj-y = $(qom-obj-twice-y)
 diff --git a/qom/reclaimer.c b/qom/reclaimer.c
 new file mode 100644
 index 000..6cb53e3
 --- /dev/null
 +++ b/qom/reclaimer.c
 @@ -0,0 +1,58 @@
 +/*
 + * QEMU reclaimer
 + *
 + * Copyright IBM, Corp. 2012
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + */
 +
 +#include qemu-common.h
 +#include qemu-thread.h
 +#include main-loop.h
 +#include qemu-queue.h
 +#include qemu/reclaimer.h
 +
 +static struct QemuMutex reclaimer_lock;
 +static QLIST_HEAD(rcl, Chunk) reclaimer_list;
 +
 +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
 *release)
 +{
 +Chunk *r = g_malloc0(sizeof(Chunk));
 +r-opaque = opaque;
 +r-release = release;
 +QLIST_INSERT_HEAD_RCU(head, r, list);
 +}

 No lock?
Yes, a lock is needed!  I will think about it more closely.

Thanks and regards,
pingfan
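
Something like the following, then -- a sketch against this patch's types,
not tested; and once both insert and remove run under reclaimer_lock, the
plain QLIST_INSERT_HEAD would do, as noted below:

void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
{
    Chunk *r = g_malloc0(sizeof(Chunk));

    r->opaque = opaque;
    r->release = release;
    qemu_mutex_lock(&reclaimer_lock);
    QLIST_INSERT_HEAD(head, r, list);
    qemu_mutex_unlock(&reclaimer_lock);
}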

 +void reclaimer_worker(ChunkHead *head)
 +{
 +Chunk *cur, *next;
 +
 +QLIST_FOREACH_SAFE(cur, head, list, next) {
 +QLIST_REMOVE(cur, list);
 +cur-release(cur-opaque);
 +g_free(cur);
 +}

 QLIST_REMOVE needs a lock too, so using the lockless
 QLIST_INSERT_HEAD_RCU is not necessary.

 +}
 +
 +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
 +{
 +Chunk *r = g_malloc0(sizeof(Chunk));
 +r-opaque = opaque;
 +r-release = release;
 +qemu_mutex_lock(reclaimer_lock);
 +QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list);
 +qemu_mutex_unlock(reclaimer_lock);
 +}
 +
 +
 +void qemu_reclaimer(void)
 +{
 +Chunk *cur, *next;
 +
 +QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) {
 +QLIST_REMOVE(cur, list);
 +cur-release(cur-opaque);
 +g_free(cur);
 +}

 Same here

[PATCH 02/15] qom: using atomic ops to re-implement object_ref

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/object.h |3 ++-
 qom/object.c  |   13 +
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/qemu/object.h b/include/qemu/object.h
index 8b17776..58db9d0 100644
--- a/include/qemu/object.h
+++ b/include/qemu/object.h
@@ -18,6 +18,7 @@
 #include stdint.h
 #include stdbool.h
 #include qemu-queue.h
+#include qemu/atomic.h
 
 struct Visitor;
 struct Error;
@@ -262,7 +263,7 @@ struct Object
 ObjectClass *class;
 GSList *interfaces;
 QTAILQ_HEAD(, ObjectProperty) properties;
-uint32_t ref;
+Atomic ref;
 Object *parent;
 };
 
diff --git a/qom/object.c b/qom/object.c
index 00bb3b0..822bdb7 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -378,7 +378,7 @@ void object_finalize(void *data)
 object_deinit(obj, ti);
 object_property_del_all(obj);
 
-g_assert(obj-ref == 0);
+g_assert(atomic_read(obj-ref) == 0);
 }
 
 Object *object_new_with_type(Type type)
@@ -405,7 +405,7 @@ Object *object_new(const char *typename)
 void object_delete(Object *obj)
 {
 object_unref(obj);
-g_assert(obj-ref == 0);
+g_assert(atomic_read(obj-ref) == 0);
 g_free(obj);
 }
 
@@ -639,16 +639,13 @@ GSList *object_class_get_list(const char *implements_type,
 
 void object_ref(Object *obj)
 {
-obj-ref++;
+atomic_inc(obj-ref);
 }
 
 void object_unref(Object *obj)
 {
-g_assert(obj-ref > 0);
-obj-ref--;
-
-/* parent always holds a reference to its children */
-if (obj-ref == 0) {
+g_assert(atomic_read(obj-ref) > 0);
+if (atomic_dec_and_test(obj-ref)) {
 object_finalize(obj);
 }
 }
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Collect unused object and release them at caller demand.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/reclaimer.h |   28 ++
 main-loop.c  |5 
 qemu-tool.c  |5 
 qom/Makefile.objs|2 +-
 qom/reclaimer.c  |   58 ++
 5 files changed, 97 insertions(+), 1 deletions(-)
 create mode 100644 include/qemu/reclaimer.h
 create mode 100644 qom/reclaimer.c

diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
new file mode 100644
index 000..9307e93
--- /dev/null
+++ b/include/qemu/reclaimer.h
@@ -0,0 +1,28 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_RECLAIMER
+#define QEMU_RECLAIMER
+
+typedef void ReleaseHandler(void *opaque);
+typedef struct Chunk {
+QLIST_ENTRY(Chunk) list;
+void *opaque;
+ReleaseHandler *release;
+} Chunk;
+
+typedef struct ChunkHead {
+struct Chunk *lh_first;
+} ChunkHead;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release);
+void reclaimer_worker(ChunkHead *head);
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
+void qemu_reclaimer(void);
+#endif
diff --git a/main-loop.c b/main-loop.c
index eb3b6e6..be9d095 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -26,6 +26,7 @@
 #include qemu-timer.h
 #include slirp/slirp.h
 #include main-loop.h
+#include qemu/reclaimer.h
 
 #ifndef _WIN32
 
@@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
them.  */
 qemu_bh_poll();
 
+/* ref to device from iohandler/bh/timer do not obey the rules, so delay
+ * reclaiming until now.
+ */
+qemu_reclaimer();
 return ret;
 }
diff --git a/qemu-tool.c b/qemu-tool.c
index 318c5fc..f5fe319 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@
 #include main-loop.h
 #include qemu_socket.h
 #include slirp/libslirp.h
+#include qemu/reclaimer.h
 
 #include sys/time.h
 
@@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
 {
 }
 
+void qemu_reclaimer(void)
+{
+}
+
 int use_icount;
 
 void qemu_clock_warp(QEMUClock *clock)
diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
 qom-obj-twice-y = cpu.o
 common-obj-y = $(qom-obj-twice-y)
 user-obj-y = $(qom-obj-twice-y)
diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 000..6cb53e3
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include qemu-common.h
+#include qemu-thread.h
+#include main-loop.h
+#include qemu-queue.h
+#include qemu/reclaimer.h
+
+static struct QemuMutex reclaimer_lock;
+static QLIST_HEAD(rcl, Chunk) reclaimer_list;
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+Chunk *r = g_malloc0(sizeof(Chunk));
+r-opaque = opaque;
+r-release = release;
+QLIST_INSERT_HEAD_RCU(head, r, list);
+}
+
+void reclaimer_worker(ChunkHead *head)
+{
+Chunk *cur, *next;
+
+QLIST_FOREACH_SAFE(cur, head, list, next) {
+QLIST_REMOVE(cur, list);
+cur-release(cur-opaque);
+g_free(cur);
+}
+}
+
+void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
+{
+Chunk *r = g_malloc0(sizeof(Chunk));
+r-opaque = opaque;
+r-release = release;
+qemu_mutex_lock(reclaimer_lock);
+QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list);
+qemu_mutex_unlock(reclaimer_lock);
+}
+
+
+void qemu_reclaimer(void)
+{
+Chunk *cur, *next;
+
+QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) {
+QLIST_REMOVE(cur, list);
+cur-release(cur-opaque);
+g_free(cur);
+}
+}
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using mem_map_lock to protect among updaters, so we can get an intact
snapshot of the memory topology -- the FlatView & radix tree.
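
For clarity, the resolved shape of one of the hunks below, i.e.
memory_region_set_readonly() after this patch:

void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
{
    if (mr->readonly != readonly) {
        qemu_mutex_lock(&mem_map_lock);
        mr->readonly = readonly;
        memory_region_update_topology(mr);  /* rebuild FlatView & radix tree */
        qemu_mutex_unlock(&mem_map_lock);
    }
}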

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |3 +++
 memory.c |   22 ++
 memory.h |2 ++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 8244d54..0e29ef9 100644
--- a/exec.c
+++ b/exec.c
@@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
The bottom level has pointers to MemoryRegionSections.  */
 static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
 
+QemuMutex mem_map_lock;
+
 static void io_mem_init(void);
 static void memory_map_init(void);
 
@@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
 #if !defined(CONFIG_USER_ONLY)
 memory_map_init();
 io_mem_init();
+qemu_mutex_init(mem_map_lock);
 #endif
 }
 
diff --git a/memory.c b/memory.c
index aab4a31..5986532 100644
--- a/memory.c
+++ b/memory.c
@@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
 assert(memory_region_transaction_depth);
 --memory_region_transaction_depth;
 if (!memory_region_transaction_depth  memory_region_update_pending) {
+qemu_mutex_lock(mem_map_lock);
 memory_region_update_topology(NULL);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, 
unsigned client)
 {
 uint8_t mask = 1 << client;
 
+qemu_mutex_lock(mem_map_lock);
 mr-dirty_log_mask = (mr-dirty_log_mask & ~mask) | (log * mask);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
@@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
 {
 if (mr-readonly != readonly) {
+qemu_mutex_lock(mem_map_lock);
 mr-readonly = readonly;
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion 
*mr, bool readable)
 {
 if (mr-readable != readable) {
 mr-readable = readable;
+qemu_mutex_lock(mem_map_lock);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 }
 
@@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
 };
 unsigned i;
 
+qemu_mutex_lock(mem_map_lock);
 for (i = 0; i < mr-ioeventfd_nb; ++i) {
 if (memory_region_ioeventfd_before(mrfd, mr-ioeventfds[i])) {
 break;
@@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
 sizeof(*mr-ioeventfds) * (mr-ioeventfd_nb-1 - i));
 mr-ioeventfds[i] = mrfd;
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_del_eventfd(MemoryRegion *mr,
@@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
 };
 unsigned i;
 
+qemu_mutex_lock(mem_map_lock);
 for (i = 0; i < mr-ioeventfd_nb; ++i) {
 if (memory_region_ioeventfd_equal(mrfd, mr-ioeventfds[i])) {
 break;
@@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
 mr-ioeventfds = g_realloc(mr-ioeventfds,
   sizeof(*mr-ioeventfds)*mr-ioeventfd_nb + 1);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 static void memory_region_add_subregion_common(MemoryRegion *mr,
@@ -1259,6 +1271,8 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 assert(!subregion-parent);
 subregion-parent = mr;
 subregion-addr = offset;
+
+qemu_mutex_lock(mem_map_lock);
 QTAILQ_FOREACH(other, mr-subregions, subregions_link) {
 if (subregion-may_overlap || other-may_overlap) {
 continue;
@@ -1289,6 +1303,7 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 QTAILQ_INSERT_TAIL(mr-subregions, subregion, subregions_link);
 done:
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 
@@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
 {
 assert(subregion-parent == mr);
 subregion-parent = NULL;
+
+qemu_mutex_lock(mem_map_lock);
 QTAILQ_REMOVE(mr-subregions, subregion, subregions_link);
 memory_region_update_topology(mr);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
@@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr, bool 
enabled)
 if (enabled == mr-enabled) {
 return;
 }
+qemu_mutex_lock(mem_map_lock);
 mr-enabled = enabled;
 memory_region_update_topology(NULL);
+qemu_mutex_unlock(mem_map_lock);
 }
 
 void memory_region_set_address(MemoryRegion *mr, target_phys_addr_t

[PATCH 0/15 v2] prepare unplug out of protection of global lock

2012-08-08 Thread Liu Ping Fan
background:
refer to the original plan posted by Marcelo Tosatti,
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html

prev version:
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03312.html


changes v1->v2

--introduce atomic ops
--introduce ref cnt for MemoryRegion
--make memory's flat view and radix tree toward rcu style.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/15] memory: introduce life_ops to MemoryRegion

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The type of object referred to by a MemoryRegion varies, e.g.
another mr, a DeviceState, or some other struct defined by a driver.
So the refer/unrefer operations may differ between drivers.

Using these ops, we can manage the backing object.
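
For example, a driver with its own opaque could supply something like the
following (MyDevState and its refcnt field are hypothetical, but
obj_ops/alias_ops/nops below follow the same shape):

static void mydev_mr_get(MemoryRegion *mr)
{
    MyDevState *s = mr->opaque;   /* hypothetical driver-defined struct */

    s->refcnt++;
}

static void mydev_mr_put(MemoryRegion *mr)
{
    MyDevState *s = mr->opaque;

    if (--s->refcnt == 0) {
        g_free(s);
    }
}

static MemoryRegionLifeOps mydev_life_ops = {
    .get = mydev_mr_get,
    .put = mydev_mr_put,
};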

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/ide/piix.c |6 ++--
 hw/pckbd.c|6 +++-
 hw/serial.c   |2 +-
 ioport.c  |3 +-
 memory.c  |   69 -
 memory.h  |   16 +
 6 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/hw/ide/piix.c b/hw/ide/piix.c
index f5a74c2..bdd70b1 100644
--- a/hw/ide/piix.c
+++ b/hw/ide/piix.c
@@ -93,11 +93,11 @@ static void bmdma_setup_bar(PCIIDEState *d)
 for(i = 0;i  2; i++) {
 BMDMAState *bm = d-bmdma[i];
 
-memory_region_init_io(bm-extra_io, piix_bmdma_ops, bm,
+memory_region_init_io_ext(bm-extra_io, piix_bmdma_ops, NULL, bm,
   piix-bmdma, 4);
 memory_region_add_subregion(d-bmdma_bar, i * 8, bm-extra_io);
-memory_region_init_io(bm-addr_ioport, bmdma_addr_ioport_ops, bm,
-  bmdma, 4);
+memory_region_init_io_ext(bm-addr_ioport, bmdma_addr_ioport_ops,
+  NULL, bm, bmdma, 4);
 memory_region_add_subregion(d-bmdma_bar, i * 8 + 4, 
bm-addr_ioport);
 }
 }
diff --git a/hw/pckbd.c b/hw/pckbd.c
index 69857ba..de3c46d 100644
--- a/hw/pckbd.c
+++ b/hw/pckbd.c
@@ -485,10 +485,12 @@ static int i8042_initfn(ISADevice *dev)
 isa_init_irq(dev, s-irq_kbd, 1);
 isa_init_irq(dev, s-irq_mouse, 12);
 
-memory_region_init_io(isa_s-io + 0, i8042_data_ops, s, i8042-data, 1);
+memory_region_init_io_ext(isa_s-io + 0, i8042_data_ops, NULL, s,
+i8042-data, 1);
 isa_register_ioport(dev, isa_s-io + 0, 0x60);
 
-memory_region_init_io(isa_s-io + 1, i8042_cmd_ops, s, i8042-cmd, 1);
+memory_region_init_io_ext(isa_s-io + 1, i8042_cmd_ops, NULL, s,
+i8042-cmd, 1);
 isa_register_ioport(dev, isa_s-io + 1, 0x64);
 
 s-kbd = ps2_kbd_init(kbd_update_kbd_irq, s);
diff --git a/hw/serial.c b/hw/serial.c
index a421d1e..e992c6a 100644
--- a/hw/serial.c
+++ b/hw/serial.c
@@ -794,7 +794,7 @@ static int serial_isa_initfn(ISADevice *dev)
 serial_init_core(s);
 qdev_set_legacy_instance_id(dev-qdev, isa-iobase, 3);
 
-memory_region_init_io(s-io, serial_io_ops, s, serial, 8);
+memory_region_init_io_ext(s-io, serial_io_ops, NULL, s, serial, 8);
 isa_register_ioport(dev, s-io, isa-iobase);
 return 0;
 }
diff --git a/ioport.c b/ioport.c
index 6e4ca0d..768e271 100644
--- a/ioport.c
+++ b/ioport.c
@@ -384,7 +384,8 @@ static void portio_list_add_1(PortioList *piolist,
  * Use an alias so that the callback is called with an absolute address,
  * rather than an offset relative to to start + off_low.
  */
-memory_region_init_io(region, ops, piolist-opaque, piolist-name,
+memory_region_init_io_ext(region, ops, NULL, piolist-opaque,
+  piolist-name,
   INT64_MAX);
 memory_region_init_alias(alias, piolist-name,
  region, start + off_low, off_high - off_low);
diff --git a/memory.c b/memory.c
index 5986532..80c7529 100644
--- a/memory.c
+++ b/memory.c
@@ -19,6 +19,7 @@
 #include bitops.h
 #include kvm.h
 #include assert.h
+#include hw/qdev.h
 
 #define WANT_EXEC_OBSOLETE
 #include exec-obsolete.h
@@ -799,6 +800,7 @@ static bool memory_region_wrong_endianness(MemoryRegion *mr)
 #endif
 }
 
+static MemoryRegionLifeOps nops;
 void memory_region_init(MemoryRegion *mr,
 const char *name,
 uint64_t size)
@@ -809,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
 if (size == UINT64_MAX) {
 mr-size = int128_2_64();
 }
+mr-life_ops = nops;
 mr-addr = 0;
 mr-subpage = false;
 mr-enabled = true;
@@ -931,6 +934,66 @@ static void memory_region_dispatch_write(MemoryRegion *mr,
   memory_region_write_accessor, mr);
 }
 
+static void mr_object_get(MemoryRegion *mr)
+{
+object_dynamic_cast_assert(OBJECT(mr-opaque), TYPE_DEVICE);
+object_ref(OBJECT(mr-opaque));
+}
+
+static void mr_object_put(MemoryRegion *mr)
+{
+object_unref(OBJECT(mr-opaque));
+}
+
+static MemoryRegionLifeOps obj_ops = {
+.get = mr_object_get,
+.put = mr_object_put,
+};
+
+static void mr_alias_get(MemoryRegion *mr)
+{
+}
+
+static void mr_alias_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps alias_ops = {
+.get = mr_alias_get,
+.put = mr_alias_put,
+};
+
+static void mr_nop_get(MemoryRegion *mr)
+{
+}
+
+static void mr_nop_put(MemoryRegion *mr)
+{
+}
+
+static MemoryRegionLifeOps nops = {
+.get = mr_nop_get,
+.put = mr_nop_put,
+};
+
+void memory_region_init_io_ext(MemoryRegion

[PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Using a refcnt for mr, so we can separate mr's life cycle management
from the referred object.
  When mr->ref goes 0->1, take a reference on the referred object.
  When mr->ref goes 1->0, drop the reference on the referred object.

The referred object can be a DeviceState, another mr, or some other opaque.
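
To illustrate the intended semantics, a sketch using the functions this
patch adds:

memory_region_get(mr);  /* ref 0->1: mr->life_ops->get(mr) pins the backing object */
memory_region_get(mr);  /* ref 1->2: no life_ops callback */
memory_region_put(mr);  /* ref 2->1: no life_ops callback */
memory_region_put(mr);  /* ref 1->0: mr->life_ops->put(mr) releases it */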

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |   18 ++
 memory.h |5 +
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 80c7529..5dc8b59 100644
--- a/memory.c
+++ b/memory.c
@@ -811,6 +811,7 @@ void memory_region_init(MemoryRegion *mr,
 if (size == UINT64_MAX) {
 mr-size = int128_2_64();
 }
+atomic_set(mr-ref, 0);
 mr-life_ops = nops;
 mr-addr = 0;
 mr-subpage = false;
@@ -1090,6 +1091,23 @@ static const MemoryRegionOps reservation_ops = {
 .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+void memory_region_get(MemoryRegion *mr)
+{
+if (atomic_add_and_return(1, mr-ref) == 1) {
+mr-life_ops-get(mr);
+}
+}
+
+void memory_region_put(MemoryRegion *mr)
+{
+assert(atomic_read(mr-ref) > 0);
+
+if (atomic_dec_and_test(mr-ref)) {
+/* to fix, using call_rcu( ,release) */
+mr-life_ops-put(mr);
+}
+}
+
 void memory_region_init_reservation(MemoryRegion *mr,
 const char *name,
 uint64_t size)
diff --git a/memory.h b/memory.h
index 8fb543b..740f018 100644
--- a/memory.h
+++ b/memory.h
@@ -18,6 +18,7 @@
 
 #include stdint.h
 #include stdbool.h
+#include qemu/atomic.h
 #include qemu-common.h
 #include cpu-common.h
 #include targphys.h
@@ -26,6 +27,7 @@
 #include ioport.h
 #include int128.h
 #include qemu-thread.h
+#include qemu/reclaimer.h
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
@@ -126,6 +128,7 @@ typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 struct MemoryRegion {
 /* All fields are private - violators will be prosecuted */
 const MemoryRegionOps *ops;
+Atomic ref;
 MemoryRegionLifeOps *life_ops;
 void *opaque;
 MemoryRegion *parent;
@@ -766,6 +769,8 @@ void memory_global_dirty_log_stop(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void memory_region_get(MemoryRegion *mr);
+void memory_region_put(MemoryRegion *mr);
 #endif
 
 #endif
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/15] atomic: introduce atomic operations

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

If we run outside of the global lock, we will be challenged by SMP at a
low level, so we need atomic ops.

This file is heavily copied from the kernel. Currently, only x86 atomic
ops are included; it will be extended to other archs in the future.
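
A quick usage sketch of the API this header provides, used as a reference
count -- the pattern the later patches apply to Object::ref and
MemoryRegion::ref (RefCounted and the thing_* names are made up):

#include "qemu/atomic.h"

typedef struct RefCounted {
    Atomic ref;
} RefCounted;

static void thing_init(RefCounted *t)
{
    atomic_set(&t->ref, 1);
}

static void thing_get(RefCounted *t)
{
    atomic_inc(&t->ref);
}

/* returns true exactly once, when the last reference is dropped */
static int thing_put(RefCounted *t)
{
    return atomic_dec_and_test(&t->ref);
}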

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/atomic.h |  161 +
 1 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 include/qemu/atomic.h

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 000..8e1fc3e
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,161 @@
+/*
+ * Simple interface for atomic operations.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
+
+typedef struct Atomic {
+int counter;
+} Atomic;
+
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically reads the value of @v.
+ */
+static inline int atomic_read(const Atomic *v)
+{
+    return (*(volatile int *)&(v)->counter);
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type Atomic
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static inline void atomic_set(Atomic *v, int i)
+{
+    v->counter = i;
+}
+
+/**
+ * atomic_add - add integer to atomic variable
+ * @i: integer value to add
+ * @v: pointer of type Atomic
+ *
+ * Atomically adds @i to @v.
+ */
+static inline void atomic_add(int i, Atomic *v)
+{
+    asm volatile("lock; addl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub - subtract integer from atomic variable
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v.
+ */
+static inline void atomic_sub(int i, Atomic *v)
+{
+    asm volatile("lock; subl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_sub_and_test(int i, Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; subl %2,%0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : "ir" (i) : "memory");
+    return c;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(Atomic *v)
+{
+    asm volatile("lock; incl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(Atomic *v)
+{
+    asm volatile("lock; decl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec_and_test - decrement and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static inline int atomic_dec_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; decl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+/**
+ * atomic_inc_and_test - increment and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_inc_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; incl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+static inline int atomic_add_and_return(int i, Atomic *v)
+{
+    int ret = i;
+
+    asm volatile ("lock; xaddl %0, %1"
+                  : "+r" (ret), "+m" (v->counter)
+                  : : "memory", "cc");
+
+    return ret + i;
+}
+#endif
+
+#endif
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/15] memory: inc/dec mr's ref when adding/removing from mem view

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

memory_region_{add,del}_subregion will inc/dec mr's refcnt.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/memory.c b/memory.c
index 5dc8b59..2eaa2fc 100644
--- a/memory.c
+++ b/memory.c
@@ -1356,7 +1356,7 @@ static void 
memory_region_add_subregion_common(MemoryRegion *mr,
 assert(!subregion-parent);
 subregion-parent = mr;
 subregion-addr = offset;
-
+memory_region_get(subregion);
 qemu_mutex_lock(mem_map_lock);
 QTAILQ_FOREACH(other, mr-subregions, subregions_link) {
 if (subregion-may_overlap || other-may_overlap) {
@@ -1420,6 +1420,8 @@ void memory_region_del_subregion(MemoryRegion *mr,
 qemu_mutex_lock(mem_map_lock);
 QTAILQ_REMOVE(mr-subregions, subregion, subregions_link);
 memory_region_update_topology(mr);
+/* mr may be still in use by reader of radix, must delay to release */
+memory_region_put(subregion);
 qemu_mutex_unlock(mem_map_lock);
 }
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

PhysMap contains the flatview and the radix-tree view; they are snapshots
of the system topology and must be consistent with each other. With
PhysMap, we can swap a single pointer when updating, making the update
atomic.
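
The update side then reduces to a publish-by-pointer-swap, roughly like
this sketch (publish_map/prev are made-up names; cur_map and
physmap_put() appear in the following patches):

static void publish_map(PhysMap *next)
{
    PhysMap *prev = cur_map;

    cur_map = next;     /* one word store: readers see old or new, never half */
    physmap_put(prev);  /* defer the free until no reader still holds prev */
}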

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c   |8 
 memory.c |   33 -
 memory.h |   62 --
 3 files changed, 60 insertions(+), 43 deletions(-)

diff --git a/exec.c b/exec.c
index 0e29ef9..01b91b0 100644
--- a/exec.c
+++ b/exec.c
@@ -156,8 +156,6 @@ typedef struct PageDesc {
 #endif
 
 /* Size of the L2 (and L3, etc) page tables.  */
-#define L2_BITS 10
-#define L2_SIZE (1 << L2_BITS)
 
 #define P_L2_LEVELS \
 (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1)
@@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask;
 static void *l1_map[V_L1_SIZE];
 
 #if !defined(CONFIG_USER_ONLY)
-typedef struct PhysPageEntry PhysPageEntry;
 
 static MemoryRegionSection *phys_sections;
 static unsigned phys_sections_nb, phys_sections_nb_alloc;
@@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty;
 static uint16_t phys_section_rom;
 static uint16_t phys_section_watch;
 
-struct PhysPageEntry {
-uint16_t is_leaf : 1;
- /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
-uint16_t ptr : 15;
-};
 
 /* Simple allocator for PhysPageEntry nodes */
 static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
diff --git a/memory.c b/memory.c
index 2eaa2fc..c7f2cfd 100644
--- a/memory.c
+++ b/memory.c
@@ -31,17 +31,6 @@ static bool global_dirty_log = false;
 static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
 = QTAILQ_HEAD_INITIALIZER(memory_listeners);
 
-typedef struct AddrRange AddrRange;
-
-/*
- * Note using signed integers limits us to physical addresses at most
- * 63 bits wide.  They are needed for negative offsetting in aliases
- * (large MemoryRegion::alias_offset).
- */
-struct AddrRange {
-Int128 start;
-Int128 size;
-};
 
 static AddrRange addrrange_make(Int128 start, Int128 size)
 {
@@ -197,28 +186,6 @@ static bool 
memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
  && !memory_region_ioeventfd_before(b, a);
 }
 
-typedef struct FlatRange FlatRange;
-typedef struct FlatView FlatView;
-
-/* Range of memory in the global map.  Addresses are absolute. */
-struct FlatRange {
-MemoryRegion *mr;
-target_phys_addr_t offset_in_region;
-AddrRange addr;
-uint8_t dirty_log_mask;
-bool readable;
-bool readonly;
-};
-
-/* Flattened global view of current active memory hierarchy.  Kept in sorted
- * order.
- */
-struct FlatView {
-FlatRange *ranges;
-unsigned nr;
-unsigned nr_allocated;
-};
-
 typedef struct AddressSpace AddressSpace;
 typedef struct AddressSpaceOps AddressSpaceOps;
 
diff --git a/memory.h b/memory.h
index 740f018..357edd8 100644
--- a/memory.h
+++ b/memory.h
@@ -29,12 +29,72 @@
 #include qemu-thread.h
 #include qemu/reclaimer.h
 
+typedef struct AddrRange AddrRange;
+typedef struct FlatRange FlatRange;
+typedef struct FlatView FlatView;
+typedef struct PhysPageEntry PhysPageEntry;
+typedef struct PhysMap PhysMap;
+typedef struct MemoryRegionSection MemoryRegionSection;
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionLifeOps MemoryRegionLifeOps;
 typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionPortio MemoryRegionPortio;
 typedef struct MemoryRegionMmio MemoryRegionMmio;
 
+/*
+ * Note using signed integers limits us to physical addresses at most
+ * 63 bits wide.  They are needed for negative offsetting in aliases
+ * (large MemoryRegion::alias_offset).
+ */
+struct AddrRange {
+Int128 start;
+Int128 size;
+};
+
+/* Range of memory in the global map.  Addresses are absolute. */
+struct FlatRange {
+MemoryRegion *mr;
+target_phys_addr_t offset_in_region;
+AddrRange addr;
+uint8_t dirty_log_mask;
+bool readable;
+bool readonly;
+};
+
+/* Flattened global view of current active memory hierarchy.  Kept in sorted
+ * order.
+ */
+struct FlatView {
+FlatRange *ranges;
+unsigned nr;
+unsigned nr_allocated;
+};
+
+struct PhysPageEntry {
+uint16_t is_leaf:1;
+ /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */
+uint16_t ptr:15;
+};
+
+#define L2_BITS 10
+#define L2_SIZE (1 << L2_BITS)
+/* This is a multi-level map on the physical address space.
+   The bottom level has pointers to MemoryRegionSections.  */
+struct PhysMap {
+Atomic ref;
+PhysPageEntry root;
+PhysPageEntry (*phys_map_nodes)[L2_SIZE];
+unsigned phys_map_nodes_nb;
+unsigned phys_map_nodes_nb_alloc;
+
+MemoryRegionSection *phys_sections;
+unsigned phys_sections_nb;
+unsigned phys_sections_nb_alloc;
+
+/* FlatView */
+FlatView views[2];
+};
+
 /* Must match *_DIRTY_FLAGS in cpu-all.h.  To be replaced with dynamic

[PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The flatview and radix view are both reached through a protected pointer,
which makes changes to them appear atomic!

The mr accessed through a radix-tree leaf or the flatview will be
reclaimed only after the previous PhysMap is no longer in use.
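
The reader side then follows a get/use/put discipline, e.g. this sketch
modeled on iotlb_to_region() in the tcg patch later in the series:

static MemoryRegion *lookup_region(target_phys_addr_t index)
{
    PhysMap *map = cur_map_get();   /* pin the current snapshot */
    MemoryRegion *mr = map->phys_sections[index & ~TARGET_PAGE_MASK].mr;

    physmap_put(map);               /* let a swapped-out snapshot be reclaimed */
    return mr;
}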

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c  |  303 +++---
 hw/vhost.c  |2 +-
 hw/xen_pt.c |2 +-
 kvm-all.c   |2 +-
 memory.c|   92 ++-
 memory.h|9 ++-
 vl.c|1 +
 xen-all.c   |2 +-
 8 files changed, 286 insertions(+), 127 deletions(-)

diff --git a/exec.c b/exec.c
index 01b91b0..97addb9 100644
--- a/exec.c
+++ b/exec.c
@@ -24,6 +24,7 @@
 #include sys/mman.h
 #endif
 
+#include qemu/atomic.h
 #include qemu-common.h
 #include cpu.h
 #include tcg.h
@@ -35,6 +36,8 @@
 #include qemu-timer.h
 #include memory.h
 #include exec-memory.h
+#include qemu-thread.h
+#include qemu/reclaimer.h
 #if defined(CONFIG_USER_ONLY)
 #include qemu.h
 #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE];
 
 #if !defined(CONFIG_USER_ONLY)
 
-static MemoryRegionSection *phys_sections;
-static unsigned phys_sections_nb, phys_sections_nb_alloc;
 static uint16_t phys_section_unassigned;
 static uint16_t phys_section_notdirty;
 static uint16_t phys_section_rom;
 static uint16_t phys_section_watch;
 
-
-/* Simple allocator for PhysPageEntry nodes */
-static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
-static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
-
 #define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1)
 
-/* This is a multi-level map on the physical address space.
-   The bottom level has pointers to MemoryRegionSections.  */
-static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
-
+static QemuMutex cur_map_lock;
+static PhysMap *cur_map;
 QemuMutex mem_map_lock;
+static PhysMap *next_map;
 
 static void io_mem_init(void);
 static void memory_map_init(void);
@@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 
 #if !defined(CONFIG_USER_ONLY)
 
-static void phys_map_node_reserve(unsigned nodes)
+static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
 {
-if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
+if (map-phys_map_nodes_nb + nodes > map-phys_map_nodes_nb_alloc) {
 typedef PhysPageEntry Node[L2_SIZE];
-phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
-phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
-  phys_map_nodes_nb + nodes);
-phys_map_nodes = g_renew(Node, phys_map_nodes,
- phys_map_nodes_nb_alloc);
+map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2,
+16);
+map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc,
+  map-phys_map_nodes_nb + nodes);
+map-phys_map_nodes = g_renew(Node, map-phys_map_nodes,
+ map-phys_map_nodes_nb_alloc);
 }
 }
 
-static uint16_t phys_map_node_alloc(void)
+static uint16_t phys_map_node_alloc(PhysMap *map)
 {
 unsigned i;
 uint16_t ret;
 
-ret = phys_map_nodes_nb++;
+ret = map-phys_map_nodes_nb++;
 assert(ret != PHYS_MAP_NODE_NIL);
-assert(ret != phys_map_nodes_nb_alloc);
+assert(ret != map-phys_map_nodes_nb_alloc);
 for (i = 0; i < L2_SIZE; ++i) {
-phys_map_nodes[ret][i].is_leaf = 0;
-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
+map-phys_map_nodes[ret][i].is_leaf = 0;
+map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
 }
 return ret;
 }
 
-static void phys_map_nodes_reset(void)
-{
-phys_map_nodes_nb = 0;
-}
-
-
-static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index,
-target_phys_addr_t *nb, uint16_t leaf,
+static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp,
+target_phys_addr_t *index,
+target_phys_addr_t *nb,
+uint16_t leaf,
 int level)
 {
 PhysPageEntry *p;
@@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, 
target_phys_addr_t *index,
 target_phys_addr_t step = (target_phys_addr_t)1 << (level * L2_BITS);
 
 if (!lp-is_leaf  lp-ptr == PHYS_MAP_NODE_NIL) {
-lp-ptr = phys_map_node_alloc();
-p = phys_map_nodes[lp-ptr];
+lp-ptr = phys_map_node_alloc(map);
+p = map-phys_map_nodes[lp-ptr];
 if (level == 0) {
 for (i = 0; i < L2_SIZE; i++) {
 p[i].is_leaf = 1;
@@ -434,7 +426,7 @@ static void phys_page_set_level(PhysPageEntry *lp, 
target_phys_addr_t *index,
 }
 }
 } else

[PATCH 10/15] memory: change tcg related code to using PhysMap

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Change the tcg code to use PhysMap.
This is separated from the previous patch for review purposes, and should
be merged into it.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 exec.c |   27 +--
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 97addb9..8d0dea5 100644
--- a/exec.c
+++ b/exec.c
@@ -1923,6 +1923,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
 {
 target_phys_addr_t iotlb;
 CPUWatchpoint *wp;
+PhysMap *map = cur_map_get();
 
 if (memory_region_is_ram(section-mr)) {
 /* Normal RAM.  */
@@ -1940,7 +1941,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
and avoid full address decoding in every device.
We can't use the high bits of pd for this because
IO_MEM_ROMD uses these as a ram address.  */
-iotlb = section - phys_sections;
+iotlb = section - map-phys_sections;
 iotlb += memory_region_section_addr(section, paddr);
 }
 
@@ -1956,6 +1957,7 @@ target_phys_addr_t 
memory_region_section_get_iotlb(CPUArchState *env,
 }
 }
 }
+physmap_put(map);
 
 return iotlb;
 }
@@ -3185,7 +3187,12 @@ static uint16_t dummy_section(PhysMap *map, MemoryRegion 
*mr)
 
 MemoryRegion *iotlb_to_region(target_phys_addr_t index)
 {
-return phys_sections[index & ~TARGET_PAGE_MASK].mr;
+MemoryRegion *ret;
+PhysMap *map = cur_map_get();
+
+ret = map-phys_sections[index & ~TARGET_PAGE_MASK].mr;
+physmap_put(map);
+return ret;
 }
 
 static void io_mem_init(void)
@@ -3946,13 +3953,14 @@ void stl_phys_notdirty(target_phys_addr_t addr, 
uint32_t val)
 {
 uint8_t *ptr;
 MemoryRegionSection *section;
+PhysMap *map = cur_map_get();
 
 section = phys_page_find(addr >> TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 io_mem_write(section-mr, addr, val, 4);
 } else {
@@ -3972,19 +3980,21 @@ void stl_phys_notdirty(target_phys_addr_t addr, 
uint32_t val)
 }
 }
 }
+physmap_put(map);
 }
 
 void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val)
 {
 uint8_t *ptr;
 MemoryRegionSection *section;
+PhysMap *map = cur_map_get();
 
 section = phys_page_find(addr >> TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #ifdef TARGET_WORDS_BIGENDIAN
 io_mem_write(section-mr, addr, val >> 32, 4);
@@ -3999,6 +4009,7 @@ void stq_phys_notdirty(target_phys_addr_t addr, uint64_t 
val)
+ memory_region_section_addr(section, addr));
 stq_p(ptr, val);
 }
+physmap_put(map);
 }
 
 /* warning: addr must be aligned */
@@ -4008,12 +4019,13 @@ static inline void stl_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 uint8_t *ptr;
 MemoryRegionSection *section;
 
+PhysMap *map = cur_map_get();
 section = phys_page_find(addr >> TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #if defined(TARGET_WORDS_BIGENDIAN)
 if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -4050,6 +4062,7 @@ static inline void stl_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 (0xff & ~CODE_DIRTY_FLAG));
 }
 }
+physmap_put(map);
 }
 
 void stl_phys(target_phys_addr_t addr, uint32_t val)
@@ -4081,12 +4094,13 @@ static inline void stw_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 uint8_t *ptr;
 MemoryRegionSection *section;
 
+PhysMap *map = cur_map_get();
 section = phys_page_find(addr >> TARGET_PAGE_BITS);
 
 if (!memory_region_is_ram(section-mr) || section-readonly) {
 addr = memory_region_section_addr(section, addr);
 if (memory_region_is_ram(section-mr)) {
-section = phys_sections[phys_section_rom];
+section = map-phys_sections[phys_section_rom];
 }
 #if defined(TARGET_WORDS_BIGENDIAN)
 if (endian == DEVICE_LITTLE_ENDIAN) {
@@ -4123,6 +4137,7 @@ static inline void stw_phys_internal(target_phys_addr_t 
addr, uint32_t val,
 (0xff & ~CODE_DIRTY_FLAG));
 }
 }
+physmap_put(map);
 }
 
 void

[PATCH 11/15] lock: introduce global lock for device tree

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 cpus.c  |   12 
 main-loop.h |3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..a734b36 100644
--- a/cpus.c
+++ b/cpus.c
@@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
 }
 #endif /* _WIN32 */
 
+QemuMutex qemu_device_tree_mutex;
 QemuMutex qemu_global_mutex;
 static QemuCond qemu_io_proceeded_cond;
 static bool iothread_requesting_mutex;
@@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
 qemu_cond_init(qemu_work_cond);
 qemu_cond_init(qemu_io_proceeded_cond);
 qemu_mutex_init(qemu_global_mutex);
+qemu_mutex_init(qemu_device_tree_mutex);
 
 qemu_thread_get_self(io_thread);
 }
@@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
 qemu_mutex_unlock(qemu_global_mutex);
 }
 
+void qemu_lock_devtree(void)
+{
+qemu_mutex_lock(qemu_device_tree_mutex);
+}
+
+void qemu_unlock_devtree(void)
+{
+qemu_mutex_unlock(qemu_device_tree_mutex);
+}
+
 static int all_vcpus_paused(void)
 {
 CPUArchState *penv = first_cpu;
diff --git a/main-loop.h b/main-loop.h
index dce1cd9..17e959a 100644
--- a/main-loop.h
+++ b/main-loop.h
@@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
  */
 void qemu_mutex_unlock_iothread(void);
 
+void qemu_lock_devtree(void);
+void qemu_unlock_devtree(void);
+
 /* internal interfaces */
 
 void qemu_fd_register(int fd);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/15] qdev: using devtree lock to protect device's accessing

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

lock:
  qemu_device_tree_mutex

competitors:
  --device_del (destruction of the device will be postponed until the
unplug ack from the guest),
  --pci hot-unplug
  --iteration (qdev_reset_all)
  --device_add

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/pci-hotplug.c  |4 
 hw/qdev-monitor.c |   17 -
 hw/qdev.c |2 ++
 3 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
index e7fb780..33a9dfe 100644
--- a/hw/pci-hotplug.c
+++ b/hw/pci-hotplug.c
@@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char 
*pci_addr)
 return -1;
 }
 
+qemu_lock_devtree();
 d = pci_find_device(pci_find_root_bus(dom), bus, PCI_DEVFN(slot, 0));
 if (!d) {
 monitor_printf(mon, slot %d empty\n, slot);
+qemu_unlock_devtree();
 return -1;
 }
 
@@ -275,9 +277,11 @@ static int pci_device_hot_remove(Monitor *mon, const char 
*pci_addr)
 if (error_is_set(local_err)) {
 monitor_printf(mon, %s\n, error_get_pretty(local_err));
 error_free(local_err);
+qemu_unlock_devtree();
 return -1;
 }
 
+qemu_unlock_devtree();
 return 0;
 }
 
diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
index 7915b45..2d47fe0 100644
--- a/hw/qdev-monitor.c
+++ b/hw/qdev-monitor.c
@@ -429,14 +429,18 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 
 /* find bus */
 path = qemu_opt_get(opts, bus);
+
+qemu_lock_devtree();
 if (path != NULL) {
 bus = qbus_find(path);
 if (!bus) {
+qemu_unlock_devtree();
 return NULL;
 }
 if (strcmp(object_get_typename(OBJECT(bus)), k-bus_type) != 0) {
 qerror_report(QERR_BAD_BUS_FOR_DEVICE,
   driver, object_get_typename(OBJECT(bus)));
+qemu_unlock_devtree();
 return NULL;
 }
 } else {
@@ -444,11 +448,13 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 if (!bus) {
 qerror_report(QERR_NO_BUS_FOR_DEVICE,
   driver, k-bus_type);
+qemu_unlock_devtree();
 return NULL;
 }
 }
 if (qdev_hotplug  !bus-allow_hotplug) {
 qerror_report(QERR_BUS_NO_HOTPLUG, bus-name);
+qemu_unlock_devtree();
 return NULL;
 }
 
@@ -466,6 +472,7 @@ DeviceState *qdev_device_add(QemuOpts *opts)
 }
 if (qemu_opt_foreach(opts, set_property, qdev, 1) != 0) {
 qdev_free(qdev);
+qemu_unlock_devtree();
 return NULL;
 }
 if (qdev-id) {
@@ -478,6 +485,8 @@ DeviceState *qdev_device_add(QemuOpts *opts)
   OBJECT(qdev), NULL);
 g_free(name);
 }
+qemu_unlock_devtree();
+
 if (qdev_init(qdev)  0) {
 qerror_report(QERR_DEVICE_INIT_FAILED, driver);
 return NULL;
@@ -600,13 +609,19 @@ void qmp_device_del(const char *id, Error **errp)
 {
 DeviceState *dev;
 
+/* protect against unplug ack from guest, where we really remove device
+ * from system
+ */
+qemu_lock_devtree();
 dev = qdev_find_recursive(sysbus_get_default(), id);
 if (NULL == dev) {
 error_set(errp, QERR_DEVICE_NOT_FOUND, id);
+qemu_unlock_devtree();
 return;
 }
-
+/* Just remove from system, and drop refcnt there */
 qdev_unplug(dev, errp);
+qemu_unlock_devtree();
 }
 
 void qdev_machine_init(void)
diff --git a/hw/qdev.c b/hw/qdev.c
index af54467..17525fe 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -230,7 +230,9 @@ static int qbus_reset_one(BusState *bus, void *opaque)
 
 void qdev_reset_all(DeviceState *dev)
 {
+qemu_lock_devtree();
 qdev_walk_children(dev, qdev_reset_one, qbus_reset_one, NULL);
+qemu_unlock_devtree();
 }
 
 void qbus_reset_all_fn(void *opaque)
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

When the guest confirms the removal of a device, we should
--unmap it from the MemoryRegion view
--isolate it from the device tree view

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/acpi_piix4.c |4 ++--
 hw/pci.c|   13 -
 hw/pci.h|2 ++
 hw/qdev.c   |   28 
 hw/qdev.h   |3 ++-
 5 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 0aace60..c209ff7 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -305,8 +305,8 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned 
slots)
 if (pc-no_hotplug) {
 slot_free = false;
 } else {
-object_unparent(OBJECT(dev));
-qdev_free(qdev);
+/* refcnt will be decreased */
+qdev_unplug_complete(qdev, NULL);
 }
 }
 }
diff --git a/hw/pci.c b/hw/pci.c
index 99a4304..2095abf 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -856,12 +856,22 @@ static int pci_unregister_device(DeviceState *dev)
 if (ret)
 return ret;
 
-pci_unregister_io_regions(pci_dev);
 pci_del_option_rom(pci_dev);
 do_pci_unregister_device(pci_dev);
 return 0;
 }
 
+static void pci_unmap_device(DeviceState *dev)
+{
+PCIDevice *pci_dev = PCI_DEVICE(dev);
+PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
+
+pci_unregister_io_regions(pci_dev);
+if (pc-unmap) {
+pc-unmap(pci_dev);
+}
+}
+
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
   uint8_t type, MemoryRegion *memory)
 {
@@ -2022,6 +2032,7 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 DeviceClass *k = DEVICE_CLASS(klass);
 k-init = pci_qdev_init;
 k-unplug = pci_unplug_device;
+k-unmap = pci_unmap_device;
 k-exit = pci_unregister_device;
 k-bus_type = TYPE_PCI_BUS;
 k-props = pci_props;
diff --git a/hw/pci.h b/hw/pci.h
index 79d38fd..1c5b909 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -145,6 +145,8 @@ typedef struct PCIDeviceClass {
 DeviceClass parent_class;
 
 int (*init)(PCIDevice *dev);
+void (*unmap)(PCIDevice *dev);
+
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
diff --git a/hw/qdev.c b/hw/qdev.c
index 17525fe..530eabe 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -104,6 +104,14 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus)
 bus_add_child(bus, dev);
 }
 
+static void qdev_unset_parent(DeviceState *dev)
+{
+BusState *b = dev-parent_bus;
+
+object_unparent(OBJECT(dev));
+bus_remove_child(b, dev);
+}
+
 /* Create a new device.  This only initializes the device state structure
and allows properties to be set.  qdev_init should be called to
initialize the actual device emulation.  */
@@ -194,6 +202,26 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
 dev->alias_required_for_version = required_for_version;
 }
 
+static int qdev_unmap(DeviceState *dev)
+{
+DeviceClass *dc = DEVICE_GET_CLASS(dev);
+if (dc->unmap) {
+dc->unmap(dev);
+}
+return 0;
+}
+
+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+/* isolate from mem view */
+qdev_unmap(dev);
+qemu_lock_devtree();
+/* isolate from device tree */
+qdev_unset_parent(dev);
+qemu_unlock_devtree();
+object_unref(OBJECT(dev));
+}
+
 void qdev_unplug(DeviceState *dev, Error **errp)
 {
 DeviceClass *dc = DEVICE_GET_CLASS(dev);
diff --git a/hw/qdev.h b/hw/qdev.h
index f4683dc..705635a 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -47,7 +47,7 @@ typedef struct DeviceClass {
 
 /* callbacks */
 void (*reset)(DeviceState *dev);
-
+void (*unmap)(DeviceState *dev);
 /* device state */
 const VMStateDescription *vmsd;
 
@@ -162,6 +162,7 @@ void qdev_init_nofail(DeviceState *dev);
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
  int required_for_version);
 void qdev_unplug(DeviceState *dev, Error **errp);
+void qdev_unplug_complete(DeviceState *dev, Error **errp);
 void qdev_free(DeviceState *dev);
 int qdev_simple_unplug_cb(DeviceState *dev);
 void qdev_machine_creation_done(void);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/15] qom: object_unref call reclaimer

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

iohandler/bh/timer may still use a DeviceState when its refcnt = 0;
postpone the reclaim until they are done with it.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 qom/object.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qom/object.c b/qom/object.c
index 822bdb7..1452b1b 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -23,6 +23,8 @@
 #include "qbool.h"
 #include "qint.h"
 #include "qstring.h"
+#include "hw/qdev.h"
+#include "qemu/reclaimer.h"
 
 #define MAX_INTERFACES 32
 
@@ -646,7 +648,12 @@ void object_unref(Object *obj)
 {
 g_assert(atomic_read(&obj->ref) > 0);
 if (atomic_dec_and_test(&obj->ref)) {
-object_finalize(obj);
+/* FIXME: maybe introduce obj->finalize to make this more elegant */
+if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) {
+qemu_reclaimer_enqueue(obj, object_finalize);
+} else {
+object_finalize(obj);
+}
 }
 }
 
-- 
1.7.4.4
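
qemu/reclaimer.h is not part of this excerpt; a minimal sketch of what the
deferred-release queue could look like, with names assumed from the call
above and from qemu_device_reclaimer() later in the series:

#include <glib.h>

typedef struct Object Object;
typedef void ReleaseFn(Object *obj);

typedef struct ReclaimEntry {
    Object *obj;
    ReleaseFn *release;
} ReclaimEntry;

static GSList *pending;   /* entries waiting for the main loop */

void qemu_reclaimer_enqueue(Object *obj, ReleaseFn *release)
{
    ReclaimEntry *e = g_new(ReclaimEntry, 1);

    e->obj = obj;
    e->release = release;
    pending = g_slist_prepend(pending, e);
}

/* Drained once per main-loop iteration, after iohandlers/bh/timers ran. */
void qemu_device_reclaimer(void)
{
    while (pending) {
        ReclaimEntry *e = pending->data;

        pending = g_slist_delete_link(pending, pending);
        e->release(e->obj);
        g_free(e);
    }
}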

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-08 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/e1000.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..fa71455 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1192,6 +1192,13 @@ e1000_cleanup(VLANClientState *nc)
 s->nic = NULL;
 }
 
+static void
+pci_e1000_unmap(PCIDevice *p)
+{
+/* DO NOT free anything until refcnt = 0! */
+/* isolate from memory view */
+}
+
 static int
 pci_e1000_uninit(PCIDevice *dev)
 {
@@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data)
 PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
 
 k->init = pci_e1000_init;
+k->unmap = pci_e1000_unmap;
 k->exit = pci_e1000_uninit;
 k->romfile = "pxe-e1000.rom";
 k->vendor_id = PCI_VENDOR_ID_INTEL;
-- 
1.7.4.4
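
The hook can stay empty for e1000 because its BARs are already torn down by
pci_unregister_io_regions() in the generic PCI unmap path. For a device that
maps extra regions itself, a hedged sketch of what the unmap hook would
contain (FooState and its mmio field are hypothetical, and the sketch
assumes the region was added to system memory):

typedef struct FooState {
    PCIDevice dev;
    MemoryRegion mmio;
} FooState;

static void pci_foo_unmap(PCIDevice *p)
{
    FooState *s = DO_UPCAST(FooState, dev, p);

    /* detach from the memory view only; free nothing until refcnt = 0 */
    memory_region_del_subregion(get_system_memory(), &s->mmio);
}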

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it

2012-07-26 Thread liu ping fan
On Wed, Jul 25, 2012 at 5:08 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 25/07/2012 05:31, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 rwlock:
   qemu_device_tree_mutex

 rd side:
   --device_del (destruction of the device will be postponed until the unplug
 ack from the guest),
   --pci hot-unplug
   --iteration (qdev_reset_all)

 wr side:
   --device_add

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  hw/pci-hotplug.c  |4 
  hw/qdev-monitor.c |   17 -
  hw/qdev.c |2 ++
  3 files changed, 22 insertions(+), 1 deletions(-)

 diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
 index e7fb780..b3b88c1 100644
 --- a/hw/pci-hotplug.c
 +++ b/hw/pci-hotplug.c
 @@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr)
  return -1;
  }

 +qemu_rwlock_rdlock_devtree();

 This is not defined anywhere; is a piece missing from the patch?

Oh, yes, I missed that patch. In it, these rwlock calls are just placeholders.
I see there has already been an attempt to implement an rwlock for qemu:
http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg00192.html
Is it time to introduce an rwlock for qemu?

Thanks,
pingfan


 Paolo
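
Until QEMU grows its own wrapper, a minimal sketch of the missing
placeholders, assuming plain pthread rwlocks:

#include <pthread.h>

static pthread_rwlock_t devtree_rwlock = PTHREAD_RWLOCK_INITIALIZER;

void qemu_rwlock_rdlock_devtree(void)
{
    pthread_rwlock_rdlock(&devtree_rwlock);  /* readers may run in parallel */
}

void qemu_rwlock_wrlock_devtree(void)
{
    pthread_rwlock_wrlock(&devtree_rwlock);  /* device_add excludes everyone */
}

void qemu_rwlock_unlock_devtree(void)
{
    pthread_rwlock_unlock(&devtree_rwlock);
}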

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching

2012-07-26 Thread liu ping fan
On Wed, Jul 25, 2012 at 5:18 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 25/07/2012 10:12, liu ping fan wrote:
  +qemu_rwlock_rdlock_devtree();
   section = phys_page_find(page >> TARGET_PAGE_BITS);
  +if (!(memory_region_is_ram(section->mr) ||
  +memory_region_is_romd(section->mr)) && !is_write) {
  +bk = get_backend(section->mr, addr);
  +object_ref(bk);
  +} else if (!memory_region_is_ram(section->mr) && is_write) {
  +bk = get_backend(section->mr, addr);
  +object_ref(bk);
  +}
  +qemu_rwlock_unlock_devtree();
 
   if (is_write) {
   if (!memory_region_is_ram(section->mr)) {
  @@ -3426,6 +3462,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
   io_mem_write(section->mr, addr1, val, 1);
   l = 1;
   }
  +object_unref(bk);
 
  Currently object_ref()/object_unref() are not atomic.  Will you send
 We obey the rule:
rdlock -> search -> ref_get,
wrlock -> remove -> ref_put
 So can it cause a problem if object_ref()/object_unref() are not atomic?

 Yes, two CPUs can perform object_ref at the same time.

 You can find a header file for atomic operations here:
 https://github.com/bonzini/qemu/commit/atomics.patch

Got it, thanks

 Paolo
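
The failure mode is two vCPUs both holding the devtree rdlock and doing
ref/unref on the same object at once; the rdlock only excludes the writer.
A sketch of the fix using gcc atomic builtins (the atomics header Paolo
links wraps the same primitives; Object and object_finalize() are as in
qom/object.c):

void object_ref(Object *obj)
{
    __sync_fetch_and_add(&obj->ref, 1);      /* atomic increment */
}

void object_unref(Object *obj)
{
    g_assert(obj->ref > 0);
    if (__sync_sub_and_fetch(&obj->ref, 1) == 0) {
        object_finalize(obj);                /* last reference gone */
    }
}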
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching

2012-07-26 Thread liu ping fan
On Wed, Jul 25, 2012 at 8:27 PM, Avi Kivity a...@redhat.com wrote:
 On 07/25/2012 01:58 PM, Avi Kivity wrote:
  while (len > 0) {
  page = addr & TARGET_PAGE_MASK;
  l = (page + TARGET_PAGE_SIZE) - addr;
  if (l > len)
  l = len;
 +
 +qemu_rwlock_rdlock_devtree();
  section = phys_page_find(page >> TARGET_PAGE_BITS);

 Does the devtree lock also protect the data structures accessed by
 phys_page_find()?  Seems wrong.

 The right way is to object_ref() in core_region_add() and object_unref()
 in core_region_del().  We're guaranteed that mr->object is alive during
 _add(), and DeviceClass::unmap() ensures that the extra ref doesn't
 block destruction.

OK, I see. I will try it that way.  But when
memory_region_destroy()->...->core_region_del() runs, should we reset
lp.ptr to phys_section_unassigned? Otherwise, when using the removed
target_phys_addr_t, we would still get a pointer to an invalid
MemoryRegion.

Thanx, pingfan
 --
 error compiling committee.c: too many arguments to function
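
A sketch of Avi's suggestion, with assumed names (mr->object is the owner
field he refers to, not necessarily its final name; the callback signature
matches this era's MemoryListener):

static void core_region_add(MemoryListener *listener,
                            MemoryRegionSection *section)
{
    if (section->mr->object) {
        object_ref(section->mr->object);    /* pin owner while mapped */
    }
}

static void core_region_del(MemoryListener *listener,
                            MemoryRegionSection *section)
{
    if (section->mr->object) {
        object_unref(section->mr->object);  /* may be the final unref */
    }
}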


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it

2012-07-26 Thread liu ping fan
On Thu, Jul 26, 2012 at 9:00 PM, Avi Kivity a...@redhat.com wrote:
 On 07/26/2012 03:56 PM, liu ping fan wrote:
 On Wed, Jul 25, 2012 at 5:08 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 25/07/2012 05:31, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 rwlock:
   qemu_device_tree_mutex

 rd side:
   --device_del (destruction of the device will be postponed until the unplug
  ack from the guest),
   --pci hot-unplug
   --iteration (qdev_reset_all)

 wr side:
   --device_add


  This is not defined anywhere; is a piece missing from the patch?

 Oh, yes, I missed that patch. In it, these rwlock calls are just
 placeholders.
 I see there has already been an attempt to implement an rwlock for qemu:
 http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg00192.html
 Is it time to introduce an rwlock for qemu?


 From the description above, I don't see why it can't be a mutex.

Searching the device tree (or the MemoryRegion view) can often happen in
parallel, especially on the mmio-dispatch code path.

Thanx,  pingfan
 --
 error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it

2012-07-26 Thread liu ping fan
On Thu, Jul 26, 2012 at 9:15 PM, Avi Kivity a...@redhat.com wrote:
 On 07/26/2012 04:14 PM, liu ping fan wrote:

 From the description above, I don't see why it can't be a mutex.

 Searching the device tree (or the MemoryRegion view) can often happen in
 parallel, especially on the mmio-dispatch code path

 In mmio dispatch we have a pointer to the object; we don't need to
 search anything.  Is device tree search a hot path?

I think we need a lock to protect the search (phys_page_find()) from the
deleter (DeviceClass::unmap), so an rwlock?

 --
 error compiling committee.c: too many arguments to function


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching

2012-07-25 Thread liu ping fan
On Wed, Jul 25, 2012 at 3:43 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Wed, Jul 25, 2012 at 4:31 AM, Liu Ping Fan qemul...@gmail.com wrote:
 @@ -3396,13 +3420,25 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
  uint32_t val;
  target_phys_addr_t page;
  MemoryRegionSection *section;
 +Object *bk;

  while (len > 0) {
  page = addr & TARGET_PAGE_MASK;
  l = (page + TARGET_PAGE_SIZE) - addr;
  if (l > len)
  l = len;
 +
 +qemu_rwlock_rdlock_devtree();
  section = phys_page_find(page >> TARGET_PAGE_BITS);
 +if (!(memory_region_is_ram(section->mr) ||
 +memory_region_is_romd(section->mr)) && !is_write) {
 +bk = get_backend(section->mr, addr);
 +object_ref(bk);
 +} else if (!memory_region_is_ram(section->mr) && is_write) {
 +bk = get_backend(section->mr, addr);
 +object_ref(bk);
 +}
 +qemu_rwlock_unlock_devtree();

  if (is_write) {
  if (!memory_region_is_ram(section->mr)) {
 @@ -3426,6 +3462,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
  io_mem_write(section->mr, addr1, val, 1);
  l = 1;
  }
 +object_unref(bk);

 Currently object_ref()/object_unref() are not atomic.  Will you send

We obey the rule:
   rdlock -> search -> ref_get,
   wrlock -> remove -> ref_put
So can it cause a problem if object_ref()/object_unref() are not atomic?
Thanx, pingfan
 another patch to perform atomic increment/decrement or how will
 per-device synchronization work?

 Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop

2012-07-25 Thread liu ping fan
On Wed, Jul 25, 2012 at 3:37 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 25/07/2012 09:03, Stefan Hajnoczi wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 iohandler/bh/timer may still use a DeviceState when its refcnt = 0;

 It's not clear to me how.  The only reference to devices from an
 iohandler/bh/timer can be in the opaque.  Now, if you have an
 iohandler/bh/timer whose opaque is a DeviceState, you should bump the
 refcount before setting it up, and unref after tearing it down.

Yes, I admit refcnt is a good solution, but I think it means we would
have to fix it for each device's bh, and that way seems tedious.

Thanx pingfan
 See for example how hw/scsi-disk.c bumps the refcount of a request
 before starting asynchronous I/O and calls unref at the end of the
 asynchronous I/O callback.

 Paolo
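
Paolo's pattern, sketched for a bottom half whose opaque is a device
(MyDeviceState and its bh field are hypothetical):

typedef struct MyDeviceState {
    DeviceState qdev;
    QEMUBH *bh;
} MyDeviceState;

static void my_dev_bh(void *opaque)
{
    MyDeviceState *s = opaque;

    /* ... do the actual work on s ... */
    object_unref(OBJECT(s));   /* drop the ref taken at schedule time */
}

static void my_dev_kick(MyDeviceState *s)
{
    object_ref(OBJECT(s));     /* keep s alive until the bh has run */
    qemu_bh_schedule(s->bh);
}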
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop

2012-07-25 Thread liu ping fan
On Wed, Jul 25, 2012 at 3:03 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Wed, Jul 25, 2012 at 4:31 AM, Liu Ping Fan qemul...@gmail.com wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 iohandler/bh/timer may still use a DeviceState when its refcnt = 0;
 postpone the reclaim until they are done with it.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/qemu/object.h |2 +-
  main-loop.c   |4 
  main-loop.h   |2 ++
  qemu-tool.c   |4 
  qom/Makefile.objs |2 +-
  qom/object.c  |7 ++-
  qom/reclaimer.c   |   41 +
  7 files changed, 59 insertions(+), 3 deletions(-)
  create mode 100644 qom/reclaimer.c

 diff --git a/include/qemu/object.h b/include/qemu/object.h
 index 8b17776..b233ee4 100644
 --- a/include/qemu/object.h
 +++ b/include/qemu/object.h
 @@ -958,5 +958,5 @@ int object_child_foreach(Object *obj, int (*fn)(Object *child, void *opaque),
   */
  Object *container_get(Object *root, const char *path);

 -
 +void qemu_reclaimer_enqueue(Object *obj);
  #endif
 diff --git a/main-loop.c b/main-loop.c
 index eb3b6e6..f9cecc5 100644
 --- a/main-loop.c
 +++ b/main-loop.c
 @@ -505,5 +505,9 @@ int main_loop_wait(int nonblocking)
 them.  */
  qemu_bh_poll();

 +/* refs to a device from iohandler/bh/timer do not obey the rules, so delay
 + * reclaiming until now.
 + */
 +qemu_device_reclaimer();
  return ret;
  }
 diff --git a/main-loop.h b/main-loop.h
 index cedddf5..1a59a6d 100644
 --- a/main-loop.h
 +++ b/main-loop.h
 @@ -367,4 +367,6 @@ void qemu_bh_schedule_idle(QEMUBH *bh);
  int qemu_bh_poll(void);
  void qemu_bh_update_timeout(uint32_t *timeout);

 +void qemu_device_reclaimer(void);
 +
  #endif
 diff --git a/qemu-tool.c b/qemu-tool.c
 index 318c5fc..34d959b 100644
 --- a/qemu-tool.c
 +++ b/qemu-tool.c
 @@ -75,6 +75,10 @@ void qemu_mutex_unlock_iothread(void)
  {
  }

 +void qemu_device_reclaimer(void)
 +{
 +}
 +
  int use_icount;

  void qemu_clock_warp(QEMUClock *clock)
 diff --git a/qom/Makefile.objs b/qom/Makefile.objs
 index 5ef060a..a579261 100644
 --- a/qom/Makefile.objs
 +++ b/qom/Makefile.objs
 @@ -1,4 +1,4 @@
 -qom-obj-y = object.o container.o qom-qobject.o
 +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
  qom-obj-twice-y = cpu.o
  common-obj-y = $(qom-obj-twice-y)
  user-obj-y = $(qom-obj-twice-y)
 diff --git a/qom/object.c b/qom/object.c
 index 00bb3b0..227d966 100644
 --- a/qom/object.c
 +++ b/qom/object.c
 @@ -649,7 +649,12 @@ void object_unref(Object *obj)

  /* parent always holds a reference to its children */
  if (obj->ref == 0) {
 -object_finalize(obj);
 +/* FIXME: maybe introduce obj->finalize to make this more elegant */
 +if (object_dynamic_cast(obj, "device") != NULL) {

 hw/qdev.h:#define TYPE_DEVICE "device"

 This should be object_dynamic_cast(obj, TYPE_DEVICE).

Yes, thanks.
 Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

