Re: [RFC 01/11] sched: introduce sys_cpumask in tsk to adapt asymmetric system
On Wed, Nov 12, 2014 at 5:22 PM, Srikar Dronamraju
<sri...@linux.vnet.ibm.com> wrote:
> * kernelf...@gmail.com <kernelf...@gmail.com> [2014-10-16 15:29:50]:
>
>> On some systems such as powerpc, certain tasks (vcpu threads) can only
>> run on dedicated cpus, since we adopt an asymmetric method to manage
>> the physical cpus (PowerKVM only allows the primary hwthread to set up
>> the runtime env for the secondaries when entering the guest). Nowadays
>> PowerKVM runs with all the secondary hwthreads offline to ensure that
>> the vcpu threads only run on the primary thread. But we plan to keep
>> all cpus online when running PowerKVM, to give more compute power when
>> switching back to the host, so introduce a sys_allowed cpumask to
>> reflect the cpuset on which a vcpu thread is allowed to run.
>>
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  include/linux/init_task.h |  1 +
>>  include/linux/sched.h     |  6 ++++++
>>  kernel/sched/core.c       | 10 ++++++--
>>  3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
>> index 2bb4c4f3..c56f69e 100644
>> --- a/include/linux/init_task.h
>> +++ b/include/linux/init_task.h
>> @@ -172,6 +172,7 @@ extern struct task_group root_task_group;
>>  	.normal_prio	= MAX_PRIO-20,			\
>>  	.policy		= SCHED_NORMAL,			\
>>  	.cpus_allowed	= CPU_MASK_ALL,			\
>> +	.sys_allowed	= CPU_MASK_ALL,			\
>
> Do we really need another mask, cant we just use cpus_allowed itself?

I think it is not easy to cast the two requests -- the chip's constraint
and the user's configuration -- onto one mask.

>>  	.nr_cpus_allowed= NR_CPUS,			\
>>  	.mm		= NULL,				\
>>  	.active_mm	= &init_mm,			\
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5c2c885..ce429f3 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1260,7 +1260,10 @@ struct task_struct {
>>  	unsigned int policy;
>>  	int nr_cpus_allowed;
>> +	/* ANDed user and sys_allowed */
>>  	cpumask_t cpus_allowed;
>> +	/* due to asymmetry, some tasks can only run on certain cpus */
>> +	cpumask_t sys_allowed;
>>
>>  #ifdef CONFIG_PREEMPT_RCU
>>  	int rcu_read_lock_nesting;
>> @@ -2030,6 +2033,9 @@ static inline void tsk_restore_flags(struct task_struct *task,
>>  }
>>
>>  #ifdef CONFIG_SMP
>> +extern void set_cpus_sys_allowed(struct task_struct *p,
>> +				const struct cpumask *new_mask);
>> +
>>  extern void do_set_cpus_allowed(struct task_struct *p,
>>  				const struct cpumask *new_mask);
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ec1a286..2cd1ae3 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4596,13 +4596,19 @@ void init_idle(struct task_struct *idle, int cpu)
>>  }
>>
>>  #ifdef CONFIG_SMP
>> +void set_cpus_sys_allowed(struct task_struct *p,
>> +		const struct cpumask *new_mask)
>> +{
>> +	cpumask_copy(&p->sys_allowed, new_mask);
>> +}
>> +
>
> This function doesnt seem to be used anywhere... Not sure why it is
> introduced.

Not layered the patches well :( It is used later in the series.

Thx, Fan

>>  void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>>  {
>>  	if (p->sched_class && p->sched_class->set_cpus_allowed)
>>  		p->sched_class->set_cpus_allowed(p, new_mask);
>>
>> -	cpumask_copy(&p->cpus_allowed, new_mask);
>> -	p->nr_cpus_allowed = cpumask_weight(new_mask);
>> +	cpumask_and(&p->cpus_allowed, &p->sys_allowed, new_mask);
>> +	p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
>>  }
>>
>>  /*
>> --
>> 1.8.3.1
>
> --
> Thanks and Regards
> Srikar Dronamraju

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 03/11] powerpc: kvm: add interface to control kvm function on a core
On Mon, Oct 27, 2014 at 12:04 PM, Preeti U Murthy
<pre...@linux.vnet.ibm.com> wrote:
> Hi Liu,
>
> On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
>> When kvm is enabled on a core, we migrate all external irqs to the
>> primary thread, since currently the kvm irq logic is handled by the
>> primary hwthread.
>>
>> Todo: this patch lacks re-enabling of irqbalance when kvm is disabled
>> on the core.
>
> Why is a sysfs file introduced to trigger irq migration? Why is it not
> done during kvm module insert? And similarly spread interrupts when
> the module is removed? Isn't this a saner way?

Consider this scene: coreA and coreB -- we want to enable KVM on coreA
while keeping coreB unchanged. In fact, I try to achieve something
asymmetric on the platform. Do you think that is a justification?

>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/sysfs.c            | 39 ++++++++++++++++++++++
>>  arch/powerpc/sysdev/xics/xics-common.c | 12 +++++++
>>  2 files changed, 51 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
>> index 67fd2fd..a2595dd 100644
>> --- a/arch/powerpc/kernel/sysfs.c
>> +++ b/arch/powerpc/kernel/sysfs.c
>> @@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
>>  	if (cpu_has_feature(CPU_FTR_DSCR))
>>  		err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
>>  }
>> +
>> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
>> +#define NR_CORES	(CONFIG_NR_CPUS/threads_per_core)
>> +static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly;
>> +
>> +static ssize_t show_kvm_enable(struct device *dev,
>> +		struct device_attribute *attr, char *buf)
>> +{
>> +}
>> +
>> +static ssize_t __used store_kvm_enable(struct device *dev,
>> +		struct device_attribute *attr, const char *buf,
>> +		size_t count)
>> +{
>> +	struct cpumask stop_cpus;
>> +	unsigned long core, thr;
>> +
>> +	sscanf(buf, "%lx", &core);
>> +	if (core > NR_CORES)
>> +		return -1;
>> +	if (!test_bit(core, kvm_on_core))
>> +		for (thr = 1; thr < threads_per_core; thr++)
>> +			if (cpu_online(thr * threads_per_core + thr))
>> +				cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
>
> What is the above logic trying to do? Did you mean
> cpu_online(threads_per_core * core + thr)?

Yeah. My mistake, it should be
cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus).

>> +
>> +	stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
>> +	set_bit(core, kvm_on_core);
>> +	return count;
>> +}
>> +
>> +static DEVICE_ATTR(kvm_enable, 0600,
>> +		show_kvm_enable, store_kvm_enable);
>> +
>> +static void sysfs_create_kvm_enable(void)
>> +{
>> +	device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable);
>> +}
>> +#endif
>> +
>>  #endif /* CONFIG_PPC64 */
>>
>>  #ifdef HAS_PPC_PMC_PA6T
>>
>> diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
>> index fe0cca4..68b33d8 100644
>> --- a/arch/powerpc/sysdev/xics/xics-common.c
>> +++ b/arch/powerpc/sysdev/xics/xics-common.c
>> @@ -258,6 +258,18 @@ unlock:
>>  		raw_spin_unlock_irqrestore(&desc->lock, flags);
>>  	}
>>  }
>> +
>> +int xics_migrate_irqs_away_secondary(void *data)
>> +{
>> +	int cpu = smp_processor_id();
>> +	if (cpu % thread_per_core != 0) {
>> +		WARN(condition, format...);
>> +		return 0;
>> +	}
>> +	/* In fact, if we can migrate the primary, it will be more fine */
>> +	();
>
> Isn't the aim of the patch to migrate irqs away from the secondary
> onto the primary? But from the above it looks like we are returning
> when we find out that we are secondary threads, isn't it?

Yes, will fix in the next version.

>> +	return 0;
>> +}
>>  #endif /* CONFIG_HOTPLUG_CPU */
>
> Note that xics_migrate_irqs_away() is defined under CONFIG_CPU_HOTPLUG.
> But we will need this option on PowerKVM even when hotplug is not
> configured in.

Yes, will fix the dependency in the next version.

Thx, Fan

> Regards
> Preeti U Murthy

>>  #ifdef CONFIG_SMP
Re: [RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless
On Mon, Oct 27, 2014 at 12:45 PM, Preeti U Murthy
<pre...@linux.vnet.ibm.com> wrote:
> On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
>> (This patch is a place holder.)
>>
>> If only one vcpu thread is ready (the other vcpu threads can wait for
>> it to execute), the primary thread can enter tickless mode,
>
> We do not configure NOHZ_FULL to y by default. Hence no thread would
> enter tickless mode.

But NOHZ_FULL can be chosen by the user, and we should survive it :)

>> which causes the primary to keep running, so the secondaries have no
>> opportunity to exit to the host, even if they have other tasks on
>> them.
>
> The secondary threads can still get scheduling ticks. The decrementer
> of the secondary threads is still active. So as long as secondary
> threads are busy, scheduling ticks will fire and try to schedule a new
> task on them.

No. In my original design, after enabling KVM on a core, the HDEC on
the secondaries is disabled; otherwise host exits would be too
frequent. Any suggestion?

Thx, Fan

> Regards
> Preeti U Murthy

>> Introduce a kthread (anti_tickless) on the primary, so when there is
>> only one vcpu thread on the primary, the secondaries can resort to
>> anti_tickless to keep the primary out of tickless mode. (I thought
>> that the anti_tickless thread can go to NAP, so we can let the
>> secondaries run.)
>>
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/sysfs.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
>> index a2595dd..f0b110e 100644
>> --- a/arch/powerpc/kernel/sysfs.c
>> +++ b/arch/powerpc/kernel/sysfs.c
>> @@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device *dev,
>>  	if (!test_bit(core, kvm_on_core))
>>  		for (thr = 1; thr < threads_per_core; thr++)
>>  			if (cpu_online(thr * threads_per_core + thr))
>> -				cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
>> +				cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);
>>
>>  	stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
>> +	/* fixme: create a kthread on the primary hwthread to handle tickless mode */
>> +	//kthread_create_on_cpu(prevent_tickless, NULL, core * threads_per_core, "ppckvm_prevent_tickless");
>>  	set_bit(core, kvm_on_core);
>>  	return count;
>>  }
Re: [RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host
On Mon, Oct 27, 2014 at 1:32 PM, Preeti U Murthy
<pre...@linux.vnet.ibm.com> wrote:
> Hi Liu,
>
> On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
>> Nowadays, PowerKVM runs with the secondary hwthreads offline. Although
>> we can bring all the secondary hwthreads online later, we still
>> preserve this behavior for a dedicated KVM env. Achieve this by
>> setting paca->online to false.
>>
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/paca.h         |  3 +++
>>  arch/powerpc/kernel/asm-offsets.c       |  3 +++
>>  arch/powerpc/kernel/smp.c               |  3 +++
>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 ++++++++++++
>>  4 files changed, 21 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index a5139ea..67c2500 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -84,6 +84,9 @@ struct paca_struct {
>>  	u8 cpu_start;		/* At startup, processor spins until */
>>  				/* this becomes non-zero. */
>>  	u8 kexec_state;		/* set when kexec down has irqs off */
>> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
>> +	u8 online;
>> +#endif
>>  #ifdef CONFIG_PPC_STD_MMU_64
>>  	struct slb_shadow *slb_shadow_ptr;
>>  	struct dtl_entry *dispatch_log;
>>
>> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
>> index 9d7dede..0faa8fe 100644
>> --- a/arch/powerpc/kernel/asm-offsets.c
>> +++ b/arch/powerpc/kernel/asm-offsets.c
>> @@ -182,6 +182,9 @@ int main(void)
>>  	DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
>>  	DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
>>  	DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
>> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
>> +	DEFINE(PACAONLINE, offsetof(struct paca_struct, online));
>> +#endif
>>  	DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
>>  	DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
>>  	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
>>
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index a0738af..4c3843e 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -736,6 +736,9 @@ void start_secondary(void *unused)
>>  	cpu_startup_entry(CPUHP_ONLINE);
>>
>> +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
>> +	get_paca()->online = true;
>> +#endif
>>  	BUG();
>>  }
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index f0c4db7..d5594b0 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -322,6 +322,13 @@ kvm_no_guest:
>>  	li	r0, KVM_HWTHREAD_IN_NAP
>>  	stb	r0, HSTATE_HWTHREAD_STATE(r13)
>>  kvm_do_nap:
>> +#ifdef PPCKVM_ENABLE_SECONDARY
>> +	/* check whether the cpu is needed by the host or not */
>> +	ld	r2, PACAONLINE(r13)
>> +	ld	r3, 0
>> +	cmp	r2, r3
>> +	bne	kvm_secondary_exit_trampoline
>> +#endif
>>  	/* Clear the runlatch bit before napping */
>>  	mfspr	r2, SPRN_CTRLF
>>  	clrrdi	r2, r2, 1
>>
>> @@ -340,6 +347,11 @@ kvm_do_nap:
>>  	nap
>>  	b	.
>>
>> +#ifdef PPCKVM_ENABLE_SECONDARY
>> +kvm_secondary_exit_trampoline:
>> +	b	.
>
> Uh? When we have no vcpu to run, we loop here instead of doing a nap?
> What are we achieving? If I understand the intention of the patch
> well, we are looking to provide a knob whereby the host can indicate
> if it needs the secondaries at all.

Yes, you catch it :)

> Today the host does boot with all threads online. There are some init
> scripts which take the secondaries down. So today the host does not
> have a say in preventing this, compile time or runtime. So let's see
> how we can switch between the two behaviors if we don't have the init
> script, which looks like a saner thing to do.
>
> We should set the paca->online flag to false by default. If
> KVM_PPC_ENABLE_SECONDARY is configured, we need to set this flag to
> true. So at compile time, we resolve the flag. While booting, we look
> at the flag and decide whether to get the secondaries online. So we
> get the current behavior if we have not configured
> KVM_PPC_ENABLE_SECONDARY. Will this achieve the purpose of this patch?

At boot time, KVM cannot run. So we can achieve the change of the flag
by soft cpu hotplug on/off. I think this is a more flexible way.

Thx, Fan

> Regards
> Preeti U Murthy
Re: [RFC 09/11] powerpc: kvm: handle time base on secondary hwthread
On Mon, Oct 27, 2014 at 2:40 PM, Preeti U Murthy
<pre...@linux.vnet.ibm.com> wrote:
> On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
>> (This is a place holder patch.)
>>
>> We need to store the time base for the host on the secondary
>> hwthread. Later, when switching back, we need to reprogram it with
>> the elapsed time.
>>
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index 89ea16c..a817ba6 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
>>
>>  	/* fixme: store other registers such as msr */
>>
>> +	/* fixme: store the tb, and set it to MAX, so we cease the tick on the secondary */
>
> This can lead to serious consequences. First of all, even if we set
> the decrementer (not tb) to MAX, it is bound to fire after 4s. That is
> the maximum time till which you can keep off the decrementer from
> firing. In the hotplug path, the offline cpus nap and their
> decrementers are programmed to fire at MAX too. But the difference is
> that we clear the LPCR_PECE1 wakeup bit, which prevents cpus from
> waking up on a decrementer interrupt.
>
> We cannot afford to do this here though, because there are other tasks
> on the secondary threads' runqueues. They need to be scheduled in.
> There are also timers besides the tick_sched one which can be queued
> on these secondary threads. These patches have not taken care to
> migrate timers or tasks before entering the guest, as far as I
> observed. Hence we cannot just turn off the time base like this and
> expect to handle the above mentioned events the next time the primary
> thread decides to exit to the host.

Yes, that is the nut in this series. My plan is to compensate the
hrtimer when the secondary exits. But as for the scheduler on the
secondary, if it is ceased, what is the side-effect?

Thx, Fan

> Regards
> Preeti U Murthy

>> +
>>  	/* prevent us from entering the kernel */
>>  	li	r0, 1
>>  	stb	r0, HSTATE_HWTHREAD_REQ(r13)
>>
>> @@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
>>
>>  	/* enter with vmode */
>>  kvmppc_secondary_stopper_exit:
>> +	/* fixme: restore the tb, with the orig val plus time elapsed,
>> +	 * so we can fire the hrtimer as soon as possible
>> +	 */
>> +
>>  	/* fixme: restore the stack which we stored on the lpaca */
>>  	ld	r0, 112+PPC_LR_STKOFF(r1)
Re: [RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread
On Mon, Oct 27, 2014 at 2:44 PM, Preeti U Murthy
<pre...@linux.vnet.ibm.com> wrote:
> On 10/17/2014 01:00 AM, kernelf...@gmail.com wrote:
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kvm/Kconfig | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> index 602eb51..de38566 100644
>> --- a/arch/powerpc/kvm/Kconfig
>> +++ b/arch/powerpc/kvm/Kconfig
>> @@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV
>>
>>  	  If unsure, say N.
>>
>> +config KVMPPC_ENABLE_SECONDARY
>> +	tristate "KVM support for running on secondary hwthreads in the host"
>> +	depends on KVM_BOOK3S_64_HV
>
> This patch is required on top of all the rest :) The top patches won't
> compile without this one. Every patch in the patchset should be able
> to compile successfully without the aid of the patches that come
> after it.

I think there is a conflict here. If we do so, then we should make an
effort to prevent the independent patches from taking effect before the
whole patchset is applied.

Thx, Fan

> Regards
> Preeti U Murthy
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
On Tue, Jul 29, 2014 at 2:57 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote:
>> Hope I am right. Take the following sequence as an example:
>>
>>	if (hptep[0] & HPTE_V_VALID) {
>>		/* HPTE was previously valid, so we need to invalidate it */
>>		unlock_rmap(rmap);
>>		hptep[0] |= HPTE_V_ABSENT;
>>		kvmppc_invalidate_hpte(kvm, hptep, index);
>>		/* don't lose previous R and C bits */
>>		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
>>	} else {
>>		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
>>	}
>>		<- if we try_to_unmap the pfn at this point, then @r contains an invalid pfn
>>	hptep[1] = r;
>>	eieio();
>>	hptep[0] = hpte[0];
>>	asm volatile("ptesync" : : : "memory");
>
> If that was the case, we would have the same race in
> kvmppc_do_h_enter(). I think the fact that the HPTE is locked will
> prevent the race, i.e. HPTE_V_HVLOCK is set until hptep[0] is written
> to.
>
> If I look at the unmap case, my understanding is that it uses
> kvm_unmap_rmapp(), which will also lock the HPTE (try_lock_hpte) and
> so shouldn't have a race vs the above code.

Yes, you are right :)

Thx, Fan

> Or do you see a race I don't ?
>
> Cheers,
> Ben.

>> Thx.
>> Fan
>>
>> On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
>> <b...@kernel.crashing.org> wrote:
>>> On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
>>>> In current code, the setup of the hpte runs the risk of racing with
>>>> mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid
>>>> pfn. Resolve this issue by syncing the two actions with
>>>> KVMPPC_RMAP_LOCK_BIT.
>>>
>>> Please describe the race you think you see. I'm quite sure both Paul
>>> and I went over that code and somewhat convinced ourselves that it
>>> was ok, but it's possible that we were both wrong :-)
>>>
>>> Cheers,
>>> Ben.

>>>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>>> ---
>>>>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++++++++++-----
>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>>>> index 8056107..e6dcff4 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>>>> @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>>>
>>>>  	if (hptep[0] & HPTE_V_VALID) {
>>>>  		/* HPTE was previously valid, so we need to invalidate it */
>>>> -		unlock_rmap(rmap);
>>>>  		hptep[0] |= HPTE_V_ABSENT;
>>>>  		kvmppc_invalidate_hpte(kvm, hptep, index);
>>>>  		/* don't lose previous R and C bits */
>>>>  		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
>>>> +
>>>> +		hptep[1] = r;
>>>> +		eieio();
>>>> +		hptep[0] = hpte[0];
>>>> +		asm volatile("ptesync" : : : "memory");
>>>> +		unlock_rmap(rmap);
>>>>  	} else {
>>>> +		hptep[1] = r;
>>>> +		eieio();
>>>> +		hptep[0] = hpte[0];
>>>> +		asm volatile("ptesync" : : : "memory");
>>>>  		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
>>>>  	}
>>>> -	hptep[1] = r;
>>>> -	eieio();
>>>> -	hptep[0] = hpte[0];
>>>> -	asm volatile("ptesync" : : : "memory");
>>>> +
>>>>  	preempt_enable();
>>>>  	if (page && hpte_is_writable(r))
>>>>  		SetPageDirty(page);
[PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
In the current code, the setup of the hpte runs the risk of racing with
mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..e6dcff4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,

 	if (hptep[0] & HPTE_V_VALID) {
 		/* HPTE was previously valid, so we need to invalidate it */
-		unlock_rmap(rmap);
 		hptep[0] |= HPTE_V_ABSENT;
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
+
+		hptep[1] = r;
+		eieio();
+		hptep[0] = hpte[0];
+		asm volatile("ptesync" : : : "memory");
+		unlock_rmap(rmap);
 	} else {
+		hptep[1] = r;
+		eieio();
+		hptep[0] = hpte[0];
+		asm volatile("ptesync" : : : "memory");
 		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
 	}
-	hptep[1] = r;
-	eieio();
-	hptep[0] = hpte[0];
-	asm volatile("ptesync" : : : "memory");
+
 	preempt_enable();
 	if (page && hpte_is_writable(r))
 		SetPageDirty(page);
--
1.8.1.4
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
Hope I am right. Take the following sequence as an example:

	if (hptep[0] & HPTE_V_VALID) {
		/* HPTE was previously valid, so we need to invalidate it */
		unlock_rmap(rmap);
		hptep[0] |= HPTE_V_ABSENT;
		kvmppc_invalidate_hpte(kvm, hptep, index);
		/* don't lose previous R and C bits */
		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
	} else {
		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
	}
		<- if we try_to_unmap the pfn at this point, then @r contains an invalid pfn
	hptep[1] = r;
	eieio();
	hptep[0] = hpte[0];
	asm volatile("ptesync" : : : "memory");

Thx.
Fan

On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
>> In current code, the setup of the hpte runs the risk of racing with
>> mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid
>> pfn. Resolve this issue by syncing the two actions with
>> KVMPPC_RMAP_LOCK_BIT.
>
> Please describe the race you think you see. I'm quite sure both Paul
> and I went over that code and somewhat convinced ourselves that it was
> ok, but it's possible that we were both wrong :-)
>
> Cheers,
> Ben.

>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++++++++++-----
>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> index 8056107..e6dcff4 100644
>> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>
>>  	if (hptep[0] & HPTE_V_VALID) {
>>  		/* HPTE was previously valid, so we need to invalidate it */
>> -		unlock_rmap(rmap);
>>  		hptep[0] |= HPTE_V_ABSENT;
>>  		kvmppc_invalidate_hpte(kvm, hptep, index);
>>  		/* don't lose previous R and C bits */
>>  		r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
>> +
>> +		hptep[1] = r;
>> +		eieio();
>> +		hptep[0] = hpte[0];
>> +		asm volatile("ptesync" : : : "memory");
>> +		unlock_rmap(rmap);
>>  	} else {
>> +		hptep[1] = r;
>> +		eieio();
>> +		hptep[0] = hpte[0];
>> +		asm volatile("ptesync" : : : "memory");
>>  		kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
>>  	}
>> -	hptep[1] = r;
>> -	eieio();
>> -	hptep[0] = hpte[0];
>> -	asm volatile("ptesync" : : : "memory");
>> +
>>  	preempt_enable();
>>  	if (page && hpte_is_writable(r))
>>  		SetPageDirty(page);
[PATCH v4] powerpc: kvm: make _PAGE_NUMA take effect
Numa fault is a method which helps to achieve auto numa balancing. When
such a page fault takes place, the page fault handler will check whether
the page is placed correctly. If not, migration should be involved to
cut down the distance between the cpu and the pages.

A pte with _PAGE_NUMA helps to implement numa fault. It means not to
allow the MMU to access the page directly, so a page fault is triggered
and the numa fault handler gets the opportunity to run the checker.

As for the access by the MMU, we need special handling for powernv's
guest. When we mark a pte with _PAGE_NUMA, we already call mmu_notifier
to invalidate it in the guest's htab, but when we try to re-insert it,
we first try to map it in real mode. Only if this fails do we fall back
to virt mode, and most importantly, we run the numa fault handler in
virt mode. This patch guards the real-mode path to ensure that if a pte
is marked with _PAGE_NUMA, it will NOT be mapped in real mode; instead,
it will be mapped in virt mode and get the opportunity to be checked
for placement.
Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>
---
v4: more detailed description
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..8fcc363 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	pte_size = psize;
 	pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-	if (pte_present(pte)) {
+	if (pte_present(pte) && !pte_numa(pte)) {
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
 			ptel = hpte_make_readonly(ptel);
--
1.8.1.4
Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
On Mon, Apr 14, 2014 at 2:43 PM, Alexander Graf <ag...@suse.de> wrote:
> On 13.04.14 04:27, Liu ping fan wrote:
>> On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf <ag...@suse.de> wrote:
>>> On 11.04.2014, at 13:45, Liu Ping Fan <pingf...@linux.vnet.ibm.com> wrote:
>>>
>>>> When we mark a pte with _PAGE_NUMA we already call
>>>> mmu_notifier_invalidate_range_start and
>>>> mmu_notifier_invalidate_range_end, which will mark the existing
>>>> guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we
>>>> are inserting new guest hpte entries.
>>>
>>> What happens when we don't? Why do we need the check? Why isn't it
>>> done implicitly? What happens when we treat a NUMA marked page as
>>> non-present? Why does it work out for us?
>>>
>>> Assume you have no idea what PAGE_NUMA is, but try to figure out
>>> what this patch does and whether you need to cherry-pick it into
>>> your downstream kernel. The description as is still is not very
>>> helpful for that. It doesn't even explain what really changes with
>>> this patch applied.
>>
>> Yeah, what about appending the following description? Can it make the
>> context clear?
>>
>>   Guest should not set up an hpte for the page whose pte is marked
>>   with _PAGE_NUMA, so on the host the numa-fault mechanism can take
>>   effect to check whether the page is placed correctly or not.
>
> Try to come up with a text that answers the following questions in
> order:

I divide them into 3 groups, and answer them in 3 sections. It seems
that gives the whole story :) Please take a look.

> - What does _PAGE_NUMA mean?

Group 1 -> section 2

> - How does page migration with _PAGE_NUMA work?
> - Why should we not map pages when _PAGE_NUMA is set?

Group 2 -> section 1

(Note: for the 1st question in this group, I am not sure about the
details, except that we can fix numa balancing by moving the task or
moving the page. So I comment it as "migration should be involved to
cut down the distance between the cpu and pages".)

> - Which part of what needs to be done did the previous _PAGE_NUMA
>   patch address?
> - What's the situation without this patch?
> - Which scenario does this patch fix?

Group 3 -> section 3

Numa fault is a method which helps to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and pages.

A pte with _PAGE_NUMA helps to implement numa fault. It means not to
allow the MMU to access the page directly, so a page fault is
triggered and the numa fault handler gets the opportunity to run the
checker.

As for the access by the MMU, we need special handling for powernv's
guest. When we mark a pte with _PAGE_NUMA, we already call
mmu_notifier to invalidate it in the guest's htab, but when we try to
re-insert it, we first try to fix it in real mode. Only if this fails
do we fall back to virt mode, and most importantly, we run the numa
fault handler in virt mode. This patch guards the real-mode path to
ensure that if a pte is marked with _PAGE_NUMA, it will NOT be fixed
in real mode; instead, it will be fixed in virt mode and get the
opportunity to be checked for placement.

Thx, Fan

> Once you have a text that answers those, you should have a good patch
> description :).
>
> Alex
Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
On Mon, Apr 14, 2014 at 2:43 PM, Alexander Graf ag...@suse.de wrote: On 13.04.14 04:27, Liu ping fan wrote: On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote: On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote: When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. What happens when we don't? Why do we need the check? Why isn't it done implicitly? What happens when we treat a NUMA marked page as non-present? Why does it work out for us? Assume you have no idea what PAGE_NUMA is, but try to figure out what this patch does and whether you need to cherry-pick it into your downstream kernel. The description as is still is not very helpful for that. It doesn't even explain what really changes with this patch applied. Yeah. what about appending the following description? Can it make the context clear? Guest should not setup a hpte for the page whose pte is marked with _PAGE_NUMA, so on the host, the numa-fault mechanism can take effect to check whether the page is placed correctly or not. Try to come up with a text that answers the following questions in order: I divide them into 3 groups, and answer them by 3 sections. Seems that it has the total story :) Please take a look. - What does _PAGE_NUMA mean? Group 1 - section 2 - How does page migration with _PAGE_NUMA work? - Why should we not map pages when _PAGE_NUMA is set? Group 2 - section 1 (Note: for the 1st question in this group, I am not sure about the details, except that we can fix numa balancing by moving task or moving page. So I comment as migration should be involved to cut down the distance between the cpu and pages) - Which part of what needs to be done did the previous _PAGE_NUMA patch address? - What's the situation without this patch? 
- Which scenario does this patch fix? Group 3 - section 3

NUMA fault is a method which helps to achieve automatic NUMA balancing. When such a page fault takes place, the page fault handler checks whether the page is placed correctly. If not, migration is involved to cut down the distance between the cpu and the pages.

A pte marked with _PAGE_NUMA helps to implement the NUMA fault: the mark forbids the MMU from accessing the page directly, so a page fault is triggered and the NUMA fault handler gets the opportunity to run the placement check.

As for the MMU access, we need special handling for powernv's guest. When we mark a pte with _PAGE_NUMA, we already call the mmu_notifier to invalidate it in the guest's htab, but when we try to re-insert it, we first try to fix it up in real mode. Only after this fails do we fall back to virt mode, and most importantly, we run the NUMA fault handler in virt mode. This patch guards the real-mode path to ensure that if a pte is marked with _PAGE_NUMA, it will NOT be fixed in real mode; instead, it will be fixed in virt mode, where its placement can be checked.

Thx, Fan

Once you have a text that answers those, you should have a good patch description :).

Alex
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote: On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote: When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. What happens when we don't? Why do we need the check? Why isn't it done implicitly? What happens when we treat a NUMA marked page as non-present? Why does it work out for us? Assume you have no idea what PAGE_NUMA is, but try to figure out what this patch does and whether you need to cherry-pick it into your downstream kernel. The description as is still is not very helpful for that. It doesn't even explain what really changes with this patch applied. Yeah. what about appending the following description? Can it make the context clear? Guest should not setup a hpte for the page whose pte is marked with _PAGE_NUMA, so on the host, the numa-fault mechanism can take effect to check whether the page is placed correctly or not. 
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v3: rebased onto Alex's tree, branch kvm-ppc-next
    substitute the commit log with the clearer description in Aneesh's reply (thanks, Aneesh)
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..1117525 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	pte_size = psize;
 	pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-	if (pte_present(pte)) {
+	if (pte_present(pte)&&!pte_numa(pte)) {

Spaces missing

Will fix,
Thx, Fan
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
When we mark a pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark the existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v3: rebased onto Alex's tree, branch kvm-ppc-next
    substitute the commit log with the clearer description in Aneesh's reply (thanks, Aneesh)
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..1117525 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	pte_size = psize;
 	pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
-	if (pte_present(pte)) {
+	if (pte_present(pte)&&!pte_numa(pte)) {
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
 			ptel = hpte_make_readonly(ptel);
--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
On Mon, Apr 7, 2014 at 4:36 PM, Alexander Graf ag...@suse.de wrote: On 07.04.14 09:42, Aneesh Kumar K.V wrote: Alexander Graf ag...@suse.com writes: On 03.04.14 04:36, Liu ping fan wrote: Hi Alex, could you help to pick up this patch? since the v3.14 kernel can enable numa fault for powerpc. What bad happens without this patch? We map a page even though it was declared to get NUMA migrated? What happens next? Nothing much, we won't be properly accounting the numa access in the host. What we want to achieve is to convert a guest access of the page to a host fault so that we can do proper numa access accounting in the host. This would enable us to migrate the page to the correct numa node. Ok, so no breakages, just less performance. I wouldn't consider it stable material then :).

Sorry for the late reply; I have been half out of the office during this period. I am puzzled by your reply. Without this patch, the guest can NOT sense the NUMA changes and ask the host to put the pages in the right place, so the pages used by the guest will always be misplaced. The numa-fault method is motivated by a real requirement to improve performance, so we should also consider the performance drop of the guest. Right?

Regards, Fan

Alex
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
Hi Alex, could you help to pick up this patch? since v3.14 kernel can enable numa fault for powerpc. Thx, Fan On Mon, Jan 27, 2014 at 5:11 PM, Alexander Graf ag...@suse.de wrote: On 21.01.2014, at 10:42, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu Ping Fan kernelf...@gmail.com writes: To make sure that on host, the pages marked with _PAGE_NUMA result in a fault when guest access them, we should force the checking when guest uses hypercall to setup hpte. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. This patch does that. So what happens next? We insert a page into the HTAB without HPTE_V_VALID set, so the guest will fail to use it. If the guest does an H_READ on it it will suddenly turn to V_VALID though? I might need a crash course in the use of HPTE_V_ABSENT. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
	);
		}
	}

	if (clock_gettime(CLOCK_REALTIME, &now) == -1)
		handle_error("clock_gettime");

	/* Create a CLOCK_REALTIME absolute timer with initial expiration
	 * and interval as specified in command line */
	new_value.it_value.tv_sec = now.tv_sec + elapsed;
	new_value.it_value.tv_nsec = now.tv_nsec;
	new_value.it_interval.tv_sec = 0;
	new_value.it_interval.tv_nsec = 0;
	if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
		handle_error("timerfd_settime");

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	/* -1: infinite wait */
	poll(&pfd, 1, -1);

	/* ask children to stop and get back cnt */
	*(pmap + SHM_CMD_OFF) = CMD_STOP;
	wait(NULL);
	dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
	printf(dst_info);
	printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));

	munmap(pmap, PAGE_SIZE);
	shm_unlink(SHM_FNAME);
}

On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf ag...@suse.de wrote: On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check.
This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) If my suppose is correct, will CCing k...@vger.kernel.org from next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare-metal to verify it. So I hope at least, from the theory, it is correct. Paul, could you please give this some thought and maybe benchmark it? Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
To make sure that, on the host, the pages marked with _PAGE_NUMA result in a fault when the guest accesses them, we should force the check when the guest uses a hypercall to set up the hpte.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v2: It should be the reply to [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect
    And I improve the changelog according to Aneesh's suggestion.
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	/* Look up the Linux PTE for the backing page */
 	pte_size = psize;
 	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-	if (pte_present(pte)) {
+	if (pte_present(pte) && !pte_numa(pte)) {
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
 			ptel = hpte_make_readonly(ptel);
--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. After setting _PAGE_NUMA, we should flush out all hptes both in host's htab and guest's. 
So when guest tries to access memory, host finds that there is not hpte ready for guest in guest's htab. And host should raise dsi to guest. Now guest receive that fault, removes the PAGE_NUMA bit and do an hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have cleared PAGE_NUMA bit. This incurs that guest ends up in h_enter. And you can see in current code, we also try this quick path firstly. Only if fail, we will resort to slow path -- kvmppc_hpte_hv_fault. hmm ? hpte_hv_fault is the hypervisor handling the fault. After we discuss in irc. I think we should also do the fast check in kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT, and let H_ENTER take care of the rest case i.e. no hpte when pte_mknuma. Right? Thanks and regards, Fan -aneesh -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Tue, Jan 21, 2014 at 5:07 PM, Liu ping fan kernelf...@gmail.com wrote: On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. 
After setting _PAGE_NUMA, we should flush out all hptes both in host's htab and guest's. So when guest tries to access memory, host finds that there is not hpte ready for guest in guest's htab. And host should raise dsi to guest. Now guest receive that fault, removes the PAGE_NUMA bit and do an hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have cleared PAGE_NUMA bit. This incurs that guest ends up in h_enter. And you can see in current code, we also try this quick path firstly. Only if fail, we will resort to slow path -- kvmppc_hpte_hv_fault. hmm ? hpte_hv_fault is the hypervisor handling the fault. After we discuss in irc. I think we should also do the fast check in kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT, and let H_ENTER take care of the rest case i.e. no hpte when pte_mknuma. Right? Or we can delay the quick fix in H_ENTER, and let the host fault again, so do the fix in kvmppc_hpte_hv_fault() Thanks and regards, Fan -aneesh -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. After setting _PAGE_NUMA, we should flush out all hptes both in host's htab and guest's. So when guest tries to access memory, host finds that there is not hpte ready for guest in guest's htab. And host should raise dsi to guest. 
This incurs that guest ends up in h_enter. And you can see in current code, we also try this quick path firstly. Only if fail, we will resort to slow path -- kvmppc_hpte_hv_fault. Thanks and regards, Fan If my suppose is correct, will CCing k...@vger.kernel.org from next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare-metal to verify it. So I hope at least, from the theory, it is correct. -aneesh -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) If my suppose is correct, will CCing k...@vger.kernel.org from next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare-metal to verify it. So I hope at least, from the theory, it is correct. Thanks and regards, Ping Fan Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] powernv: kvm: numa fault improvement
This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64.

For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which I am still trying to get a machine to show numbers). But for this series, I think that I have a good justification -- the well-known fact of the heavy cost of switching context between guest and host.

If my supposition is correct, I will CC k...@vger.kernel.org from the next version.

Liu Ping Fan (4):
  mm: export numa_migrate_prep()
  powernv: kvm: make _PAGE_NUMA take effect
  powernv: kvm: extend input param for lookup_linux_pte
  powernv: kvm: make the handling of _PAGE_NUMA faster for guest

 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 38 ++---
 include/linux/mm.h                  |  2 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte
It will be helpful for the next patch.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Can it be merged with the next patch?
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index af8602d..ae46052 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -135,7 +135,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 }
 
 static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+			      int writing, unsigned long *pte_sizep,
+			      pte_t **ptepp)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
@@ -144,6 +145,8 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 	ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
 	if (!ptep)
 		return __pte(0);
+	if (ptepp != NULL)
+		*ptepp = ptep;
 	if (hugepage_shift)
 		*pte_sizep = 1ul << hugepage_shift;
 	else
@@ -231,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	/* Look up the Linux PTE for the backing page */
 	pte_size = psize;
-	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
+	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
 	if (pte_present(pte) && !pte_numa(pte)) {
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
@@ -671,7 +674,8 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
 	memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
 	if (memslot) {
 		hva = __gfn_to_hva_memslot(memslot, gfn);
-		pte = lookup_linux_pte(pgdir, hva, 1, &psize);
+		pte = lookup_linux_pte(pgdir, hva, 1, &psize,
+				       NULL);
 		if (pte_present(pte) && !pte_write(pte))
 			r = hpte_make_readonly(r);
 	}
--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect
To make _PAGE_NUMA take effect, we should force the check when the guest uses a hypercall to set up the hpte.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	/* Look up the Linux PTE for the backing page */
 	pte_size = psize;
 	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-	if (pte_present(pte)) {
+	if (pte_present(pte) && !pte_numa(pte)) {
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
 			ptel = hpte_make_readonly(ptel);
--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest
The periodic check of _PAGE_NUMA can happen on a correctly placed page. For this case, when the guest tries to set up an hpte in real mode, we try to resolve the numa fault in real mode too, since the switch between guest context and host context costs too much.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index ae46052..a06b199 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -179,6 +179,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	unsigned int writing;
 	unsigned long mmu_seq;
 	unsigned long rcbits;
+	struct mm_struct *mm = kvm->mm;
+	struct vm_area_struct *vma;
+	int page_nid, target_nid;
+	struct page *test_page;
+	pte_t *ptep;
 
 	psize = hpte_page_size(pteh, ptel);
 	if (!psize)
@@ -234,8 +239,26 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	/* Look up the Linux PTE for the backing page */
 	pte_size = psize;
-	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
-	if (pte_present(pte) && !pte_numa(pte)) {
+	pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, &ptep);
+	if (pte_present(pte)) {
+		if (pte_numa(pte)) {
+			/* If fail, let gup handle it */
+			if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+				goto pte_check;
+
+			vma = find_vma(mm, hva);
+			up_read(&mm->mmap_sem);
+			test_page = pte_page(pte);
+			page_nid = page_to_nid(test_page);
+			target_nid = numa_migrate_prep(test_page, vma,
+						       hva, page_nid);
+			put_page(test_page);
+			if (unlikely(target_nid != -1)) {
+				/* If fail, let gup handle it */
+				goto pte_check;
+			}
+		}
+
 		if (writing && !pte_write(pte))
 			/* make the actual HPTE be read-only */
 			ptel = hpte_make_readonly(ptel);
@@ -244,6 +267,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		}
 	}
 
+pte_check:
 	if (pte_size < psize)
 		return H_PARAMETER;
 	if (pa && pte_size > psize)
@@ -339,6 +363,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		pteh &= ~HPTE_V_VALID;
 		unlock_rmap(rmap);
 	} else {
+		if (pte_numa(pte) && pa) {
+			pte = pte_mknonnuma(pte);
+			*ptep = pte;
+		}
 		kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index,
					realmode);
 		/* Only set R/C in real HPTE if already set in *rmap */
--
1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] mm: export numa_migrate_prep()
powerpc will use it in the fast path.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab0e22..420fb77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
+extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+			unsigned long addr, int page_nid);
 
 /*
  * doesn't attempt to fault and will return short.
--
1.8.1.4
Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene
On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf ag...@suse.de wrote: On 19.11.2013, at 07:12, Liu Ping Fan kernelf...@gmail.com wrote: Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Any particular reason for the resend? The patch is already applied, no? Oh, seems that I misunderstood your meaning. You said Actually, I've changed my mind and moved the patch to the for-3.13 branch instead. Please make sure to CC kvm@vger on all patches you submit though. So I think it is necessary to resend with cc kvm@vger Regards, Pingfan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] powerpc: kvm: optimize sc 1 as fast return
In some scenarios, e.g. OpenStack CI, a PR guest can trigger sc 1 frequently. This patch optimizes that path by directly delivering BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the HV guest without a heavy exit, i.e. without swapping the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v3: add some document
---
 arch/powerpc/kvm/book3s_hv.c            | 10 ++++------
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++++++++++++++++++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..1addb1a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,10 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		/* hcall - punt to userspace */
 		int i;
 
-		if (vcpu->arch.shregs.msr & MSR_PR) {
-			/* sc 1 from userspace - reflect to guest syscall */
-			kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-			r = RESUME_GUEST;
-			break;
-		}
+		/* hypercall with MSR_PR has already been handled in rmode,
+		 * and never reaches here.
+		 */
+
 		run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
 		for (i = 0; i < 9; ++i)
 			run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..eea7ca7 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -534,6 +534,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 5:	mtspr	SPRN_SRR0, r6
 	mtspr	SPRN_SRR1, r7
 
+/*
+ * Required state:
+ * R4 = vcpu
+ * R10: value for HSRR0
+ * R11: value for HSRR1
+ * R13 = PACA
+ */
 fast_guest_return:
 	li	r0,0
 	stb	r0,VCPU_CEDED(r4)	/* cancel cede */
@@ -1388,7 +1395,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
 	ld	r3,VCPU_GPR(R3)(r9)
 	andi.	r0,r11,MSR_PR
-	bne	guest_exit_cont
+	/* sc 1 from userspace - reflect to guest syscall */
+	bne	sc_1_fast_return
 	clrrdi	r3,r3,2
 	cmpldi	r3,hcall_real_table_end - hcall_real_table
 	bge	guest_exit_cont
@@ -1409,6 +1417,15 @@ hcall_try_real_mode:
 	ld	r11,VCPU_MSR(r4)
 	b	fast_guest_return
 
+sc_1_fast_return:
+	mtspr	SPRN_SRR0,r10
+	mtspr	SPRN_SRR1,r11
+	li	r10, BOOK3S_INTERRUPT_SYSCALL
+	li	r11, (MSR_ME << 1) | 1	/* synthesize MSR_SF | MSR_ME */
+	rotldi	r11, r11, 63
+	mr	r4,r9
+	b	fast_guest_return
+
 /* We've attempted a real mode hcall, but it's punted it back
  * to userspace.  We need to restore some clobbered volatiles
  * before resuming the pass-it-to-qemu path */
--
1.8.1.4
[PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, it can trigger a deadlock. Suppose the following scenario: two physical cpus, cpuM and cpuN, and two VM instances A and B, each with a group of vcpus. If vcpu_A_1 on cpuM holds bitlock X (HPTE_V_HVLOCK) and is then switched out, while vcpu_A_2 on cpuN tries to lock X in realmode, cpuN will be stuck in realmode for a long time. Things get even worse if the following happens: on cpuM bitlock X is held, on cpuN Y is held; vcpu_B_2 tries to lock Y on cpuM in realmode; vcpu_A_2 tries to lock X on cpuN in realmode. Oops, deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +++++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 ++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 		slb_v = vcpu->kvm->arch.vrma_slb_v;
 	}
 
+	preempt_disable();
 	/* Find the HPTE in the hash table */
 	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 					 HPTE_V_VALID | HPTE_V_ABSENT);
-	if (index < 0)
+	if (index < 0) {
+		preempt_enable();
 		return -ENOENT;
+	}
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
 	v = hptep[0] & ~HPTE_V_HVLOCK;
 	gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	/* Unlock the HPTE */
 	asm volatile("lwsync" : : : "memory");
 	hptep[0] = v;
+	preempt_enable();
 
 	gpte->eaddr = eaddr;
 	gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
 	20,	/* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 			      unsigned long valid)
 {
--
1.8.1.4
[PATCH v3] powerpc: kvm: optimize sc 1 as fast return
In some scene, e.g openstack CI, PR guest can trigger sc 1 frequently, this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to HV guest, so powernv can return to HV guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +++- 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu-arch.shregs.msr MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i 9; ++i) run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..0d1e2c2 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_1_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,15 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63 + mr r4,r9 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. 
We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] powerpc: kvm: optimize sc 1 as fast return
In some scene, e.g openstack CI, PR guest can trigger sc 1 frequently, this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to HV guest, so powernv can return to HV guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu-arch.shregs.msr MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i 9; ++i) run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..a463f08 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_1_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,14 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. 
We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- v4: remove the over-engineered part and keep it simple, also add some notes. --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 842f081..abf81fe 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, slb_v = vcpu-kvm-arch.vrma_slb_v; } + preempt_disable(); /* Find the HPTE in the hash table */ index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); - if (index 0) + if (index 0) { + preempt_enable(); return -ENOENT; + } hptep = (unsigned long *)(kvm-arch.hpt_virt + (index 4)); v = hptep[0] ~HPTE_V_HVLOCK; gr = kvm-arch.revmap[index].guest_rpte; @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, /* Unlock the HPTE */ asm volatile(lwsync : : : memory); hptep[0] = v; + preempt_enable(); gpte-eaddr = eaddr; gpte-vpage = ((v HPTE_V_AVPN) 4) | ((eaddr 12) 0xfff); diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 9c51544..ea17b30 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = { 20, /* 1M, unsupported */ }; +/* When called from virtmode, this func should be protected by + * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK + * can trigger deadlock issue. + */ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, unsigned long valid) { -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 12:19 PM, Liu ping fan kernelf...@gmail.com wrote: On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. Maybe I misunderstood the ISA spec, but refer for 6.5.14 System Call Interrupt, no description about the MSR_PR when sc trigger a syscall interrupt. So I think, guest application sc 0 will also fall to the kernel who owns hypervisor mode. Am I right? Some further comment: I think the essential of the problem is whether we switch RMA from guest to HV when interrupts raise. DSI/ISI will be redirected to HDSI and RMA switch. But what about SYSCALL, and DEC, external interrupt, ...etc? I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Sorry, but just want to make clear about the idiom: 0 - kernel run with NV, and 1st - kernel run on HV-KVM and provide PR-KVM to up layer? Right? When you say try to not bounce to the 1st hypervisor , what is the exact meaning and how can we achieve this? I am a quite newer on powerpc, and hope that I can get more clear figure about it :) Thanks Pingfan -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene
On Thu, Nov 7, 2013 at 5:54 PM, Alexander Graf ag...@suse.de wrote: On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote: Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Very nice catch :). I think it makes a lot of sense to document the fact that kvmppc_hv_find_lock_hpte() should be called with preemption disabled in a comment above the function, so that next time when someone potentially calls it, he knows that he needs to put preempt_disable() around it. Ok, I will document them in v3 Thanks a lot for finding this pretty subtle issue. May I ask how you got there? Did you actually see systems deadlock because of this? Intuition :). then I begin try to model a scene which causes the deadlock. And fortunately, I find a case to verify my suspension. 
Regards, Pingfan Alex --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 043eec8..dbc1478 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, } /* Find the HPTE in the hash table */ + preempt_disable(); index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); - if (index 0) + if (index 0) { + preempt_enable(); return -ENOENT; + } hptep = (unsigned long *)(kvm-arch.hpt_virt + (index 4)); v = hptep[0] ~HPTE_V_HVLOCK; gr = kvm-arch.revmap[index].guest_rpte; @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, /* Unlock the HPTE */ asm volatile(lwsync : : : memory); hptep[0] = v; + preempt_enable(); gpte-eaddr = eaddr; gpte-vpage = ((v HPTE_V_AVPN) 4) | ((eaddr 12) 0xfff); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] powerpc/kvm: remove redundant assignment
On Thu, Nov 7, 2013 at 6:06 PM, Alexander Graf ag...@suse.de wrote: On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote: ret is assigned twice with the same value, so remove the later one. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Acked-by: Paul Mackerras pau...@samba.org I suppose my last request for a patch description was slightly too abbreviated :). Sorry about that. Imagine you are a Linux-stable maintainer. You have about 5000 patches in front of you and you want to figure out whether a patch should get backported into a stable tree or not. It's very easy to read through the patch description. It's reasonably easy to do a git show on the patch. It's very hard to look at the actual surrounding code that was changed. If I open a text editor on the file, I immediately see what you're saying: ret = RESUME_GUEST; preempt_disable(); while (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) cpu_relax(); if ((hptep[0] ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] || rev-guest_rpte != hpte[2]) /* HPTE has been changed under us; let the guest retry */ goto out_unlock; hpte[0] = (hpte[0] ~HPTE_V_ABSENT) | HPTE_V_VALID; rmap = memslot-arch.rmap[gfn - memslot-base_gfn]; lock_rmap(rmap); /* Check if we might have been invalidated; let the guest retry if so */ ret = RESUME_GUEST; However, that scope is not given in the actual patch itself. If you look at the diff below, you have no idea whether the patch is fixing a bug or just removes duplication and doesn't actually have any effect. In fact, the compiled assembly should be the same with this patch and without. But you can't tell from the diff below. So what I would like to see in the patch description is something that makes it easy to understand what's going on without the need to check out the source file. Something like We redundantly set ret to RESUME_GUEST twice without changing it in between. 
Only do it once: ret = RESUME_GUEST; preempt_disable(); while (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) cpu_relax(); if ((hptep[0] ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] || rev-guest_rpte != hpte[2]) /* HPTE has been changed under us; let the guest retry */ goto out_unlock; hpte[0] = (hpte[0] ~HPTE_V_ABSENT) | HPTE_V_VALID; rmap = memslot-arch.rmap[gfn - memslot-base_gfn]; lock_rmap(rmap); /* Check if we might have been invalidated; let the guest retry if so */ ret = RESUME_GUEST; If I look at that patch description it immediately tells me Ah, no need to worry, it's not a critical bug I need to backport. If you have a better idea how to express that I'm more than happy to take that too. Otherwise just let me know whether you like the description above and I'll modify it to the one that includes the code snippet when applying the patch. Oh, yes. Thank you very much.. And I had a better understanding of the heavy work of maintainers :) Will keep this in mind. Best regards, Pingfan -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] powerpc: kvm: optimize sc 0 as fast return
syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- Compiled, but lack of bare metal, I have not tested it yet. --- arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 - 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu-arch.shregs.msr MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i 9; ++i) run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..9f626c3 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_0_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,16 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_0_fast_return: + ld r10,VCPU_PC(r9) + ld r11,VCPU_MSR(r9) + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + LOAD_REG_IMMEDIATE(r3,0x87a0) /* zero 33:36,42:47 */ + and r11,r11,r3 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. 
We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. Maybe I misunderstood the ISA spec, but refer for 6.5.14 System Call Interrupt, no description about the MSR_PR when sc trigger a syscall interrupt. So I think, guest application sc 0 will also fall to the kernel who owns hypervisor mode. Am I right? I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Sorry, but just want to make clear about the idiom: 0 - kernel run with NV, and 1st - kernel run on HV-KVM and provide PR-KVM to up layer? Right? When you say try to not bounce to the 1st hypervisor , what is the exact meaning and how can we achieve this? I am a quite newer on powerpc, and hope that I can get more clear figure about it :) Thanks Pingfan Alex Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- Compiled, but lack of bare metal, I have not tested it yet. 
--- arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 - 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu-arch.shregs.msr MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i 9; ++i) run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..9f626c3 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_0_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,16 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_0_fast_return: + ld r10,VCPU_PC(r9) + ld r11,VCPU_MSR(r9) + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + LOAD_REG_IMMEDIATE(r3,0x87a0) /* zero 33:36,42:47 */ + and r11,r11,r3 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 12:11 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Fri, 2013-11-08 at 15:05 +1100, Benjamin Herrenschmidt wrote: On Fri, 2013-11-08 at 04:10 +0100, Alexander Graf wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Well, so unfortunately openstack CI uses PR inside HV pretty heavily it *might* be worthwhile optimizing that path if the patch is simple enough... I'd make that Paul's call. Note that this is a statement of value for the idea ... not the implementation ;-) From a quick look with Paulus, the patch is quite broken. I'll let Paul comment in details. Thank you very much, Regards, Pingfan -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/2] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/include/asm/kvm_book3s.h | 4 ++-- arch/powerpc/kvm/book3s_64_mmu_hv.c | 5 +++-- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 20 3 files changed, 21 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h index a818932..3d710ba 100644 --- a/arch/powerpc/include/asm/kvm_book3s.h +++ b/arch/powerpc/include/asm/kvm_book3s.h @@ -129,9 +129,9 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu); extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, unsigned long addr, unsigned long status); -extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val); +extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val, bool vmode); extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, - unsigned long slb_v, unsigned long valid); + unsigned long slb_v, unsigned long valid, bool vmode); extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte); extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu); diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 97685e7..12d9635 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -475,13 +475,14 @@ static int 
kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, /* Find the HPTE in the hash table */ index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, -HPTE_V_VALID | HPTE_V_ABSENT); +HPTE_V_VALID | HPTE_V_ABSENT, +true); if (index 0) return -ENOENT; hptep = (unsigned long *)(kvm-arch.hpt_virt + (index 4)); v = hptep[0]; gr = kvm-arch.revmap[index].guest_rpte; - kvmppc_hv_unlock_hpte(hptep, v); + kvmppc_hv_unlock_hpte(hptep, v, true); gpte-eaddr = eaddr; gpte-vpage = ((v HPTE_V_AVPN) 4) | ((eaddr 12) 0xfff); diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 0ff9e91..18a9425 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -749,16 +749,22 @@ static int slb_base_page_shift[4] = { 20, /* 1M, unsupported */ }; -void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val) +void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val, + bool vmode) { *hpte_val = *hpte_val ~HPTE_V_HVLOCK; asm volatile(lwsync : : : memory); *hptep = *hpte_val; + if (unlikely(vmode)) + preempt_enable(); } EXPORT_SYMBOL(kvmppc_hv_unlock_hpte); +/* If called from virtmode and success to lock, then the context will be set + * as preemption disabled + */ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, - unsigned long valid) + unsigned long valid, bool vmode) { unsigned int i; unsigned int pshift; @@ -796,6 +802,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, avpn = ~0x7fUL; val |= avpn; + if (unlikely(vmode)) + preempt_disable(); + for (;;) { hpte = (unsigned long *)(kvm-arch.hpt_virt + (hash 7)); @@ -833,6 +842,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, val |= HPTE_V_SECONDARY; hash = hash ^ kvm-arch.hpt_mask; } + + if (unlikely(vmode)) + preempt_enable(); return -1; } EXPORT_SYMBOL(kvmppc_hv_find_lock_hpte); @@ -864,7 +876,7 @@ long kvmppc_hpte_hv_fault(struct 
kvm_vcpu *vcpu, unsigned long addr, if (status DSISR_NOHPTE) valid |= HPTE_V_ABSENT; - index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid); + index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid, false); if (index 0) { if (status DSISR_NOHPTE) return status
[PATCH v3 0/2] powerpc kvm: fix deadlock scene
v2-v3: introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte() and hide the preemption detail inside this pair from the callers Liu Ping Fan (2): powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte powerpc: kvm: fix rare but potential deadlock scene arch/powerpc/include/asm/kvm_book3s.h | 3 ++- arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 -- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 29 - 3 files changed, 30 insertions(+), 12 deletions(-) -- 1.8.1.4
[PATCH v2 2/2] powerpc/kvm: remove redundant assignment
ret is assigned twice with the same value, so remove the later one. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Acked-by: Paul Mackerras pau...@samba.org --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index dbc1478..9b97b42 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -733,7 +733,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, lock_rmap(rmap); /* Check if we might have been invalidated; let the guest retry if so */ - ret = RESUME_GUEST; if (mmu_notifier_retry(vcpu-kvm, mmu_seq)) { unlock_rmap(rmap); goto out_unlock; -- 1.8.1.4
Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene
On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote: On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote: Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Good catch, we should have preemption disabled while ever we have a HPTE locked. @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, } /* Find the HPTE in the hash table */ + preempt_disable(); index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); + preempt_enable(); Which means we need to add the preempt_enable after unlocking the HPTE, not here. Yes. Sorry, but I am not sure about whether we can call preempt_disable/enable() in realmode. I think since thread_info is allocated with linear address, so we can use preempt_disable/enable() inside kvmppc_hv_find_lock_hpte(), right? Thanks and regards, Pingfan Regards, Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 043eec8..28160ac 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, } /* Find the HPTE in the hash table */ + preempt_disable(); index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); + preempt_enable(); if (index 0) return -ENOENT; hptep = (unsigned long *)(kvm-arch.hpt_virt + (index 4)); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] powerpc/kvm: remove redundant assignment
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 28160ac..7682837 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -731,7 +731,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, lock_rmap(rmap); /* Check if we might have been invalidated; let the guest retry if so */ - ret = RESUME_GUEST; if (mmu_notifier_retry(vcpu-kvm, mmu_seq)) { unlock_rmap(rmap); goto out_unlock; -- 1.8.1.4
[PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread
After the primary vcpu changes vcore_state to VCORE_RUNNING, there is very little chance to schedule to secondary vcpu. So if we change the code sequence around set vcore_state to VCORE_RUNNING and disable preemption, we lost little. But we simplify the entering logi, based on the fact that if primary vcpu runs, the secondary vcpu can not be scheduled. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..38b1fc0 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1222,8 +1222,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc) kvmppc_create_dtl_entry(vcpu, vc); } - vc-vcore_state = VCORE_RUNNING; preempt_disable(); + vc-vcore_state = VCORE_RUNNING; spin_unlock(vc-lock); kvm_guest_enter(); @@ -1351,12 +1351,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu) * this thread straight away and have it join in. */ if (!signal_pending(current)) { - if (vc-vcore_state == VCORE_RUNNING - VCORE_EXIT_COUNT(vc) == 0) { - vcpu-arch.ptid = vc-n_runnable - 1; - kvmppc_create_dtl_entry(vcpu, vc); - kvmppc_start_thread(vcpu); - } else if (vc-vcore_state == VCORE_SLEEPING) { + if (vc-vcore_state == VCORE_SLEEPING) { wake_up(vc-wq); } -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: IRQFD: equip irqfd and resamplefd with polarity
Nowadays, irqfd can emulate trigger mode, but it can not emulate trigger polarity. While in some cases, ioapic ioredtbl[x] expects _low_ active. So equipping irqfd with the ability. Correspondingly, resamplefd will have the same polarity as irqfd. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- This helps to step around making the interrupt component re-entrance in qemu --- include/linux/kvm_host.h | 1 + include/uapi/linux/kvm.h | 2 ++ virt/kvm/eventfd.c | 32 ++-- 3 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a63d83e..0b8c3b1 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -678,6 +678,7 @@ bool kvm_is_mmio_pfn(pfn_t pfn); struct kvm_irq_ack_notifier { struct hlist_node link; unsigned gsi; + int polarity; /* 0 high active */ void (*irq_acked)(struct kvm_irq_ack_notifier *kian); }; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index acccd08..bba3a1b 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -740,6 +740,8 @@ struct kvm_xen_hvm_config { * emlation. See Documentation/virtual/kvm/api.txt. 
*/ #define KVM_IRQFD_FLAG_RESAMPLE (1 1) +/* 0: high-active */ +#define KVM_IRQFD_FLAG_POLARITY (12) struct kvm_irqfd { __u32 fd; diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 1550637..865c656 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -77,6 +77,7 @@ struct _irqfd { struct kvm_kernel_irq_routing_entry __rcu *irq_entry; /* Used for level IRQ fast-path */ int gsi; + int polarity; /* 0 high active */ struct work_struct inject; /* The resampler used by this irqfd (resampler-only) */ struct _irqfd_resampler *resampler; @@ -100,13 +101,13 @@ irqfd_inject(struct work_struct *work) struct kvm *kvm = irqfd-kvm; if (!irqfd-resampler) { - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, 1, - false); - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, 0, - false); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, + !irqfd-polarity, false); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, + !!irqfd-polarity, false); } else kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID, - irqfd-gsi, 1, false); + irqfd-gsi, !irqfd-polarity, false); } /* @@ -123,7 +124,8 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian) resampler = container_of(kian, struct _irqfd_resampler, notifier); kvm_set_irq(resampler-kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID, - resampler-notifier.gsi, 0, false); + resampler-notifier.gsi, + !!resampler-notifier.polarity, false); rcu_read_lock(); @@ -148,7 +150,7 @@ irqfd_resampler_shutdown(struct _irqfd *irqfd) list_del(resampler-link); kvm_unregister_irq_ack_notifier(kvm, resampler-notifier); kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID, - resampler-notifier.gsi, 0, false); + resampler-notifier.gsi, !!irqfd-polarity, false); kfree(resampler); } @@ -302,6 +304,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args) irqfd-kvm = kvm; irqfd-gsi = args-gsi; + irqfd-polarity = !!(args-flags KVM_IRQFD_FLAG_POLARITY); INIT_LIST_HEAD(irqfd-list); INIT_WORK(irqfd-inject, irqfd_inject); 
INIT_WORK(irqfd-shutdown, irqfd_shutdown); @@ -337,8 +340,15 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args) list_for_each_entry(resampler, kvm-irqfds.resampler_list, link) { if (resampler-notifier.gsi == irqfd-gsi) { - irqfd-resampler = resampler; - break; + if (likely(resampler-notifier.polarity == + irqfd-polarity)) { + irqfd-resampler = resampler; + break; + } else { + ret = -EBUSY; + mutex_unlock(kvm-irqfds.resampler_lock); + goto fail; + } } } @@ -353,6 +363,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args) resampler-kvm = kvm; INIT_LIST_HEAD(resampler-list); resampler
Re: What's the usage model (purpose) of interrupt remapping in IOMMU?
On Wed, Nov 2, 2011 at 11:31 PM, Alex Williamson alex.william...@redhat.com wrote: On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote: Hi, In case of direct io, without the interrupt remapping in IOMMU (intel VT-d or AMD IOMMU), hypervisor needs to inject interrupt for guest when the guest is scheduled to specific CPU. At the beginning I thought with IOMMU's interrupt remapping, the hardware can directly forward the interrupt to guest without trapping into hypervisor when the interrupt happens, but after reading the Intel VT-d's manual, I found the interrupt mapping feature just add another mapping which allows software to control (mainly) the destination and vector, and we still need hypervisor to inject the interrupt when the guest is scheduled as only after the guest is scheduled, the target CPU can be known. If my understanding is correct, seems the interrupt remapping does not bring any performance improvement. So what's the benefit of IOMMU's interrupt remapping? Can someone explain the usage model of interrupt remapping in IOMMU? Interrupt remapping provides isolation and compatibility, not performance. The hypervisor being able to direct interrupts to a target CPU also allows it the ability to filter interrupts and prevent the device from signaling spurious or malicious interrupts. This is particularly important with message signaled interrupts since any device capable of DMA is able to inject random MSIs into the host. The compatibility side is a feature of Intel platforms supporting x2apic. The interrupt remapper provides a translation layer to allow xapic aware hardware, such as ioapics, to function when the processors are switched to x2apic mode. 
Thanks, Alex

Re: What's the usage model (purpose) of interrupt remapping in IOMMU?
On Wed, Nov 2, 2011 at 11:31 PM, Alex Williamson alex.william...@redhat.com wrote: On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote: Hi, In case of direct io, without the interrupt remapping in IOMMU (intel VT-d or AMD IOMMU), hypervisor needs to inject interrupt for guest when the guest is scheduled to specific CPU. At the beginning I thought with IOMMU's interrupt remapping, the hardware can directly forward the interrupt to guest without trapping into hypervisor when the interrupt happens, but after reading the Intel VT-d's manual, I found the interrupt mapping feature just add another mapping which allows software to control (mainly) the destination and vector, and we still need hypervisor to inject the interrupt when the guest is scheduled as only after the guest is scheduled, the target CPU can be known. If my understanding is correct, seems the interrupt remapping does not bring any performance improvement. So what's the benefit of IOMMU's interrupt remapping? Can someone explain the usage model of interrupt remapping in IOMMU? Interrupt remapping provides isolation and compatibility, not The guest can not directly program the msi-x on pci device, so msix is still under the control of host. Why do we need extra control introduced by iommu ? Thanks, Pingfan performance. The hypervisor being able to direct interrupts to a target CPU also allows it the ability to filter interrupts and prevent the device from signaling spurious or malicious interrupts. This is particularly important with message signaled interrupts since any device capable of DMA is able to inject random MSIs into the host. The compatibility side is a feature of Intel platforms supporting x2apic. The interrupt remapper provides a translation layer to allow xapic aware hardware, such as ioapics, to function when the processors are switched to x2apic mode. 
Thanks, Alex
Re: What's the usage model (purpose) of interrupt remapping in IOMMU?
On Wed, Aug 14, 2013 at 10:50 AM, Alex Williamson alex.william...@redhat.com wrote: On Wed, 2013-08-14 at 10:37 +0800, Liu ping fan wrote: On Wed, Nov 2, 2011 at 11:31 PM, Alex Williamson alex.william...@redhat.com wrote: On Wed, 2011-11-02 at 13:26 +0800, Kai Huang wrote: Hi, In case of direct io, without the interrupt remapping in IOMMU (intel VT-d or AMD IOMMU), hypervisor needs to inject interrupt for guest when the guest is scheduled to specific CPU. At the beginning I thought with IOMMU's interrupt remapping, the hardware can directly forward the interrupt to guest without trapping into hypervisor when the interrupt happens, but after reading the Intel VT-d's manual, I found the interrupt mapping feature just add another mapping which allows software to control (mainly) the destination and vector, and we still need hypervisor to inject the interrupt when the guest is scheduled as only after the guest is scheduled, the target CPU can be known. If my understanding is correct, seems the interrupt remapping does not bring any performance improvement. So what's the benefit of IOMMU's interrupt remapping? Can someone explain the usage model of interrupt remapping in IOMMU? Interrupt remapping provides isolation and compatibility, not The guest can not directly program the msi-x on pci device, so msix is still under the control of host. Why do we need extra control introduced by iommu ? An MSI interrupt is just a DMA write with a specific address and payload. Any device capable of DMA can theoretically inject an MSI interrupt using other means besides the MSI/MSI-X configuration areas. Thank you for the clear explanation. Thanks and regards, Pingfan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction
On Thu, Oct 25, 2012 at 1:16 AM, Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com wrote: Hi, On Wed, Oct 24, 2012 at 12:15:17PM +0200, Stefan Hajnoczi wrote: On Wed, Oct 24, 2012 at 10:06 AM, liu ping fan qemul...@gmail.com wrote: On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote: +static void dimm_populate(DimmDevice *s) +{ +DeviceState *dev= (DeviceState*)s; +MemoryRegion *new = NULL; + +new = g_malloc(sizeof(MemoryRegion)); +memory_region_init_ram(new, dev-id, s-size); +vmstate_register_ram_global(new); +memory_region_add_subregion(get_system_memory(), s-start, new); +s-mr = new; +} + +static void dimm_depopulate(DimmDevice *s) +{ +assert(s); +vmstate_unregister_ram(s-mr, NULL); +memory_region_del_subregion(get_system_memory(), s-mr); +memory_region_destroy(s-mr); +s-mr = NULL; +} How is dimm hot unplug protected against callers who currently have RAM mapped (from cpu_physical_memory_map())? Emulated devices call cpu_physical_memory_map() directly or indirectly through DMA emulation code. The RAM pointer may be held for arbitrary lengths of time, across main loop iterations, etc. It's not clear to me that it is safe to unplug a DIMM that has network or disk I/O buffers, for example. We also need to be robust against malicious guests who abuse the hotplug lifecycle. QEMU should never be left with dangling pointers. Not sure about the block layer. But I think those thread are already out of big lock, so there should be a MemoryListener to catch the RAM-unplug event, and if needed, bdrv_flush. do we want bdrv_flush, or some kind of cancel request e.g. bdrv_aio_cancel? My original meaning is that flush out the dangling pointer. Here is the detailed scenario: 1. Emulated device does cpu_physical_memory_map() and gets a pointer to guest RAM. 2. Return to vcpu or iothread, continue processing... 3. Hot unplug of RAM causes the guest RAM to disappear. 4. 
Pending I/O completes and overwrites memory from dangling guest RAM pointer. Any I/O device that does zero-copy I/O in QEMU faces this problem: * The block layer is affected. * The net layer is unaffected because it doesn't do zero-copy tx/rx across returns to the main loop (#2 above). * Not sure about other devices classes (e.g. USB). How should the MemoryListener callback work? For block I/O it may not be possible to cancel pending I/O asynchronously - if you try to cancel then your thread may block until the I/O completes. For current code, I think to block on the listener to wait for the completion of flushing out. But after mr-ops's ref/unref patchset accept, we can release the ref of RAM device after we have done with it (it is a very raw idea, need to improve). e.g. paio_cancel does this? is there already an API to asynchronously cancel all in flight operations in a BlockDriverState? Afaict block_job_cancel refers to streaming jobs only and doesn't help here. Can we make the RAM unplug initiate async I/O cancellations, prevent further I/Os, and only free the memory in a callback, after all DMA I/O to the associated memory region has been cancelled or completed? Also iiuc the MemoryListener should be registered from users of cpu_physical_memory_map e.g. hw/virtio.c Yes. By the way dimm_depopulate only frees the qemu memory on an ACPI _EJ request, which means that a well-behaved guest will have already offlined the memory and is not using it anymore. If the guest still uses the memory e.g. for a DMA buffer, the logical memory offlining will fail and the _EJ/qemu memory freeing will never happen. Yes. But in theory a malicious acpi guest driver could trigger _EJ requests to do step 3 above. Or perhaps the backing block driver can finish an I/O request for a zero-copy block device that the guest doesn't care for anymore? I 'll think about this a bit more. The guest is one of the users of dimm device, and block layer is another one. 
Regards, pingfan Synchronous cancel behavior is not workable since it can lead to poor latency or hangs in the guest. ok thanks, - Vasilis
Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction
On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote: +static void dimm_populate(DimmDevice *s) +{ +DeviceState *dev= (DeviceState*)s; +MemoryRegion *new = NULL; + +new = g_malloc(sizeof(MemoryRegion)); +memory_region_init_ram(new, dev-id, s-size); +vmstate_register_ram_global(new); +memory_region_add_subregion(get_system_memory(), s-start, new); +s-mr = new; +} + +static void dimm_depopulate(DimmDevice *s) +{ +assert(s); +vmstate_unregister_ram(s-mr, NULL); +memory_region_del_subregion(get_system_memory(), s-mr); +memory_region_destroy(s-mr); +s-mr = NULL; +} How is dimm hot unplug protected against callers who currently have RAM mapped (from cpu_physical_memory_map())? Emulated devices call cpu_physical_memory_map() directly or indirectly through DMA emulation code. The RAM pointer may be held for arbitrary lengths of time, across main loop iterations, etc. It's not clear to me that it is safe to unplug a DIMM that has network or disk I/O buffers, for example. We also need to be robust against malicious guests who abuse the hotplug lifecycle. QEMU should never be left with dangling pointers. Not sure about the block layer. But I think those thread are already out of big lock, so there should be a MemoryListener to catch the RAM-unplug event, and if needed, bdrv_flush. Regards, pingfan Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for Tuesday, September 4th
On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote: On 09/03/2012 09:44 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - protecting MemoryRegion::opaque during dispatch I'm guessing Ping won't make it due to timezone problems. Jan, if you will not participate, please remove the topic from the list (unless someone else wants to argue your side). Is there a log for this topic? Link? Thanks, pingfan -- error compiling committee.c: too many arguments to function
Re: KVM call agenda for Tuesday, September 4th
On Tue, Sep 4, 2012 at 4:21 PM, Avi Kivity a...@redhat.com wrote: On 09/04/2012 11:17 AM, liu ping fan wrote: On Mon, Sep 3, 2012 at 7:48 PM, Avi Kivity a...@redhat.com wrote: On 09/03/2012 09:44 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. - protecting MemoryRegion::opaque during dispatch I'm guessing Ping won't make it due to timezone problems. Jan, if you will not participate, please remove the topic from the list (unless someone else wants to argue your side). Is there log for this topic? Link? The call has not happened yet (it's in 5 hours 40 minutes), but Jan can't participate, so it will likely be cancelled. If you can participate, maybe we can have it anyway. Sorry, I can not attend. -- error compiling committee.c: too many arguments to function
Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
On Sat, Aug 11, 2012 at 9:58 AM, liu ping fan qemul...@gmail.com wrote: On Wed, Aug 8, 2012 at 5:41 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Flatview and radix view are all under the protection of pointer. And this make sure the change of them seem to be atomic! The mr accessed by radix-tree leaf or flatview will be reclaimed after the prev PhysMap not in use any longer IMO this cleverness should come much later. Let's first take care of dropping the big qemu lock, then make swithcing memory maps more efficient. The initial paths could look like: lookup: take mem_map_lock lookup take ref drop mem_map_lock update: take mem_map_lock (in core_begin) do updates drop memo_map_lock Later we can replace mem_map_lock with either a rwlock or (real) rcu. #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes phys_map_nodes_nb_alloc) { +if (map-phys_map_nodes_nb + nodes map-phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2, + 16); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc, + map-phys_map_nodes_nb + nodes); +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes, + map-phys_map_nodes_nb_alloc); } } Please have a patch that just adds the map parameter to all these functions. This makes the later patch, that adds the copy, easier to read. + +void cur_map_update(PhysMap *next) +{ +qemu_mutex_lock(cur_map_lock); +physmap_put(cur_map); +cur_map = next; +smp_mb(); +qemu_mutex_unlock(cur_map_lock); +} IMO this can be mem_map_lock. 
If we take my previous suggestion: lookup: take mem_map_lock lookup take ref drop mem_map_lock update: take mem_map_lock (in core_begin) do updates drop mem_map_lock And update it to update: prepare next_map (in core_begin) do updates take mem_map_lock (in core_commit) switch maps drop mem_map_lock free old map Note the lookup path copies the MemoryRegionSection instead of referencing it. Thus we can destroy the old map without worrying; the only pointers will point to MemoryRegions, which will be protected by the refcounts on their Objects. Just found there may be a leak here. If mrs points to a subpage, then the subpage_t could be destroyed while still referenced. To avoid that situation, we can walk down the chain to pin ourselves on the Object-based mr, but then we must expose the address conversion in subpage_read() right here. Right? Oh, just read the code logic again, and I think walking down the chain is enough. And subpage_read/write() already forwards the access, so there is no need to fold in the address translation. Regards, pingfan This can be easily switched to rcu: update: prepare next_map (in core_begin) do updates switch maps (rcu_assign_pointer) call_rcu(free old map) (or synchronize_rcu; free old maps) Again, this should be done after the simplistic patch that enables parallel lookup but keeps just one map. -- error compiling committee.c: too many arguments to function
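The "copy the MemoryRegionSection under the lock" point above can be sketched in a few lines of C. This is a toy model with invented names (Section, current_map), not the real QEMU structures; the only idea it demonstrates is that a by-value copy lets the updater rebuild the map without invalidating readers:

```c
#include <pthread.h>

/* Toy stand-in for MemoryRegionSection. */
typedef struct Section {
    unsigned long base;
    unsigned long size;
} Section;

static pthread_mutex_t mem_map_lock = PTHREAD_MUTEX_INITIALIZER;
static Section current_map[4];      /* stands in for the radix tree */

/* Lookup returns a by-value copy, so the caller holds no pointer
 * into the map and the updater may free/rebuild it afterwards. */
static Section lookup(int idx)
{
    Section s;
    pthread_mutex_lock(&mem_map_lock);
    s = current_map[idx];           /* copy, not a reference */
    pthread_mutex_unlock(&mem_map_lock);
    return s;
}

/* Updater replaces entries under the same lock. */
static void update(int idx, Section s)
{
    pthread_mutex_lock(&mem_map_lock);
    current_map[idx] = s;
    pthread_mutex_unlock(&mem_map_lock);
}
```

A copy taken before an update stays valid after the map changes, which is exactly why the old map can be destroyed without worrying about readers.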
Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
On Thu, Aug 9, 2012 at 4:00 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 09/08/2012 09:28, liu ping fan ha scritto: VCPU thread I/O thread ===================================== get MMIO request rcu_read_lock() walk memory map qdev_unmap() lock_devtree() ... unlock_devtree() unref dev -> refcnt=0, free enqueued ref() No ref() for dev here, while we have a ref to flatview+radix in my patches. I use rcu to protect the referred radix+flatview+mr. As to the dev, its ref has been incremented when it was added into the mem view -- that is, memory_region_add_subregion -> memory_region_get() { if (atomic_add_and_return()) dev->ref++; }. So until the mem view is reclaimed, the dev's ref is held by the mem view. In short, rcu protects the mem view, while the device is protected by refcnt. But the RCU critical section should not include the whole processing of MMIO, only the walk of the memory map. Yes, you are right. And I think cur_map_get() can easily be broken into the style lock; ref++; phys_page_find(); unlock. And in general I think this is a bit too tricky... I understand not adding refcounting to all of bottom halves, timers, etc., but if you are using a device you should have explicit ref/unref pairs. Actually, there are pairs -- when a dev enters the mem view, inc its ref; and when it leaves, dec its ref. But as Avi has pointed out, the mr->refcnt introduces complexity and no gain. So I will discard this design. Thanks and regards, pingfan Paolo
Re: [PATCH 15/15] e1000: using new interface--unmap to unplug
On Thu, Aug 9, 2012 at 3:40 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 09/08/2012 09:28, liu ping fan ha scritto: +static void +pci_e1000_unmap(PCIDevice *p) +{ +    /* DO NOT FREE anything! until refcnt=0 */ +    /* isolate from memory view */ +} At least you need to call the superclass method. Refer to 0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch: we have the sequence qdev_unmap -> pci_unmap_device -> pci_e1000_unmap. So pci_e1000_unmap need not do anything itself. But then this patch is unnecessary, isn't it? Yes, it should be removed from the next version. Regards, pingfan Paolo
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On Thu, Aug 9, 2012 at 4:18 PM, Avi Kivity a...@redhat.com wrote: On 08/09/2012 10:49 AM, Paolo Bonzini wrote: Il 09/08/2012 09:33, liu ping fan ha scritto: Yes, it is to defer destructors. See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch When a MemoryRegion is _del_subregion'ed from mem in the updater, it may still be in use by a reader -- radix or flatview -- so defer its destructor to the reclaimer -- phys_map_release(PhysMap *map) How are you sure that the reader is already out of its critical section by the time the reclaimer runs? If we had rcu, it would be elegant to do this. Yeah, I think inventing primitives is dangerous and difficult to review; and it may be difficult to replace them with proper call_rcu. You should probably make a proof-of-concept using liburcu. Then we can decide how to implement RCU in a way that is portable enough for QEMU's needs. IMO we should start with a simple mutex (which will cover only the lookup and map rebuild). This should reduce the contention to basically nothing (still leaving a cache line bounce). If a profile shows the cache line bounce hurting us, or perhaps contention in ultralarge guests, then we should switch to rcu. Agree. I think this will pin us on the major issue -- mmio performance. Regards, pingfan -- error compiling committee.c: too many arguments to function
Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion
On Thu, Aug 9, 2012 at 4:38 PM, Avi Kivity a...@redhat.com wrote: On 08/09/2012 10:27 AM, liu ping fan wrote: On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using a refcnt for mr, so we can separate mr's life cycle management from the referred object. When mr->ref goes 0->1, inc the referred object. When mr->ref goes 1->0, dec the referred object. The referred object can be a DeviceState, another mr, or other opaque. Please explain the motivation more fully. Actually, the aim is to manage the reference of an object used by the mem view. A DeviceState can be referred to by different subsystems; when it comes into the view of a subsystem, we hold the dev's ref. And any indirect reference will just do mr->ref++, not the dev's. This can help us avoid the down-walk through the referral chain, like alias mr ---> DeviceState. That is a lot of complexity, for no gain. Manipulating memory regions is a slow path, and can be done under the big qemu lock without any complications. OK. I will discard this design. In the previous discussion, you suggested adding dev->ref++ in core_region_add. But I think, if we can move it to a higher layer -- memory_region_{add,del}_subregion -- we can avoid duplicating it in every xx_region_add. Why would other memory listeners be impacted? They all operate under the big qemu lock. If they start using devices outside the lock, then they need to take a reference. Yes, if the unplug path is under the protection of the big lock. And just one extra question: for the ram-unplug scene, how do we protect against: updater: ram-unplug -> qemu free() -> brk() invalidates this vaddr interval reader: vhost-thread copies data from the interval I guess something like a lock/ref is used by them, but I cannot find such a mechanism in vhost_set_memory() to protect the scene against vhost_worker(). Thanks and regards, pingfan As a payment for this, we need to handle aliases, which can be avoided at core_region_add().
And mr's ref can help to avoid the down-walk. The payment is two systems of reference counts. -- error compiling committee.c: too many arguments to function
Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
On Wed, Aug 8, 2012 at 5:41 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Flatview and radix view are all under the protection of pointer. And this make sure the change of them seem to be atomic! The mr accessed by radix-tree leaf or flatview will be reclaimed after the prev PhysMap not in use any longer IMO this cleverness should come much later. Let's first take care of dropping the big qemu lock, then make swithcing memory maps more efficient. The initial paths could look like: lookup: take mem_map_lock lookup take ref drop mem_map_lock update: take mem_map_lock (in core_begin) do updates drop memo_map_lock Later we can replace mem_map_lock with either a rwlock or (real) rcu. #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes phys_map_nodes_nb_alloc) { +if (map-phys_map_nodes_nb + nodes map-phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2, +16); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc, + map-phys_map_nodes_nb + nodes); +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes, + map-phys_map_nodes_nb_alloc); } } Please have a patch that just adds the map parameter to all these functions. This makes the later patch, that adds the copy, easier to read. + +void cur_map_update(PhysMap *next) +{ +qemu_mutex_lock(cur_map_lock); +physmap_put(cur_map); +cur_map = next; +smp_mb(); +qemu_mutex_unlock(cur_map_lock); +} IMO this can be mem_map_lock. 
If we take my previous suggestion: lookup: take mem_map_lock lookup take ref drop mem_map_lock update: take mem_map_lock (in core_begin) do updates drop memo_map_lock And update it to update: prepare next_map (in core_begin) do updates take mem_map_lock (in core_commit) switch maps drop mem_map_lock free old map Note the lookup path copies the MemoryRegionSection instead of referencing it. Thus we can destroy the old map without worrying; the only pointers will point to MemoryRegions, which will be protected by the refcounts on their Objects. Just find there may be a leak here. If mrs points to subpage, then the subpage_t could be crashed by destroy. To avoid such situation, we can walk down the chain to pin us on the Object based mr, but then we must expose the address convert in subpage_read() right here. Right? Regards, pingfan This can be easily switched to rcu: update: prepare next_map (in core_begin) do updates switch maps - rcu_assign_pointer call_rcu(free old map) (or synchronize_rcu; free old maps) Again, this should be done after the simplictic patch that enables parallel lookup but keeps just one map. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
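The "prepare next map / switch under a short lock / free the old map when the last reader drops it" scheme sketched above can be modeled with a refcount in a few lines of C. PhysMap here is a toy stand-in (plus a maps_freed counter as a test hook), not the patchset's real structure:

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy model of the map-switch update path. */
typedef struct PhysMap {
    int ref;
} PhysMap;

static pthread_mutex_t mem_map_lock = PTHREAD_MUTEX_INITIALIZER;
static PhysMap *cur_map;
static int maps_freed;              /* test hook: counts destroyed maps */

static PhysMap *physmap_new(void)
{
    PhysMap *m = calloc(1, sizeof(*m));
    m->ref = 1;                     /* the cur_map pointer owns one ref */
    return m;
}

static void physmap_put(PhysMap *m)
{
    if (--m->ref == 0) {
        free(m);
        maps_freed++;
    }
}

/* Lookup: grab the current map and pin it with a ref. */
static PhysMap *cur_map_get(void)
{
    PhysMap *m;
    pthread_mutex_lock(&mem_map_lock);
    m = cur_map;
    m->ref++;
    pthread_mutex_unlock(&mem_map_lock);
    return m;
}

/* Commit: switch maps in a short critical section, then drop the old
 * map's owning ref; it is freed once the last reader is done with it. */
static void cur_map_update(PhysMap *next)
{
    PhysMap *old;
    pthread_mutex_lock(&mem_map_lock);
    old = cur_map;
    cur_map = next;
    pthread_mutex_unlock(&mem_map_lock);
    if (old) {
        physmap_put(old);
    }
}
```

Replacing the lock with rcu_read_lock() on the get side and rcu_assign_pointer()/call_rcu() on the update side gives the RCU variant described above without changing the ownership rules.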
Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion
On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using refcnt for mr, so we can separate mr's life cycle management from refered object. When mr-ref 0-1, inc the refered object. When mr-ref 1-0, dec the refered object. The refered object can be DeviceStae, another mr, or other opaque. Please explain the motivation more fully. Actually, the aim is to mange the reference of an object, used by mem view. DeviceState can be referred by different system, when it comes to the view of subsystem, we hold dev's ref. And any indirect reference will just mr-ref++, not dev's. This can help us avoid the down-walk through the referred chain, like alias mr --- DeviceState. In the previous discussion, you have suggest add dev-ref++ in core_region_add. But I think, if we can move it to higher layer -- memory_region_{add,del}_subregion, so we can avoid to duplicate do this in other xx_region_add. As a payment for this, we need to handle alias which can be avoid at core_region_add(). And mr's ref can help to avoid the down-walk. Regards, pingfan Usually a MemoryRegion will be embedded within some DeviceState, or its lifecycle will be managed by the DeviceState. So long as we keep the DeviceState alive all associated MemoryRegions should be alive as well. Why not do this directly? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/15] lock: introduce global lock for device tree
On Wed, Aug 8, 2012 at 5:42 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Please explain the motivation. AFAICT, the big qemu lock is sufficient. Oh, this is one of a series of locks for the removal of the big qemu lock. The degradation of the big lock will take several steps, including introducing a per-device private lock. By then, the device add path from the iothread and the remove path in io-dispatch will be outside the big qemu lock, and we will need this extra lock. This series is too big, so I sent out the 1st phase for review. Regards, pingfan -- error compiling committee.c: too many arguments to function
Re: [PATCH 11/15] lock: introduce global lock for device tree
On Wed, Aug 8, 2012 at 5:41 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 08/08/2012 08:25, Liu Ping Fan ha scritto: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- cpus.c | 12 main-loop.h |3 +++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/cpus.c b/cpus.c index b182b3d..a734b36 100644 --- a/cpus.c +++ b/cpus.c @@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void) } #endif /* _WIN32 */ +QemuMutex qemu_device_tree_mutex; QemuMutex qemu_global_mutex; static QemuCond qemu_io_proceeded_cond; static bool iothread_requesting_mutex; @@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void) qemu_cond_init(qemu_work_cond); qemu_cond_init(qemu_io_proceeded_cond); qemu_mutex_init(qemu_global_mutex); +qemu_mutex_init(qemu_device_tree_mutex); qemu_thread_get_self(io_thread); } @@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void) qemu_mutex_unlock(qemu_global_mutex); } +void qemu_lock_devtree(void) +{ +qemu_mutex_lock(qemu_device_tree_mutex); +} + +void qemu_unlock_devtree(void) +{ +qemu_mutex_unlock(qemu_device_tree_mutex); +} We don't need the wrappers. They are there for the big lock just because TCG needs extra work for iothread_requesting_mutex. Sorry, could you give more detail about TCG, what is extra work. Thanks, pingfan Paolo static int all_vcpus_paused(void) { CPUArchState *penv = first_cpu; diff --git a/main-loop.h b/main-loop.h index dce1cd9..17e959a 100644 --- a/main-loop.h +++ b/main-loop.h @@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void); */ void qemu_mutex_unlock_iothread(void); +void qemu_lock_devtree(void); +void qemu_unlock_devtree(void); + /* internal interfaces */ void qemu_fd_register(int fd); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/15] e1000: using new interface--unmap to unplug
On Wed, Aug 8, 2012 at 5:56 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 08/08/2012 08:25, Liu Ping Fan ha scritto: +static void +pci_e1000_unmap(PCIDevice *p) +{ +/* DO NOT FREE anything!until refcnt=0 */ +/* isolate from memory view */ +} At least you need to call the superclass method. Refer to 0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch, we have the following sequence qdev_unmap-pci_unmap_device-pci_e1000_unmap. So pci_e1000_unmap need not to do anything. Regards, pingfan Paolo static int pci_e1000_uninit(PCIDevice *dev) { @@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data) PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); k-init = pci_e1000_init; +k-unmap = pci_e1000_unmap; k-exit = pci_e1000_uninit; k-romfile = pxe-e1000.rom; k-vendor_id = PCI_VENDOR_ID_INTEL; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
On Wed, Aug 8, 2012 at 5:52 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 08/08/2012 08:25, Liu Ping Fan ha scritto: +void qdev_unplug_complete(DeviceState *dev, Error **errp) +{ +    /* isolate from mem view */ +    qdev_unmap(dev); +    qemu_lock_devtree(); +    /* isolate from device tree */ +    qdev_unset_parent(dev); +    qemu_unlock_devtree(); +    object_unref(OBJECT(dev)); Rather than deferring the free, you should defer the unref. Otherwise the following can happen when you have real RCU access to the memory map on the read side: VCPU thread I/O thread ===================================== get MMIO request rcu_read_lock() walk memory map qdev_unmap() lock_devtree() ... unlock_devtree() unref dev -> refcnt=0, free enqueued ref() No ref() for dev here, while we have a ref to flatview+radix in my patches. I use rcu to protect the referred radix+flatview+mr. As to the dev, its ref has been incremented when it was added into the mem view -- that is, memory_region_add_subregion -> memory_region_get() { if (atomic_add_and_return()) dev->ref++; }. So until the mem view is reclaimed, the dev's ref is held by the mem view. In short, rcu protects the mem view, while the device is protected by refcnt. rcu_read_unlock() free() dangling pointer! If you defer the unref, you have instead: VCPU thread I/O thread ===================================== get MMIO request rcu_read_lock() walk memory map qdev_unmap() lock_devtree() ... unlock_devtree() unref is enqueued ref() -> refcnt=2 rcu_read_unlock() unref() -> refcnt=1 unref() -> refcnt=0 So this also makes patch 14 unnecessary. Paolo +}
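Paolo's "defer the unref, not the free" point reduces to a refcount ordering argument, which a few lines of C can make concrete. Dev is a toy stand-in for DeviceState; the RCU machinery is implied by the call sequence in the usage below, not implemented:

```c
/* Toy model of the deferred-unref ordering from the diagram above. */
typedef struct Dev {
    int refcnt;
    int freed;          /* set when the last reference is dropped */
} Dev;

static void dev_ref(Dev *d)
{
    d->refcnt++;
}

static void dev_unref(Dev *d)
{
    if (--d->refcnt == 0) {
        d->freed = 1;   /* stands in for the actual free() */
    }
}
```

Because the unplug path's unref is deferred until after the reader's critical section, the count never reaches zero while the VCPU thread still holds its reference, so no dangling pointer is possible.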
Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating
On Thu, Aug 9, 2012 at 3:17 AM, Blue Swirl blauwir...@gmail.com wrote: On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using mem_map_lock to protect among updaters. So we can get the intact snapshot of mem topology -- FlatView radix-tree. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |3 +++ memory.c | 22 ++ memory.h |2 ++ 3 files changed, 27 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index 8244d54..0e29ef9 100644 --- a/exec.c +++ b/exec.c @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; The bottom level has pointers to MemoryRegionSections. */ static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; +QemuMutex mem_map_lock; + static void io_mem_init(void); static void memory_map_init(void); @@ -637,6 +639,7 @@ void cpu_exec_init_all(void) #if !defined(CONFIG_USER_ONLY) memory_map_init(); io_mem_init(); +qemu_mutex_init(mem_map_lock); I'd move this and the mutex to memory.c since there are no other uses. The mutex could be static then. But the init entry is in exec.c, not memory.c. 
Regards, pingfan #endif } diff --git a/memory.c b/memory.c index aab4a31..5986532 100644 --- a/memory.c +++ b/memory.c @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void) assert(memory_region_transaction_depth); --memory_region_transaction_depth; if (!memory_region_transaction_depth memory_region_update_pending) { +qemu_mutex_lock(mem_map_lock); memory_region_update_topology(NULL); +qemu_mutex_unlock(mem_map_lock); } } @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client) { uint8_t mask = 1 client; +qemu_mutex_lock(mem_map_lock); mr-dirty_log_mask = (mr-dirty_log_mask ~mask) | (log * mask); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr, @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr) void memory_region_set_readonly(MemoryRegion *mr, bool readonly) { if (mr-readonly != readonly) { +qemu_mutex_lock(mem_map_lock); mr-readonly = readonly; memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } } @@ -1112,7 +1118,9 @@ void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable) { if (mr-readable != readable) { mr-readable = readable; +qemu_mutex_lock(mem_map_lock); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } } @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(mem_map_lock); for (i = 0; i mr-ioeventfd_nb; ++i) { if (memory_region_ioeventfd_before(mrfd, mr-ioeventfds[i])) { break; @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, sizeof(*mr-ioeventfds) * (mr-ioeventfd_nb-1 - i)); mr-ioeventfds[i] = mrfd; memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } void memory_region_del_eventfd(MemoryRegion *mr, @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(mem_map_lock); for (i = 0; i mr-ioeventfd_nb; 
++i) { if (memory_region_ioeventfd_equal(mrfd, mr-ioeventfds[i])) { break; @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, mr-ioeventfds = g_realloc(mr-ioeventfds, sizeof(*mr-ioeventfds)*mr-ioeventfd_nb + 1); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } static void memory_region_add_subregion_common(MemoryRegion *mr, @@ -1259,6 +1271,8 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, assert(!subregion-parent); subregion-parent = mr; subregion-addr = offset; + +qemu_mutex_lock(mem_map_lock); QTAILQ_FOREACH(other, mr-subregions, subregions_link) { if (subregion-may_overlap || other-may_overlap) { continue; @@ -1289,6 +1303,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, QTAILQ_INSERT_TAIL(mr-subregions, subregion, subregions_link); done: memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } @@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr, { assert(subregion-parent == mr); subregion-parent = NULL; + +qemu_mutex_lock(mem_map_lock); QTAILQ_REMOVE(mr-subregions, subregion, subregions_link); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock
Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating
On Wed, Aug 8, 2012 at 5:13 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 09:25 AM, Liu Ping Fan wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using mem_map_lock to protect among updaters. So we can get the intact snapshot of mem topology -- FlatView radix-tree. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |3 +++ memory.c | 22 ++ memory.h |2 ++ 3 files changed, 27 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index 8244d54..0e29ef9 100644 --- a/exec.c +++ b/exec.c @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; The bottom level has pointers to MemoryRegionSections. */ static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; +QemuMutex mem_map_lock; + static void io_mem_init(void); static void memory_map_init(void); @@ -637,6 +639,7 @@ void cpu_exec_init_all(void) #if !defined(CONFIG_USER_ONLY) memory_map_init(); io_mem_init(); +qemu_mutex_init(mem_map_lock); #endif } diff --git a/memory.c b/memory.c index aab4a31..5986532 100644 --- a/memory.c +++ b/memory.c @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void) assert(memory_region_transaction_depth); --memory_region_transaction_depth; if (!memory_region_transaction_depth memory_region_update_pending) { +qemu_mutex_lock(mem_map_lock); memory_region_update_topology(NULL); +qemu_mutex_unlock(mem_map_lock); } } Seems to me that nothing in memory.c can susceptible to races. It must already be called under the big qemu lock, and with the exception of mutators (memory_region_set_*), changes aren't directly visible. Yes, what I want to do is prepare unplug out of protection of global lock. 
When io-dispatch and mmio-dispatch are both out of the big lock, we will run into the following scene: in vcpu context A, qdev_unplug_complete() -> delete subregion; in context B, a write to a pci bar -> pci mapping update -> add subregion. I think it's sufficient to take the mem_map_lock at the beginning of core_begin() and drop it at the end of core_commit(). That means all updates of volatile state, phys_map, are protected. The mem_map_lock is to protect both address_space_io and address_space_memory. Without the protection of the big lock, contention will arise between the updaters (memory_region_{add,del}_subregion) and the readers (generate_memory_topology() -> render_memory_region()). If the lock is taken just in core_begin/commit, we will have to duplicate it for each xx_begin/commit, right? And at the same time, mr->subregions is exposed under SMP without the big lock. Thanks and regards, pingfan -- error compiling committee.c: too many arguments to function
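Avi's suggestion of holding the lock from core_begin() to core_commit() pairs naturally with the existing transaction-depth counter, so nested updates collapse into one rebuild under one lock hold. A toy sketch in C with invented names (transaction_begin/commit, rebuilds), not the QEMU listener API:

```c
#include <assert.h>
#include <pthread.h>

/* Toy model: lock spans the whole topology update transaction. */
static pthread_mutex_t mem_map_lock = PTHREAD_MUTEX_INITIALIZER;
static int depth;                   /* like memory_region_transaction_depth */
static int rebuilds;                /* how many times topology was rebuilt */

static void transaction_begin(void)
{
    if (depth++ == 0) {
        pthread_mutex_lock(&mem_map_lock);   /* core_begin() */
    }
}

static void transaction_commit(void)
{
    assert(depth > 0);
    if (--depth == 0) {
        rebuilds++;                          /* rebuild flatview/radix... */
        pthread_mutex_unlock(&mem_map_lock); /* core_commit() */
    }
}
```

Readers taking the same mem_map_lock for lookup then never observe a half-updated topology, since every add/del subregion between begin and commit lands in a single critical section.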
Re: [PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy
On Thu, Aug 9, 2012 at 3:18 AM, Blue Swirl blauwir...@gmail.com wrote: On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com PhysMap contain the flatview and radix-tree view, they are snapshot of system topology and should be consistent. With PhysMap, we can swap the pointer when updating and achieve the atomic. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |8 memory.c | 33 - memory.h | 62 -- 3 files changed, 60 insertions(+), 43 deletions(-) diff --git a/exec.c b/exec.c index 0e29ef9..01b91b0 100644 --- a/exec.c +++ b/exec.c @@ -156,8 +156,6 @@ typedef struct PageDesc { #endif /* Size of the L2 (and L3, etc) page tables. */ Please copy this comment to the header file. OK, thanks. pingfan -#define L2_BITS 10 -#define L2_SIZE (1 L2_BITS) #define P_L2_LEVELS \ (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1) @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask; static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -typedef struct PhysPageEntry PhysPageEntry; static MemoryRegionSection *phys_sections; static unsigned phys_sections_nb, phys_sections_nb_alloc; @@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; -struct PhysPageEntry { -uint16_t is_leaf : 1; - /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ -uint16_t ptr : 15; -}; /* Simple allocator for PhysPageEntry nodes */ static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; diff --git a/memory.c b/memory.c index 2eaa2fc..c7f2cfd 100644 --- a/memory.c +++ b/memory.c @@ -31,17 +31,6 @@ static bool global_dirty_log = false; static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners = QTAILQ_HEAD_INITIALIZER(memory_listeners); -typedef struct AddrRange AddrRange; - -/* - * Note using signed integers limits us to physical addresses at most - * 63 bits wide. 
They are needed for negative offsetting in aliases - * (large MemoryRegion::alias_offset). - */ -struct AddrRange { -Int128 start; -Int128 size; -}; static AddrRange addrrange_make(Int128 start, Int128 size) { @@ -197,28 +186,6 @@ static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a, !memory_region_ioeventfd_before(b, a); } -typedef struct FlatRange FlatRange; -typedef struct FlatView FlatView; - -/* Range of memory in the global map. Addresses are absolute. */ -struct FlatRange { -MemoryRegion *mr; -target_phys_addr_t offset_in_region; -AddrRange addr; -uint8_t dirty_log_mask; -bool readable; -bool readonly; -}; - -/* Flattened global view of current active memory hierarchy. Kept in sorted - * order. - */ -struct FlatView { -FlatRange *ranges; -unsigned nr; -unsigned nr_allocated; -}; - typedef struct AddressSpace AddressSpace; typedef struct AddressSpaceOps AddressSpaceOps; diff --git a/memory.h b/memory.h index 740f018..357edd8 100644 --- a/memory.h +++ b/memory.h @@ -29,12 +29,72 @@ #include qemu-thread.h #include qemu/reclaimer.h +typedef struct AddrRange AddrRange; +typedef struct FlatRange FlatRange; +typedef struct FlatView FlatView; +typedef struct PhysPageEntry PhysPageEntry; +typedef struct PhysMap PhysMap; +typedef struct MemoryRegionSection MemoryRegionSection; typedef struct MemoryRegionOps MemoryRegionOps; typedef struct MemoryRegionLifeOps MemoryRegionLifeOps; typedef struct MemoryRegion MemoryRegion; typedef struct MemoryRegionPortio MemoryRegionPortio; typedef struct MemoryRegionMmio MemoryRegionMmio; +/* + * Note using signed integers limits us to physical addresses at most + * 63 bits wide. They are needed for negative offsetting in aliases + * (large MemoryRegion::alias_offset). + */ +struct AddrRange { +Int128 start; +Int128 size; +}; + +/* Range of memory in the global map. Addresses are absolute. 
*/ +struct FlatRange { +MemoryRegion *mr; +target_phys_addr_t offset_in_region; +AddrRange addr; +uint8_t dirty_log_mask; +bool readable; +bool readonly; +}; + +/* Flattened global view of current active memory hierarchy. Kept in sorted + * order. + */ +struct FlatView { +FlatRange *ranges; +unsigned nr; +unsigned nr_allocated; +}; + +struct PhysPageEntry { +uint16_t is_leaf:1; + /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ +uint16_t ptr:15; +}; + +#define L2_BITS 10 +#define L2_SIZE (1 L2_BITS) +/* This is a multi-level map on the physical address space. + The bottom level has pointers to MemoryRegionSections. */ +struct PhysMap { +Atomic ref; +PhysPageEntry root; +PhysPageEntry
Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
On Thu, Aug 9, 2012 at 3:23 AM, Blue Swirl blauwir...@gmail.com wrote: On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan qemul...@gmail.com wrote: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Flatview and radix view are all under the protection of pointer. And this make sure the change of them seem to be atomic! The mr accessed by radix-tree leaf or flatview will be reclaimed after the prev PhysMap not in use any longer Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 303 +++--- hw/vhost.c |2 +- hw/xen_pt.c |2 +- kvm-all.c |2 +- memory.c| 92 ++- memory.h|9 ++- vl.c|1 + xen-all.c |2 +- 8 files changed, 286 insertions(+), 127 deletions(-) diff --git a/exec.c b/exec.c index 01b91b0..97addb9 100644 --- a/exec.c +++ b/exec.c @@ -24,6 +24,7 @@ #include sys/mman.h #endif +#include qemu/atomic.h #include qemu-common.h #include cpu.h #include tcg.h @@ -35,6 +36,8 @@ #include qemu-timer.h #include memory.h #include exec-memory.h +#include qemu-thread.h +#include qemu/reclaimer.h #if defined(CONFIG_USER_ONLY) #include qemu.h #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -static MemoryRegionSection *phys_sections; -static unsigned phys_sections_nb, phys_sections_nb_alloc; static uint16_t phys_section_unassigned; static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; - -/* Simple allocator for PhysPageEntry nodes */ -static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; - #define PHYS_MAP_NODE_NIL (((uint16_t)~0) 1) -/* This is a multi-level map on the physical address space. - The bottom level has pointers to MemoryRegionSections. 
*/ -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; - +static QemuMutex cur_map_lock; +static PhysMap *cur_map; QemuMutex mem_map_lock; +static PhysMap *next_map; static void io_mem_init(void); static void memory_map_init(void); @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index) #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes phys_map_nodes_nb_alloc) { +if (map-phys_map_nodes_nb + nodes map-phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2, +16); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc, + map-phys_map_nodes_nb + nodes); +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes, + map-phys_map_nodes_nb_alloc); } } -static uint16_t phys_map_node_alloc(void) +static uint16_t phys_map_node_alloc(PhysMap *map) { unsigned i; uint16_t ret; -ret = phys_map_nodes_nb++; +ret = map-phys_map_nodes_nb++; assert(ret != PHYS_MAP_NODE_NIL); -assert(ret != phys_map_nodes_nb_alloc); +assert(ret != map-phys_map_nodes_nb_alloc); for (i = 0; i L2_SIZE; ++i) { -phys_map_nodes[ret][i].is_leaf = 0; -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; +map-phys_map_nodes[ret][i].is_leaf = 0; +map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; } return ret; } -static void phys_map_nodes_reset(void) -{ -phys_map_nodes_nb = 0; -} - - -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, -target_phys_addr_t *nb, uint16_t leaf, +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp, +target_phys_addr_t *index, +target_phys_addr_t *nb, +uint16_t leaf, int 
level) { PhysPageEntry *p; @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, target_phys_addr_t step = (target_phys_addr_t)1 (level * L2_BITS); if (!lp-is_leaf lp-ptr == PHYS_MAP_NODE_NIL) { -lp-ptr = phys_map_node_alloc(); -p = phys_map_nodes[lp-ptr]; +lp-ptr = phys_map_node_alloc(map); +p = map
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On Wed, Aug 8, 2012 at 5:15 PM, Avi Kivity a...@redhat.com wrote: On 08/08/2012 12:07 PM, Paolo Bonzini wrote: Il 08/08/2012 11:05, Avi Kivity ha scritto: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Collect unused object and release them at caller demand. Please explain the motivation for this patch. It's poor man RCU, I think? I thought that it was to defer destructors (finalizers) to a more suitable context. But why is the unref context unsuitable? Yes, it is to defer destructors. See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch When MemoryRegion is _del_subregion from mem in updater, it may be still in use by reader -- radix or flatview, so defer its destructors to the reclaimer --phys_map_release(PhysMap *map) If we have rcu, it could be elegant to do this. I think, I should write the commit comment here too, not until the followed patch. Regards, pingfan I don't see how it relates to RCU, where is the C and the U? Anyway the list eagerly awaits the explanation. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/15] qom: introduce reclaimer to release obj
On Wed, Aug 8, 2012 at 5:35 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 08/08/2012 08:25, Liu Ping Fan ha scritto: From: Liu Ping Fan pingf...@linux.vnet.ibm.com Collect unused object and release them at caller demand. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/qemu/reclaimer.h | 28 ++ main-loop.c |5 qemu-tool.c |5 qom/Makefile.objs|2 +- qom/reclaimer.c | 58 ++ 5 files changed, 97 insertions(+), 1 deletions(-) create mode 100644 include/qemu/reclaimer.h create mode 100644 qom/reclaimer.c diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h new file mode 100644 index 000..9307e93 --- /dev/null +++ b/include/qemu/reclaimer.h @@ -0,0 +1,28 @@ +/* + * QEMU reclaimer + * + * Copyright IBM, Corp. 2012 + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef QEMU_RECLAIMER +#define QEMU_RECLAIMER + +typedef void ReleaseHandler(void *opaque); +typedef struct Chunk { +QLIST_ENTRY(Chunk) list; +void *opaque; +ReleaseHandler *release; +} Chunk; + +typedef struct ChunkHead { +struct Chunk *lh_first; +} ChunkHead; + +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release); +void reclaimer_worker(ChunkHead *head); +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release); +void qemu_reclaimer(void); So enqueue is call_rcu and qemu_reclaimer marks a quiescent state + empties the pending call_rcu. But what's the difference between the two pairs of APIs? I add the new one for V2 to reclaim the resource for mem view. And yes, as you point out, I will delete the 2nd API, for it can be substituted by the 1st one easily. +#endif diff --git a/main-loop.c b/main-loop.c index eb3b6e6..be9d095 100644 --- a/main-loop.c +++ b/main-loop.c @@ -26,6 +26,7 @@ #include qemu-timer.h #include slirp/slirp.h #include main-loop.h +#include qemu/reclaimer.h #ifndef _WIN32 @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking) them. 
*/ qemu_bh_poll(); +/* ref to device from iohandler/bh/timer do not obey the rules, so delay + * reclaiming until now. + */ +qemu_reclaimer(); return ret; } diff --git a/qemu-tool.c b/qemu-tool.c index 318c5fc..f5fe319 100644 --- a/qemu-tool.c +++ b/qemu-tool.c @@ -21,6 +21,7 @@ #include main-loop.h #include qemu_socket.h #include slirp/libslirp.h +#include qemu/reclaimer.h #include sys/time.h @@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void) { } +void qemu_reclaimer(void) +{ +} + int use_icount; void qemu_clock_warp(QEMUClock *clock) diff --git a/qom/Makefile.objs b/qom/Makefile.objs index 5ef060a..a579261 100644 --- a/qom/Makefile.objs +++ b/qom/Makefile.objs @@ -1,4 +1,4 @@ -qom-obj-y = object.o container.o qom-qobject.o +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o qom-obj-twice-y = cpu.o common-obj-y = $(qom-obj-twice-y) user-obj-y = $(qom-obj-twice-y) diff --git a/qom/reclaimer.c b/qom/reclaimer.c new file mode 100644 index 000..6cb53e3 --- /dev/null +++ b/qom/reclaimer.c @@ -0,0 +1,58 @@ +/* + * QEMU reclaimer + * + * Copyright IBM, Corp. 2012 + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include qemu-common.h +#include qemu-thread.h +#include main-loop.h +#include qemu-queue.h +#include qemu/reclaimer.h + +static struct QemuMutex reclaimer_lock; +static QLIST_HEAD(rcl, Chunk) reclaimer_list; + +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release) +{ +Chunk *r = g_malloc0(sizeof(Chunk)); +r-opaque = opaque; +r-release = release; +QLIST_INSERT_HEAD_RCU(head, r, list); +} No lock? Yes, need! I will think it more closely. Thanks and regards, pingfan +void reclaimer_worker(ChunkHead *head) +{ +Chunk *cur, *next; + +QLIST_FOREACH_SAFE(cur, head, list, next) { +QLIST_REMOVE(cur, list); +cur-release(cur-opaque); +g_free(cur); +} QLIST_REMOVE needs a lock too, so using the lockless QLIST_INSERT_HEAD_RCU is not necessary. 
+} + +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release) +{ +Chunk *r = g_malloc0(sizeof(Chunk)); +r-opaque = opaque; +r-release = release; +qemu_mutex_lock(reclaimer_lock); +QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list); +qemu_mutex_unlock(reclaimer_lock); +} + + +void qemu_reclaimer(void) +{ +Chunk *cur, *next; + +QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) { +QLIST_REMOVE(cur, list); +cur-release(cur-opaque); +g_free(cur); +} Same here
[PATCH 02/15] qom: using atomic ops to re-implement object_ref
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/qemu/object.h |3 ++- qom/object.c | 13 + 2 files changed, 7 insertions(+), 9 deletions(-) diff --git a/include/qemu/object.h b/include/qemu/object.h index 8b17776..58db9d0 100644 --- a/include/qemu/object.h +++ b/include/qemu/object.h @@ -18,6 +18,7 @@ #include stdint.h #include stdbool.h #include qemu-queue.h +#include qemu/atomic.h struct Visitor; struct Error; @@ -262,7 +263,7 @@ struct Object ObjectClass *class; GSList *interfaces; QTAILQ_HEAD(, ObjectProperty) properties; -uint32_t ref; +Atomic ref; Object *parent; }; diff --git a/qom/object.c b/qom/object.c index 00bb3b0..822bdb7 100644 --- a/qom/object.c +++ b/qom/object.c @@ -378,7 +378,7 @@ void object_finalize(void *data) object_deinit(obj, ti); object_property_del_all(obj); -g_assert(obj-ref == 0); +g_assert(atomic_read(obj-ref) == 0); } Object *object_new_with_type(Type type) @@ -405,7 +405,7 @@ Object *object_new(const char *typename) void object_delete(Object *obj) { object_unref(obj); -g_assert(obj-ref == 0); +g_assert(atomic_read(obj-ref) == 0); g_free(obj); } @@ -639,16 +639,13 @@ GSList *object_class_get_list(const char *implements_type, void object_ref(Object *obj) { -obj-ref++; +atomic_inc(obj-ref); } void object_unref(Object *obj) { -g_assert(obj-ref 0); -obj-ref--; - -/* parent always holds a reference to its children */ -if (obj-ref == 0) { +g_assert(atomic_read(obj-ref) 0); +if (atomic_dec_and_test(obj-ref)) { object_finalize(obj); } } -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
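The patch above replaces the plain `obj->ref++`/`obj->ref--` with atomic helpers so that `object_ref`/`object_unref` stay correct once callers run outside the global lock. A portable sketch of the same pattern using C11 `<stdatomic.h>` instead of the series' hand-rolled asm (the `Obj`/`obj_ref`/`obj_unref` names and the `finalized` flag are illustrative, not QEMU's API):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct Obj {
    atomic_int ref;      /* reference count */
    bool finalized;      /* set when the last reference is dropped */
} Obj;

static void obj_init(Obj *o)
{
    atomic_init(&o->ref, 1);   /* the creator owns the initial reference */
    o->finalized = false;
}

static void obj_ref(Obj *o)
{
    atomic_fetch_add(&o->ref, 1);
}

static void obj_unref(Obj *o)
{
    /* atomic_fetch_sub returns the previous value, so seeing 1 means
     * this call dropped the last reference and must finalize -- the
     * same dec-and-test shape as the patch's atomic_dec_and_test(). */
    if (atomic_fetch_sub(&o->ref, 1) == 1) {
        o->finalized = true;   /* stand-in for object_finalize() */
    }
}
```

The key property is that exactly one thread observes the 1 to 0 transition, so finalization cannot run twice even with concurrent unrefs.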
[PATCH 03/15] qom: introduce reclaimer to release obj
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Collect unused object and release them at caller demand. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/qemu/reclaimer.h | 28 ++ main-loop.c |5 qemu-tool.c |5 qom/Makefile.objs|2 +- qom/reclaimer.c | 58 ++ 5 files changed, 97 insertions(+), 1 deletions(-) create mode 100644 include/qemu/reclaimer.h create mode 100644 qom/reclaimer.c diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h new file mode 100644 index 000..9307e93 --- /dev/null +++ b/include/qemu/reclaimer.h @@ -0,0 +1,28 @@ +/* + * QEMU reclaimer + * + * Copyright IBM, Corp. 2012 + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef QEMU_RECLAIMER +#define QEMU_RECLAIMER + +typedef void ReleaseHandler(void *opaque); +typedef struct Chunk { +QLIST_ENTRY(Chunk) list; +void *opaque; +ReleaseHandler *release; +} Chunk; + +typedef struct ChunkHead { +struct Chunk *lh_first; +} ChunkHead; + +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release); +void reclaimer_worker(ChunkHead *head); +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release); +void qemu_reclaimer(void); +#endif diff --git a/main-loop.c b/main-loop.c index eb3b6e6..be9d095 100644 --- a/main-loop.c +++ b/main-loop.c @@ -26,6 +26,7 @@ #include qemu-timer.h #include slirp/slirp.h #include main-loop.h +#include qemu/reclaimer.h #ifndef _WIN32 @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking) them. */ qemu_bh_poll(); +/* ref to device from iohandler/bh/timer do not obey the rules, so delay + * reclaiming until now. 
+ */ +qemu_reclaimer(); return ret; } diff --git a/qemu-tool.c b/qemu-tool.c index 318c5fc..f5fe319 100644 --- a/qemu-tool.c +++ b/qemu-tool.c @@ -21,6 +21,7 @@ #include main-loop.h #include qemu_socket.h #include slirp/libslirp.h +#include qemu/reclaimer.h #include sys/time.h @@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void) { } +void qemu_reclaimer(void) +{ +} + int use_icount; void qemu_clock_warp(QEMUClock *clock) diff --git a/qom/Makefile.objs b/qom/Makefile.objs index 5ef060a..a579261 100644 --- a/qom/Makefile.objs +++ b/qom/Makefile.objs @@ -1,4 +1,4 @@ -qom-obj-y = object.o container.o qom-qobject.o +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o qom-obj-twice-y = cpu.o common-obj-y = $(qom-obj-twice-y) user-obj-y = $(qom-obj-twice-y) diff --git a/qom/reclaimer.c b/qom/reclaimer.c new file mode 100644 index 000..6cb53e3 --- /dev/null +++ b/qom/reclaimer.c @@ -0,0 +1,58 @@ +/* + * QEMU reclaimer + * + * Copyright IBM, Corp. 2012 + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ */ + +#include qemu-common.h +#include qemu-thread.h +#include main-loop.h +#include qemu-queue.h +#include qemu/reclaimer.h + +static struct QemuMutex reclaimer_lock; +static QLIST_HEAD(rcl, Chunk) reclaimer_list; + +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release) +{ +Chunk *r = g_malloc0(sizeof(Chunk)); +r-opaque = opaque; +r-release = release; +QLIST_INSERT_HEAD_RCU(head, r, list); +} + +void reclaimer_worker(ChunkHead *head) +{ +Chunk *cur, *next; + +QLIST_FOREACH_SAFE(cur, head, list, next) { +QLIST_REMOVE(cur, list); +cur-release(cur-opaque); +g_free(cur); +} +} + +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release) +{ +Chunk *r = g_malloc0(sizeof(Chunk)); +r-opaque = opaque; +r-release = release; +qemu_mutex_lock(reclaimer_lock); +QLIST_INSERT_HEAD_RCU(reclaimer_list, r, list); +qemu_mutex_unlock(reclaimer_lock); +} + + +void qemu_reclaimer(void) +{ +Chunk *cur, *next; + +QLIST_FOREACH_SAFE(cur, reclaimer_list, list, next) { +QLIST_REMOVE(cur, list); +cur-release(cur-opaque); +g_free(cur); +} +} -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
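Both reviewers point out the locking hole in the patch above: `reclaimer_enqueue` inserts without taking `reclaimer_lock`, and `QLIST_REMOVE` in the drain loop needs the lock too, which makes the lockless `QLIST_INSERT_HEAD_RCU` pointless. A minimal sketch of the corrected pattern, using a plain singly linked list and a pthread mutex rather than QEMU's QLIST macros (`reclaimer_drain` and `bump_counter` are illustrative names):

```c
#include <pthread.h>
#include <stdlib.h>

typedef void ReleaseHandler(void *opaque);

typedef struct Chunk {
    struct Chunk *next;
    void *opaque;
    ReleaseHandler *release;
} Chunk;

static pthread_mutex_t reclaimer_lock = PTHREAD_MUTEX_INITIALIZER;
static Chunk *reclaimer_list;   /* pending deferred releases */

void reclaimer_enqueue(void *opaque, ReleaseHandler *release)
{
    Chunk *r = malloc(sizeof(*r));
    r->opaque = opaque;
    r->release = release;
    pthread_mutex_lock(&reclaimer_lock);   /* insertion is locked too */
    r->next = reclaimer_list;
    reclaimer_list = r;
    pthread_mutex_unlock(&reclaimer_lock);
}

void reclaimer_drain(void)
{
    /* Detach the whole list under the lock, then run the release
     * handlers outside it, so a handler may re-enqueue safely. */
    pthread_mutex_lock(&reclaimer_lock);
    Chunk *cur = reclaimer_list;
    reclaimer_list = NULL;
    pthread_mutex_unlock(&reclaimer_lock);

    while (cur) {
        Chunk *next = cur->next;
        cur->release(cur->opaque);
        free(cur);
        cur = next;
    }
}

/* Demo release handler for the usage example below. */
static void bump_counter(void *opaque)
{
    ++*(int *)opaque;
}
```

Detaching the list in one locked step also keeps the drain loop O(1) in lock hold time, instead of taking the lock per node.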
[PATCH 04/15] memory: MemoryRegion topology must be stable when updating
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using mem_map_lock to protect among updaters. So we can get the intact snapshot of mem topology -- FlatView radix-tree. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |3 +++ memory.c | 22 ++ memory.h |2 ++ 3 files changed, 27 insertions(+), 0 deletions(-) diff --git a/exec.c b/exec.c index 8244d54..0e29ef9 100644 --- a/exec.c +++ b/exec.c @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; The bottom level has pointers to MemoryRegionSections. */ static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; +QemuMutex mem_map_lock; + static void io_mem_init(void); static void memory_map_init(void); @@ -637,6 +639,7 @@ void cpu_exec_init_all(void) #if !defined(CONFIG_USER_ONLY) memory_map_init(); io_mem_init(); +qemu_mutex_init(mem_map_lock); #endif } diff --git a/memory.c b/memory.c index aab4a31..5986532 100644 --- a/memory.c +++ b/memory.c @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void) assert(memory_region_transaction_depth); --memory_region_transaction_depth; if (!memory_region_transaction_depth memory_region_update_pending) { +qemu_mutex_lock(mem_map_lock); memory_region_update_topology(NULL); +qemu_mutex_unlock(mem_map_lock); } } @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client) { uint8_t mask = 1 client; +qemu_mutex_lock(mem_map_lock); mr-dirty_log_mask = (mr-dirty_log_mask ~mask) | (log * mask); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr, @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion *mr) void memory_region_set_readonly(MemoryRegion *mr, bool readonly) { if (mr-readonly != readonly) { +qemu_mutex_lock(mem_map_lock); mr-readonly = readonly; memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } } @@ -1112,7 +1118,9 @@ void 
memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable) { if (mr-readable != readable) { mr-readable = readable; +qemu_mutex_lock(mem_map_lock); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } } @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(mem_map_lock); for (i = 0; i mr-ioeventfd_nb; ++i) { if (memory_region_ioeventfd_before(mrfd, mr-ioeventfds[i])) { break; @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, sizeof(*mr-ioeventfds) * (mr-ioeventfd_nb-1 - i)); mr-ioeventfds[i] = mrfd; memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } void memory_region_del_eventfd(MemoryRegion *mr, @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, }; unsigned i; +qemu_mutex_lock(mem_map_lock); for (i = 0; i mr-ioeventfd_nb; ++i) { if (memory_region_ioeventfd_equal(mrfd, mr-ioeventfds[i])) { break; @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, mr-ioeventfds = g_realloc(mr-ioeventfds, sizeof(*mr-ioeventfds)*mr-ioeventfd_nb + 1); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } static void memory_region_add_subregion_common(MemoryRegion *mr, @@ -1259,6 +1271,8 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, assert(!subregion-parent); subregion-parent = mr; subregion-addr = offset; + +qemu_mutex_lock(mem_map_lock); QTAILQ_FOREACH(other, mr-subregions, subregions_link) { if (subregion-may_overlap || other-may_overlap) { continue; @@ -1289,6 +1303,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, QTAILQ_INSERT_TAIL(mr-subregions, subregion, subregions_link); done: memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } @@ -1316,8 +1331,11 @@ void memory_region_del_subregion(MemoryRegion *mr, { assert(subregion-parent == mr); subregion-parent = NULL; + +qemu_mutex_lock(mem_map_lock); QTAILQ_REMOVE(mr-subregions, subregion, 
subregions_link); memory_region_update_topology(mr); +qemu_mutex_unlock(mem_map_lock); } void memory_region_set_enabled(MemoryRegion *mr, bool enabled) @@ -1325,8 +1343,10 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled) if (enabled == mr-enabled) { return; } +qemu_mutex_lock(mem_map_lock); mr-enabled = enabled; memory_region_update_topology(NULL); +qemu_mutex_unlock(mem_map_lock); } void memory_region_set_address(MemoryRegion *mr, target_phys_addr_t
[PATCH 0/15 v2] prepare unplug out of protection of global lock
background: refer to the original plan posted by Marcelo Tosatti: http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
prev version: https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03312.html
changes v1->v2:
--introduce atomic ops
--introduce a refcount for MemoryRegion
--move memory's flat view and radix tree toward rcu-style access.
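The core idea of the series is to treat the memory topology (flat view plus radix tree, bundled in `PhysMap`) as an immutable snapshot: the updater builds a complete new map and publishes it with a single pointer swap, while readers pin whatever snapshot is current and the old map is freed only after its last reader drops it. A simplified sketch of that publish/retire cycle (field names and the `freed_flag` hook are illustrative; note the bare load-then-ref on the reader side has a race window unless readers and updaters are otherwise synchronized, which the series covers with `cur_map_lock`):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct PhysMap {
    atomic_int ref;
    int generation;      /* stands in for the flatview + radix tree */
    bool *freed_flag;    /* demo hook: set when the map is retired */
} PhysMap;

static PhysMap *_Atomic cur_map;

static PhysMap *physmap_new(int generation, bool *freed_flag)
{
    PhysMap *m = malloc(sizeof(*m));
    atomic_init(&m->ref, 1);   /* the cur_map pointer holds one reference */
    m->generation = generation;
    m->freed_flag = freed_flag;
    return m;
}

static void physmap_put(PhysMap *m)
{
    if (m && atomic_fetch_sub(&m->ref, 1) == 1) {
        if (m->freed_flag) {
            *m->freed_flag = true;
        }
        free(m);
    }
}

/* Reader side: pin the current snapshot before walking it. */
static PhysMap *physmap_get_cur(void)
{
    PhysMap *m = atomic_load(&cur_map);
    atomic_fetch_add(&m->ref, 1);
    return m;
}

/* Updater side: publish a new snapshot; the old one survives until
 * the last in-flight reader calls physmap_put() on it. */
static void physmap_publish(PhysMap *next)
{
    PhysMap *old = atomic_exchange(&cur_map, next);
    physmap_put(old);   /* drop cur_map's reference to the old map */
}
```

This is the "poor man's RCU" shape discussed in the thread: the swap is the update, and the refcount stands in for a grace period.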
[PATCH 05/15] memory: introduce life_ops to MemoryRegion
From: Liu Ping Fan pingf...@linux.vnet.ibm.com The types of referred object by MemoryRegion are variable, ex, another mr, DeviceState, or other struct defined by drivers. So the refer/unrefer may be different by drivers. Using this ops, we can mange the backend object. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/ide/piix.c |6 ++-- hw/pckbd.c|6 +++- hw/serial.c |2 +- ioport.c |3 +- memory.c | 69 - memory.h | 16 + 6 files changed, 94 insertions(+), 8 deletions(-) diff --git a/hw/ide/piix.c b/hw/ide/piix.c index f5a74c2..bdd70b1 100644 --- a/hw/ide/piix.c +++ b/hw/ide/piix.c @@ -93,11 +93,11 @@ static void bmdma_setup_bar(PCIIDEState *d) for(i = 0;i 2; i++) { BMDMAState *bm = d-bmdma[i]; -memory_region_init_io(bm-extra_io, piix_bmdma_ops, bm, +memory_region_init_io_ext(bm-extra_io, piix_bmdma_ops, NULL, bm, piix-bmdma, 4); memory_region_add_subregion(d-bmdma_bar, i * 8, bm-extra_io); -memory_region_init_io(bm-addr_ioport, bmdma_addr_ioport_ops, bm, - bmdma, 4); +memory_region_init_io_ext(bm-addr_ioport, bmdma_addr_ioport_ops, + NULL, bm, bmdma, 4); memory_region_add_subregion(d-bmdma_bar, i * 8 + 4, bm-addr_ioport); } } diff --git a/hw/pckbd.c b/hw/pckbd.c index 69857ba..de3c46d 100644 --- a/hw/pckbd.c +++ b/hw/pckbd.c @@ -485,10 +485,12 @@ static int i8042_initfn(ISADevice *dev) isa_init_irq(dev, s-irq_kbd, 1); isa_init_irq(dev, s-irq_mouse, 12); -memory_region_init_io(isa_s-io + 0, i8042_data_ops, s, i8042-data, 1); +memory_region_init_io_ext(isa_s-io + 0, i8042_data_ops, NULL, s, +i8042-data, 1); isa_register_ioport(dev, isa_s-io + 0, 0x60); -memory_region_init_io(isa_s-io + 1, i8042_cmd_ops, s, i8042-cmd, 1); +memory_region_init_io_ext(isa_s-io + 1, i8042_cmd_ops, NULL, s, +i8042-cmd, 1); isa_register_ioport(dev, isa_s-io + 1, 0x64); s-kbd = ps2_kbd_init(kbd_update_kbd_irq, s); diff --git a/hw/serial.c b/hw/serial.c index a421d1e..e992c6a 100644 --- a/hw/serial.c +++ b/hw/serial.c @@ -794,7 +794,7 @@ static int serial_isa_initfn(ISADevice *dev) 
serial_init_core(s); qdev_set_legacy_instance_id(dev-qdev, isa-iobase, 3); -memory_region_init_io(s-io, serial_io_ops, s, serial, 8); +memory_region_init_io_ext(s-io, serial_io_ops, NULL, s, serial, 8); isa_register_ioport(dev, s-io, isa-iobase); return 0; } diff --git a/ioport.c b/ioport.c index 6e4ca0d..768e271 100644 --- a/ioport.c +++ b/ioport.c @@ -384,7 +384,8 @@ static void portio_list_add_1(PortioList *piolist, * Use an alias so that the callback is called with an absolute address, * rather than an offset relative to to start + off_low. */ -memory_region_init_io(region, ops, piolist-opaque, piolist-name, +memory_region_init_io_ext(region, ops, NULL, piolist-opaque, + piolist-name, INT64_MAX); memory_region_init_alias(alias, piolist-name, region, start + off_low, off_high - off_low); diff --git a/memory.c b/memory.c index 5986532..80c7529 100644 --- a/memory.c +++ b/memory.c @@ -19,6 +19,7 @@ #include bitops.h #include kvm.h #include assert.h +#include hw/qdev.h #define WANT_EXEC_OBSOLETE #include exec-obsolete.h @@ -799,6 +800,7 @@ static bool memory_region_wrong_endianness(MemoryRegion *mr) #endif } +static MemoryRegionLifeOps nops; void memory_region_init(MemoryRegion *mr, const char *name, uint64_t size) @@ -809,6 +811,7 @@ void memory_region_init(MemoryRegion *mr, if (size == UINT64_MAX) { mr-size = int128_2_64(); } +mr-life_ops = nops; mr-addr = 0; mr-subpage = false; mr-enabled = true; @@ -931,6 +934,66 @@ static void memory_region_dispatch_write(MemoryRegion *mr, memory_region_write_accessor, mr); } +static void mr_object_get(MemoryRegion *mr) +{ +object_dynamic_cast_assert(OBJECT(mr-opaque), TYPE_DEVICE); +object_ref(OBJECT(mr-opaque)); +} + +static void mr_object_put(MemoryRegion *mr) +{ +object_unref(OBJECT(mr-opaque)); +} + +static MemoryRegionLifeOps obj_ops = { +.get = mr_object_get, +.put = mr_object_put, +}; + +static void mr_alias_get(MemoryRegion *mr) +{ +} + +static void mr_alias_put(MemoryRegion *mr) +{ +} + +static MemoryRegionLifeOps 
alias_ops = { +.get = mr_alias_get, +.put = mr_alias_put, +}; + +static void mr_nop_get(MemoryRegion *mr) +{ +} + +static void mr_nop_put(MemoryRegion *mr) +{ +} + +static MemoryRegionLifeOps nops = { +.get = mr_nop_get, +.put = mr_nop_put, +}; + +void memory_region_init_io_ext(MemoryRegion
[PATCH 06/15] memory: use refcnt to manage MemoryRegion
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Using refcnt for mr, so we can separate mr's life cycle management from refered object. When mr-ref 0-1, inc the refered object. When mr-ref 1-0, dec the refered object. The refered object can be DeviceStae, another mr, or other opaque. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- memory.c | 18 ++ memory.h |5 + 2 files changed, 23 insertions(+), 0 deletions(-) diff --git a/memory.c b/memory.c index 80c7529..5dc8b59 100644 --- a/memory.c +++ b/memory.c @@ -811,6 +811,7 @@ void memory_region_init(MemoryRegion *mr, if (size == UINT64_MAX) { mr-size = int128_2_64(); } +atomic_set(mr-ref, 0); mr-life_ops = nops; mr-addr = 0; mr-subpage = false; @@ -1090,6 +1091,23 @@ static const MemoryRegionOps reservation_ops = { .endianness = DEVICE_NATIVE_ENDIAN, }; +void memory_region_get(MemoryRegion *mr) +{ +if (atomic_add_and_return(1, mr-ref) == 1) { +mr-life_ops-get(mr); +} +} + +void memory_region_put(MemoryRegion *mr) +{ +assert(atomic_read(mr-ref) 0); + +if (atomic_dec_and_test(mr-ref)) { +/* to fix, using call_rcu( ,release) */ +mr-life_ops-put(mr); +} +} + void memory_region_init_reservation(MemoryRegion *mr, const char *name, uint64_t size) diff --git a/memory.h b/memory.h index 8fb543b..740f018 100644 --- a/memory.h +++ b/memory.h @@ -18,6 +18,7 @@ #include stdint.h #include stdbool.h +#include qemu/atomic.h #include qemu-common.h #include cpu-common.h #include targphys.h @@ -26,6 +27,7 @@ #include ioport.h #include int128.h #include qemu-thread.h +#include qemu/reclaimer.h typedef struct MemoryRegionOps MemoryRegionOps; typedef struct MemoryRegionLifeOps MemoryRegionLifeOps; @@ -126,6 +128,7 @@ typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd; struct MemoryRegion { /* All fields are private - violators will be prosecuted */ const MemoryRegionOps *ops; +Atomic ref; MemoryRegionLifeOps *life_ops; void *opaque; MemoryRegion *parent; @@ -766,6 +769,8 @@ void memory_global_dirty_log_stop(void); void 
mtree_info(fprintf_function mon_printf, void *f); +void memory_region_get(MemoryRegion *mr); +void memory_region_put(MemoryRegion *mr); #endif #endif -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
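The point of pairing the refcount with `life_ops` in the patch above is that `memory_region_get`/`memory_region_put` touch the backing object only on the 0 to 1 and 1 to 0 transitions, so the referred DeviceState (or alias, or other opaque) is pinned exactly once no matter how many snapshots reference the region. A sketch of that two-level scheme with simplified types (`MR`, `demo_ops`, and `backend_pins` are illustrative):

```c
#include <stdatomic.h>

typedef struct MR MR;

typedef struct LifeOps {
    void (*get)(MR *mr);   /* pin the backing object */
    void (*put)(MR *mr);   /* release the backing object */
} LifeOps;

struct MR {
    atomic_int ref;
    const LifeOps *life_ops;
    int backend_pins;      /* demo: how often the backend was pinned */
};

static void mr_get(MR *mr)
{
    /* Only the 0 -> 1 transition acquires the backing object,
     * mirroring memory_region_get()'s atomic_add_and_return check. */
    if (atomic_fetch_add(&mr->ref, 1) == 0) {
        mr->life_ops->get(mr);
    }
}

static void mr_put(MR *mr)
{
    /* Only the 1 -> 0 transition releases it. */
    if (atomic_fetch_sub(&mr->ref, 1) == 1) {
        mr->life_ops->put(mr);
    }
}

/* Demo ops counting backend pins in place of object_ref/object_unref. */
static void demo_get(MR *mr) { mr->backend_pins++; }
static void demo_put(MR *mr) { mr->backend_pins--; }
static const LifeOps demo_ops = { demo_get, demo_put };
```

Swapping the ops table per region type is what lets aliases use no-op get/put while device-backed regions forward to the QOM refcount.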
[PATCH 01/15] atomic: introduce atomic operations
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

If out of global lock, we will be challenged by SMP in low level,
so need atomic ops.

This file is heavily copied from kernel. Currently, only x86 atomic ops
included, and will be extended for other arch for future.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/qemu/atomic.h |  161 +
 1 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 include/qemu/atomic.h

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 000..8e1fc3e
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,161 @@
+/*
+ * Simple interface for atomic operations.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
+
+typedef struct Atomic {
+    int counter;
+} Atomic;
+
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically reads the value of @v.
+ */
+static inline int atomic_read(const Atomic *v)
+{
+    return (*(volatile int *)&(v)->counter);
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type Atomic
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static inline void atomic_set(Atomic *v, int i)
+{
+    v->counter = i;
+}
+
+/**
+ * atomic_add - add integer to atomic variable
+ * @i: integer value to add
+ * @v: pointer of type Atomic
+ *
+ * Atomically adds @i to @v.
+ */
+static inline void atomic_add(int i, Atomic *v)
+{
+    asm volatile("lock; addl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub - subtract integer from atomic variable
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v.
+ */
+static inline void atomic_sub(int i, Atomic *v)
+{
+    asm volatile("lock; subl %1,%0"
+                 : "+m" (v->counter)
+                 : "ir" (i));
+}
+
+/**
+ * atomic_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type Atomic
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_sub_and_test(int i, Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; subl %2,%0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : "ir" (i) : "memory");
+    return c;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(Atomic *v)
+{
+    asm volatile("lock; incl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(Atomic *v)
+{
+    asm volatile("lock; decl %0"
+                 : "+m" (v->counter));
+}
+
+/**
+ * atomic_dec_and_test - decrement and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static inline int atomic_dec_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; decl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+/**
+ * atomic_inc_and_test - increment and test
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic_inc_and_test(Atomic *v)
+{
+    unsigned char c;
+
+    asm volatile("lock; incl %0; sete %1"
+                 : "+m" (v->counter), "=qm" (c)
+                 : : "memory");
+    return c != 0;
+}
+
+static inline int atomic_add_and_return(int i, Atomic *v)
+{
+    int ret = i;
+
+    asm volatile("lock; xaddl %0, %1"
+                 : "+r" (ret), "+m" (v->counter)
+                 : : "memory", "cc");
+
+    return ret + i;
+}
+#endif
+
+#endif
--
1.7.4.4
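The patch only provides x86 inline assembly. For readers on other hosts, the same interface can be sketched portably with C11 `<stdatomic.h>` — this is an illustration, not part of the patch; the function names simply mirror the proposed API, and the semantics are assumed to match the kernel-style helpers copied above:

```c
#include <stdatomic.h>

/* Portable sketch of the proposed Atomic interface (assumption: the
 * semantics follow the kernel-style helpers in this patch). */
typedef struct Atomic {
    atomic_int counter;
} Atomic;

static inline int atomic_read(Atomic *v)
{
    return atomic_load(&v->counter);
}

static inline void atomic_set(Atomic *v, int i)
{
    atomic_store(&v->counter, i);
}

static inline void atomic_inc(Atomic *v)
{
    atomic_fetch_add(&v->counter, 1);
}

/* Returns true when the decrement reaches zero, like the x86 version. */
static inline int atomic_dec_and_test(Atomic *v)
{
    return atomic_fetch_sub(&v->counter, 1) == 1;
}

/* Returns the new value, like the xadd-based atomic_add_and_return(). */
static inline int atomic_add_and_return(int i, Atomic *v)
{
    return atomic_fetch_add(&v->counter, i) + i;
}
```

The C11 forms default to sequentially consistent ordering, which is at least as strong as the `lock`-prefixed x86 instructions they replace.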
[PATCH 07/15] memory: inc/dec mr's ref when adding/removing from mem view
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

memory_region_{add,del}_subregion will inc/dec mr's refcnt.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 memory.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/memory.c b/memory.c
index 5dc8b59..2eaa2fc 100644
--- a/memory.c
+++ b/memory.c
@@ -1356,7 +1356,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
     assert(!subregion->parent);
     subregion->parent = mr;
     subregion->addr = offset;
-
+    memory_region_get(subregion);
     qemu_mutex_lock(&mem_map_lock);
     QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
         if (subregion->may_overlap || other->may_overlap) {
@@ -1420,6 +1420,8 @@ void memory_region_del_subregion(MemoryRegion *mr,
     qemu_mutex_lock(&mem_map_lock);
     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
     memory_region_update_topology(mr);
+    /* mr may be still in use by reader of radix, must delay to release */
+    memory_region_put(subregion);
     qemu_mutex_unlock(&mem_map_lock);
 }
--
1.7.4.4
[PATCH 08/15] memory: introduce PhysMap to present snapshot of toploygy
From: Liu Ping Fan pingf...@linux.vnet.ibm.com PhysMap contain the flatview and radix-tree view, they are snapshot of system topology and should be consistent. With PhysMap, we can swap the pointer when updating and achieve the atomic. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c |8 memory.c | 33 - memory.h | 62 -- 3 files changed, 60 insertions(+), 43 deletions(-) diff --git a/exec.c b/exec.c index 0e29ef9..01b91b0 100644 --- a/exec.c +++ b/exec.c @@ -156,8 +156,6 @@ typedef struct PageDesc { #endif /* Size of the L2 (and L3, etc) page tables. */ -#define L2_BITS 10 -#define L2_SIZE (1 L2_BITS) #define P_L2_LEVELS \ (((TARGET_PHYS_ADDR_SPACE_BITS - TARGET_PAGE_BITS - 1) / L2_BITS) + 1) @@ -185,7 +183,6 @@ uintptr_t qemu_host_page_mask; static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -typedef struct PhysPageEntry PhysPageEntry; static MemoryRegionSection *phys_sections; static unsigned phys_sections_nb, phys_sections_nb_alloc; @@ -194,11 +191,6 @@ static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; -struct PhysPageEntry { -uint16_t is_leaf : 1; - /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ -uint16_t ptr : 15; -}; /* Simple allocator for PhysPageEntry nodes */ static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; diff --git a/memory.c b/memory.c index 2eaa2fc..c7f2cfd 100644 --- a/memory.c +++ b/memory.c @@ -31,17 +31,6 @@ static bool global_dirty_log = false; static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners = QTAILQ_HEAD_INITIALIZER(memory_listeners); -typedef struct AddrRange AddrRange; - -/* - * Note using signed integers limits us to physical addresses at most - * 63 bits wide. They are needed for negative offsetting in aliases - * (large MemoryRegion::alias_offset). 
- */ -struct AddrRange { -Int128 start; -Int128 size; -}; static AddrRange addrrange_make(Int128 start, Int128 size) { @@ -197,28 +186,6 @@ static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a, !memory_region_ioeventfd_before(b, a); } -typedef struct FlatRange FlatRange; -typedef struct FlatView FlatView; - -/* Range of memory in the global map. Addresses are absolute. */ -struct FlatRange { -MemoryRegion *mr; -target_phys_addr_t offset_in_region; -AddrRange addr; -uint8_t dirty_log_mask; -bool readable; -bool readonly; -}; - -/* Flattened global view of current active memory hierarchy. Kept in sorted - * order. - */ -struct FlatView { -FlatRange *ranges; -unsigned nr; -unsigned nr_allocated; -}; - typedef struct AddressSpace AddressSpace; typedef struct AddressSpaceOps AddressSpaceOps; diff --git a/memory.h b/memory.h index 740f018..357edd8 100644 --- a/memory.h +++ b/memory.h @@ -29,12 +29,72 @@ #include qemu-thread.h #include qemu/reclaimer.h +typedef struct AddrRange AddrRange; +typedef struct FlatRange FlatRange; +typedef struct FlatView FlatView; +typedef struct PhysPageEntry PhysPageEntry; +typedef struct PhysMap PhysMap; +typedef struct MemoryRegionSection MemoryRegionSection; typedef struct MemoryRegionOps MemoryRegionOps; typedef struct MemoryRegionLifeOps MemoryRegionLifeOps; typedef struct MemoryRegion MemoryRegion; typedef struct MemoryRegionPortio MemoryRegionPortio; typedef struct MemoryRegionMmio MemoryRegionMmio; +/* + * Note using signed integers limits us to physical addresses at most + * 63 bits wide. They are needed for negative offsetting in aliases + * (large MemoryRegion::alias_offset). + */ +struct AddrRange { +Int128 start; +Int128 size; +}; + +/* Range of memory in the global map. Addresses are absolute. */ +struct FlatRange { +MemoryRegion *mr; +target_phys_addr_t offset_in_region; +AddrRange addr; +uint8_t dirty_log_mask; +bool readable; +bool readonly; +}; + +/* Flattened global view of current active memory hierarchy. 
Kept in sorted + * order. + */ +struct FlatView { +FlatRange *ranges; +unsigned nr; +unsigned nr_allocated; +}; + +struct PhysPageEntry { +uint16_t is_leaf:1; + /* index into phys_sections (is_leaf) or phys_map_nodes (!is_leaf) */ +uint16_t ptr:15; +}; + +#define L2_BITS 10 +#define L2_SIZE (1 L2_BITS) +/* This is a multi-level map on the physical address space. + The bottom level has pointers to MemoryRegionSections. */ +struct PhysMap { +Atomic ref; +PhysPageEntry root; +PhysPageEntry (*phys_map_nodes)[L2_SIZE]; +unsigned phys_map_nodes_nb; +unsigned phys_map_nodes_nb_alloc; + +MemoryRegionSection *phys_sections; +unsigned phys_sections_nb; +unsigned phys_sections_nb_alloc; + +/* FlatView */ +FlatView views[2]; +}; + /* Must match *_DIRTY_FLAGS in cpu-all.h. To be replaced with dynamic
[PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Flatview and radix view are all under the protection of pointer. And this make sure the change of them seem to be atomic! The mr accessed by radix-tree leaf or flatview will be reclaimed after the prev PhysMap not in use any longer Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 303 +++--- hw/vhost.c |2 +- hw/xen_pt.c |2 +- kvm-all.c |2 +- memory.c| 92 ++- memory.h|9 ++- vl.c|1 + xen-all.c |2 +- 8 files changed, 286 insertions(+), 127 deletions(-) diff --git a/exec.c b/exec.c index 01b91b0..97addb9 100644 --- a/exec.c +++ b/exec.c @@ -24,6 +24,7 @@ #include sys/mman.h #endif +#include qemu/atomic.h #include qemu-common.h #include cpu.h #include tcg.h @@ -35,6 +36,8 @@ #include qemu-timer.h #include memory.h #include exec-memory.h +#include qemu-thread.h +#include qemu/reclaimer.h #if defined(CONFIG_USER_ONLY) #include qemu.h #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE]; #if !defined(CONFIG_USER_ONLY) -static MemoryRegionSection *phys_sections; -static unsigned phys_sections_nb, phys_sections_nb_alloc; static uint16_t phys_section_unassigned; static uint16_t phys_section_notdirty; static uint16_t phys_section_rom; static uint16_t phys_section_watch; - -/* Simple allocator for PhysPageEntry nodes */ -static PhysPageEntry (*phys_map_nodes)[L2_SIZE]; -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc; - #define PHYS_MAP_NODE_NIL (((uint16_t)~0) 1) -/* This is a multi-level map on the physical address space. - The bottom level has pointers to MemoryRegionSections. 
*/ -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 }; - +static QemuMutex cur_map_lock; +static PhysMap *cur_map; QemuMutex mem_map_lock; +static PhysMap *next_map; static void io_mem_init(void); static void memory_map_init(void); @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index) #if !defined(CONFIG_USER_ONLY) -static void phys_map_node_reserve(unsigned nodes) +static void phys_map_node_reserve(PhysMap *map, unsigned nodes) { -if (phys_map_nodes_nb + nodes phys_map_nodes_nb_alloc) { +if (map-phys_map_nodes_nb + nodes map-phys_map_nodes_nb_alloc) { typedef PhysPageEntry Node[L2_SIZE]; -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16); -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc, - phys_map_nodes_nb + nodes); -phys_map_nodes = g_renew(Node, phys_map_nodes, - phys_map_nodes_nb_alloc); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc * 2, +16); +map-phys_map_nodes_nb_alloc = MAX(map-phys_map_nodes_nb_alloc, + map-phys_map_nodes_nb + nodes); +map-phys_map_nodes = g_renew(Node, map-phys_map_nodes, + map-phys_map_nodes_nb_alloc); } } -static uint16_t phys_map_node_alloc(void) +static uint16_t phys_map_node_alloc(PhysMap *map) { unsigned i; uint16_t ret; -ret = phys_map_nodes_nb++; +ret = map-phys_map_nodes_nb++; assert(ret != PHYS_MAP_NODE_NIL); -assert(ret != phys_map_nodes_nb_alloc); +assert(ret != map-phys_map_nodes_nb_alloc); for (i = 0; i L2_SIZE; ++i) { -phys_map_nodes[ret][i].is_leaf = 0; -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; +map-phys_map_nodes[ret][i].is_leaf = 0; +map-phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL; } return ret; } -static void phys_map_nodes_reset(void) -{ -phys_map_nodes_nb = 0; -} - - -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, -target_phys_addr_t *nb, uint16_t leaf, +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp, +target_phys_addr_t *index, +target_phys_addr_t *nb, +uint16_t leaf, int 
level) { PhysPageEntry *p; @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, target_phys_addr_t step = (target_phys_addr_t)1 (level * L2_BITS); if (!lp-is_leaf lp-ptr == PHYS_MAP_NODE_NIL) { -lp-ptr = phys_map_node_alloc(); -p = phys_map_nodes[lp-ptr]; +lp-ptr = phys_map_node_alloc(map); +p = map-phys_map_nodes[lp-ptr]; if (level == 0) { for (i = 0; i L2_SIZE; i++) { p[i].is_leaf = 1; @@ -434,7 +426,7 @@ static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t *index, } } } else
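The scheme in patches 08/09 — readers pin the current PhysMap through a refcount while an updater builds the next snapshot and publishes it with a pointer swap — can be sketched in plain C11. All names here are illustrative stand-ins, not QEMU's, and this simplified `snap_get()` ignores the load-then-incref window that the real series closes with `mem_map_lock`:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Illustrative refcounted snapshot, standing in for PhysMap. */
typedef struct Snap {
    atomic_int ref;
    int generation;          /* stand-in for the flat/radix views */
} Snap;

static Snap *_Atomic cur_snap;

static Snap *snap_new(int generation)
{
    Snap *s = malloc(sizeof(*s));
    atomic_init(&s->ref, 1);     /* the global pointer owns one ref */
    s->generation = generation;
    return s;
}

/* Reader side: pin the current snapshot (cur_map_get() analogue).
 * NOTE: racy between load and incref; the series serializes this. */
static Snap *snap_get(void)
{
    Snap *s = atomic_load(&cur_snap);
    atomic_fetch_add(&s->ref, 1);
    return s;
}

/* Drop a reference; free when the last user is done (physmap_put()). */
static void snap_put(Snap *s)
{
    if (atomic_fetch_sub(&s->ref, 1) == 1) {
        free(s);
    }
}

/* Updater side: build the next snapshot offline, then swap it in. */
static void snap_publish(Snap *next)
{
    Snap *old = atomic_exchange(&cur_snap, next);
    if (old) {
        snap_put(old);           /* release the global pointer's ref */
    }
}
```

The key property is that a reader holding an old snapshot keeps it alive across a concurrent `snap_publish()`, which is exactly why the deleted MemoryRegion in patch 07 must not be released until the previous PhysMap drops to zero users.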
[PATCH 10/15] memory: change tcg related code to using PhysMap
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Change tcg code to use PhysMap. This is separated from the prev patch for review purpose. Should be merged into prev one. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- exec.c | 27 +-- 1 files changed, 21 insertions(+), 6 deletions(-) diff --git a/exec.c b/exec.c index 97addb9..8d0dea5 100644 --- a/exec.c +++ b/exec.c @@ -1923,6 +1923,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, { target_phys_addr_t iotlb; CPUWatchpoint *wp; +PhysMap *map = cur_map_get(); if (memory_region_is_ram(section-mr)) { /* Normal RAM. */ @@ -1940,7 +1941,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, and avoid full address decoding in every device. We can't use the high bits of pd for this because IO_MEM_ROMD uses these as a ram address. */ -iotlb = section - phys_sections; +iotlb = section - map-phys_sections; iotlb += memory_region_section_addr(section, paddr); } @@ -1956,6 +1957,7 @@ target_phys_addr_t memory_region_section_get_iotlb(CPUArchState *env, } } } +physmap_put(map); return iotlb; } @@ -3185,7 +3187,12 @@ static uint16_t dummy_section(PhysMap *map, MemoryRegion *mr) MemoryRegion *iotlb_to_region(target_phys_addr_t index) { -return phys_sections[index ~TARGET_PAGE_MASK].mr; +MemoryRegion *ret; +PhysMap *map = cur_map_get(); + +ret = map-phys_sections[index ~TARGET_PAGE_MASK].mr; +physmap_put(map); +return ret; } static void io_mem_init(void) @@ -3946,13 +3953,14 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val) { uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } io_mem_write(section-mr, addr, val, 4); } else { @@ -3972,19 +3980,21 @@ 
void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val) } } } +physmap_put(map); } void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val) { uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } #ifdef TARGET_WORDS_BIGENDIAN io_mem_write(section-mr, addr, val 32, 4); @@ -3999,6 +4009,7 @@ void stq_phys_notdirty(target_phys_addr_t addr, uint64_t val) + memory_region_section_addr(section, addr)); stq_p(ptr, val); } +physmap_put(map); } /* warning: addr must be aligned */ @@ -4008,12 +4019,13 @@ static inline void stl_phys_internal(target_phys_addr_t addr, uint32_t val, uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = map-phys_sections[phys_section_rom]; } #if defined(TARGET_WORDS_BIGENDIAN) if (endian == DEVICE_LITTLE_ENDIAN) { @@ -4050,6 +4062,7 @@ static inline void stl_phys_internal(target_phys_addr_t addr, uint32_t val, (0xff ~CODE_DIRTY_FLAG)); } } +physmap_put(map); } void stl_phys(target_phys_addr_t addr, uint32_t val) @@ -4081,12 +4094,13 @@ static inline void stw_phys_internal(target_phys_addr_t addr, uint32_t val, uint8_t *ptr; MemoryRegionSection *section; +PhysMap *map = cur_map_get(); section = phys_page_find(addr TARGET_PAGE_BITS); if (!memory_region_is_ram(section-mr) || section-readonly) { addr = memory_region_section_addr(section, addr); if (memory_region_is_ram(section-mr)) { -section = phys_sections[phys_section_rom]; +section = 
map-phys_sections[phys_section_rom]; } #if defined(TARGET_WORDS_BIGENDIAN) if (endian == DEVICE_LITTLE_ENDIAN) { @@ -4123,6 +4137,7 @@ static inline void stw_phys_internal(target_phys_addr_t addr, uint32_t val, (0xff ~CODE_DIRTY_FLAG)); } } +physmap_put(map); } void
[PATCH 11/15] lock: introduce global lock for device tree
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 cpus.c      |   12 ++++++++++++
 main-loop.h |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..a734b36 100644
--- a/cpus.c
+++ b/cpus.c
@@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
 }
 #endif /* _WIN32 */

+QemuMutex qemu_device_tree_mutex;
 QemuMutex qemu_global_mutex;
 static QemuCond qemu_io_proceeded_cond;
 static bool iothread_requesting_mutex;
@@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
     qemu_cond_init(&qemu_work_cond);
     qemu_cond_init(&qemu_io_proceeded_cond);
     qemu_mutex_init(&qemu_global_mutex);
+    qemu_mutex_init(&qemu_device_tree_mutex);

     qemu_thread_get_self(&io_thread);
 }
@@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
     qemu_mutex_unlock(&qemu_global_mutex);
 }

+void qemu_lock_devtree(void)
+{
+    qemu_mutex_lock(&qemu_device_tree_mutex);
+}
+
+void qemu_unlock_devtree(void)
+{
+    qemu_mutex_unlock(&qemu_device_tree_mutex);
+}
+
 static int all_vcpus_paused(void)
 {
     CPUArchState *penv = first_cpu;
diff --git a/main-loop.h b/main-loop.h
index dce1cd9..17e959a 100644
--- a/main-loop.h
+++ b/main-loop.h
@@ -353,6 +353,9 @@ void qemu_mutex_lock_iothread(void);
  */
 void qemu_mutex_unlock_iothread(void);

+void qemu_lock_devtree(void);
+void qemu_unlock_devtree(void);
+
 /* internal interfaces */

 void qemu_fd_register(int fd);
--
1.7.4.4
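The new wrappers are thin conveniences over a plain QemuMutex. Their shape can be mirrored with pthreads — the names below are hypothetical stand-ins, not QEMU code, and the assumption is a non-recursive mutex serializing device_add, device_del, and reset walkers as in the following patch:

```c
#include <pthread.h>

/* Stand-in for qemu_device_tree_mutex (assumption: plain, non-recursive
 * mutex) serializing device_add / device_del / qdev_reset_all walkers. */
static pthread_mutex_t device_tree_mutex = PTHREAD_MUTEX_INITIALIZER;

static void lock_devtree(void)
{
    pthread_mutex_lock(&device_tree_mutex);
}

static void unlock_devtree(void)
{
    pthread_mutex_unlock(&device_tree_mutex);
}

/* Typical caller pattern from patch 12: do the tree walk under the lock
 * and release it on every exit path. */
static int with_devtree_locked(int (*fn)(void *), void *opaque)
{
    int ret;

    lock_devtree();
    ret = fn(opaque);
    unlock_devtree();
    return ret;
}

static int counter;

/* Demo callback standing in for a qbus_find()-style tree walk. */
static int bump_counter(void *opaque)
{
    (void)opaque;
    return ++counter;
}
```

Because the mutex is non-recursive, a callback passed to `with_devtree_locked()` must never take the device-tree lock again — the same constraint the series imposes on code running under `qemu_lock_devtree()`.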
[PATCH 12/15] qdev: using devtree lock to protect device's accessing
From: Liu Ping Fan pingf...@linux.vnet.ibm.com lock: qemu_device_tree_mutex competitors: --device_del(destruction of device will be postphoned until unplug ack from guest), --pci hot-unplug --iteration (qdev_reset_all) --device_add Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/pci-hotplug.c |4 hw/qdev-monitor.c | 17 - hw/qdev.c |2 ++ 3 files changed, 22 insertions(+), 1 deletions(-) diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c index e7fb780..33a9dfe 100644 --- a/hw/pci-hotplug.c +++ b/hw/pci-hotplug.c @@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr) return -1; } +qemu_lock_devtree(); d = pci_find_device(pci_find_root_bus(dom), bus, PCI_DEVFN(slot, 0)); if (!d) { monitor_printf(mon, slot %d empty\n, slot); +qemu_unlock_devtree(); return -1; } @@ -275,9 +277,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr) if (error_is_set(local_err)) { monitor_printf(mon, %s\n, error_get_pretty(local_err)); error_free(local_err); +qemu_unlock_devtree(); return -1; } +qemu_unlock_devtree(); return 0; } diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c index 7915b45..2d47fe0 100644 --- a/hw/qdev-monitor.c +++ b/hw/qdev-monitor.c @@ -429,14 +429,18 @@ DeviceState *qdev_device_add(QemuOpts *opts) /* find bus */ path = qemu_opt_get(opts, bus); + +qemu_lock_devtree(); if (path != NULL) { bus = qbus_find(path); if (!bus) { +qemu_unlock_devtree(); return NULL; } if (strcmp(object_get_typename(OBJECT(bus)), k-bus_type) != 0) { qerror_report(QERR_BAD_BUS_FOR_DEVICE, driver, object_get_typename(OBJECT(bus))); +qemu_unlock_devtree(); return NULL; } } else { @@ -444,11 +448,13 @@ DeviceState *qdev_device_add(QemuOpts *opts) if (!bus) { qerror_report(QERR_NO_BUS_FOR_DEVICE, driver, k-bus_type); +qemu_unlock_devtree(); return NULL; } } if (qdev_hotplug !bus-allow_hotplug) { qerror_report(QERR_BUS_NO_HOTPLUG, bus-name); +qemu_unlock_devtree(); return NULL; } @@ -466,6 +472,7 @@ DeviceState 
*qdev_device_add(QemuOpts *opts) } if (qemu_opt_foreach(opts, set_property, qdev, 1) != 0) { qdev_free(qdev); +qemu_unlock_devtree(); return NULL; } if (qdev-id) { @@ -478,6 +485,8 @@ DeviceState *qdev_device_add(QemuOpts *opts) OBJECT(qdev), NULL); g_free(name); } +qemu_unlock_devtree(); + if (qdev_init(qdev) 0) { qerror_report(QERR_DEVICE_INIT_FAILED, driver); return NULL; @@ -600,13 +609,19 @@ void qmp_device_del(const char *id, Error **errp) { DeviceState *dev; +/* protect against unplug ack from guest, where we really remove device + * from system + */ +qemu_lock_devtree(); dev = qdev_find_recursive(sysbus_get_default(), id); if (NULL == dev) { error_set(errp, QERR_DEVICE_NOT_FOUND, id); +qemu_unlock_devtree(); return; } - +/* Just remove from system, and drop refcnt there*/ qdev_unplug(dev, errp); +qemu_unlock_devtree(); } void qdev_machine_init(void) diff --git a/hw/qdev.c b/hw/qdev.c index af54467..17525fe 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -230,7 +230,9 @@ static int qbus_reset_one(BusState *bus, void *opaque) void qdev_reset_all(DeviceState *dev) { +qemu_lock_devtree(); qdev_walk_children(dev, qdev_reset_one, qbus_reset_one, NULL); +qemu_unlock_devtree(); } void qbus_reset_all_fn(void *opaque) -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views
From: Liu Ping Fan pingf...@linux.vnet.ibm.com When guest confirm the removal of device, we should --unmap from MemoryRegion view --isolated from device tree view Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- hw/acpi_piix4.c |4 ++-- hw/pci.c| 13 - hw/pci.h|2 ++ hw/qdev.c | 28 hw/qdev.h |3 ++- 5 files changed, 46 insertions(+), 4 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 0aace60..c209ff7 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -305,8 +305,8 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots) if (pc-no_hotplug) { slot_free = false; } else { -object_unparent(OBJECT(dev)); -qdev_free(qdev); +/* refcnt will be decreased */ +qdev_unplug_complete(qdev, NULL); } } } diff --git a/hw/pci.c b/hw/pci.c index 99a4304..2095abf 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -856,12 +856,22 @@ static int pci_unregister_device(DeviceState *dev) if (ret) return ret; -pci_unregister_io_regions(pci_dev); pci_del_option_rom(pci_dev); do_pci_unregister_device(pci_dev); return 0; } +static void pci_unmap_device(DeviceState *dev) +{ +PCIDevice *pci_dev = PCI_DEVICE(dev); +PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev); + +pci_unregister_io_regions(pci_dev); +if (pc-unmap) { +pc-unmap(pci_dev); +} +} + void pci_register_bar(PCIDevice *pci_dev, int region_num, uint8_t type, MemoryRegion *memory) { @@ -2022,6 +2032,7 @@ static void pci_device_class_init(ObjectClass *klass, void *data) DeviceClass *k = DEVICE_CLASS(klass); k-init = pci_qdev_init; k-unplug = pci_unplug_device; +k-unmap = pci_unmap_device; k-exit = pci_unregister_device; k-bus_type = TYPE_PCI_BUS; k-props = pci_props; diff --git a/hw/pci.h b/hw/pci.h index 79d38fd..1c5b909 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -145,6 +145,8 @@ typedef struct PCIDeviceClass { DeviceClass parent_class; int (*init)(PCIDevice *dev); +void (*unmap)(PCIDevice *dev); + PCIUnregisterFunc *exit; PCIConfigReadFunc *config_read; PCIConfigWriteFunc *config_write; diff --git a/hw/qdev.c 
b/hw/qdev.c index 17525fe..530eabe 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -104,6 +104,14 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus) bus_add_child(bus, dev); } +static void qdev_unset_parent(DeviceState *dev) +{ +BusState *b = dev-parent_bus; + +object_unparent(OBJECT(dev)); +bus_remove_child(b, dev); +} + /* Create a new device. This only initializes the device state structure and allows properties to be set. qdev_init should be called to initialize the actual device emulation. */ @@ -194,6 +202,26 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, dev-alias_required_for_version = required_for_version; } +static int qdev_unmap(DeviceState *dev) +{ +DeviceClass *dc = DEVICE_GET_CLASS(dev); +if (dc-unmap) { +dc-unmap(dev); +} +return 0; +} + +void qdev_unplug_complete(DeviceState *dev, Error **errp) +{ +/* isolate from mem view */ +qdev_unmap(dev); +qemu_lock_devtree(); +/* isolate from device tree */ +qdev_unset_parent(dev); +qemu_unlock_devtree(); +object_unref(OBJECT(dev)); +} + void qdev_unplug(DeviceState *dev, Error **errp) { DeviceClass *dc = DEVICE_GET_CLASS(dev); diff --git a/hw/qdev.h b/hw/qdev.h index f4683dc..705635a 100644 --- a/hw/qdev.h +++ b/hw/qdev.h @@ -47,7 +47,7 @@ typedef struct DeviceClass { /* callbacks */ void (*reset)(DeviceState *dev); - +void (*unmap)(DeviceState *dev); /* device state */ const VMStateDescription *vmsd; @@ -162,6 +162,7 @@ void qdev_init_nofail(DeviceState *dev); void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id, int required_for_version); void qdev_unplug(DeviceState *dev, Error **errp); +void qdev_unplug_complete(DeviceState *dev, Error **errp); void qdev_free(DeviceState *dev); int qdev_simple_unplug_cb(DeviceState *dev); void qdev_machine_creation_done(void); -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/15] qom: object_unref call reclaimer
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

iohandler/bh/timer may use a DeviceState after its refcnt reaches 0;
postpone the reclaim until they are done with it.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 qom/object.c |    9 +++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qom/object.c b/qom/object.c
index 822bdb7..1452b1b 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -23,6 +23,8 @@
 #include "qbool.h"
 #include "qint.h"
 #include "qstring.h"
+#include "hw/qdev.h"
+#include "qemu/reclaimer.h"

 #define MAX_INTERFACES 32
@@ -646,7 +648,12 @@ void object_unref(Object *obj)
 {
     g_assert(atomic_read(&obj->ref) > 0);
     if (atomic_dec_and_test(&obj->ref)) {
-        object_finalize(obj);
+        /* fixme, maybe introduce obj->finalize to make this more elegant */
+        if (object_dynamic_cast(obj, TYPE_DEVICE) != NULL) {
+            qemu_reclaimer_enqueue(obj, object_finalize);
+        } else {
+            object_finalize(obj);
+        }
     }
 }
--
1.7.4.4
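The reclaimer used here (`qemu/reclaimer.h`, introduced earlier in the series) defers finalization until pending bh/timer users are finished. A minimal single-threaded sketch of such a deferred-release queue — illustrative names only, not the actual QEMU implementation:

```c
#include <stdlib.h>

typedef void ReleaseFn(void *obj);

typedef struct Reclaim {
    void *obj;
    ReleaseFn *release;
    struct Reclaim *next;
} Reclaim;

static Reclaim *reclaim_head;

/* Enqueue instead of freeing immediately (qemu_reclaimer_enqueue()
 * analogue): the object stays valid for in-flight bh/timer users. */
static void reclaimer_enqueue(void *obj, ReleaseFn *release)
{
    Reclaim *r = malloc(sizeof(*r));
    r->obj = obj;
    r->release = release;
    r->next = reclaim_head;
    reclaim_head = r;
}

/* Called at a quiescent point, once no deferred user can still hold
 * a pointer; returns how many objects were released. */
static int reclaimer_flush(void)
{
    int n = 0;

    while (reclaim_head) {
        Reclaim *r = reclaim_head;
        reclaim_head = r->next;
        r->release(r->obj);
        free(r);
        n++;
    }
    return n;
}

static int released_count;

/* Demo release callback standing in for object_finalize(). */
static void count_release(void *obj)
{
    (void)obj;
    released_count++;
}
```

The point of the indirection is visible in the patch: `object_unref()` drops to zero, but `object_finalize()` runs later, at a point where no iohandler, bottom half, or timer can still reference the device.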
[PATCH 15/15] e1000: using new interface--unmap to unplug
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/e1000.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/hw/e1000.c b/hw/e1000.c
index 4573f13..fa71455 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1192,6 +1192,13 @@ e1000_cleanup(VLANClientState *nc)
     s->nic = NULL;
 }

+static void
+pci_e1000_unmap(PCIDevice *p)
+{
+    /* DO NOT FREE anything! until refcnt=0 */
+    /* isolate from memory view */
+}
+
 static int
 pci_e1000_uninit(PCIDevice *dev)
 {
@@ -1275,6 +1282,7 @@ static void e1000_class_init(ObjectClass *klass, void *data)
     PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

     k->init = pci_e1000_init;
+    k->unmap = pci_e1000_unmap;
     k->exit = pci_e1000_uninit;
     k->romfile = "pxe-e1000.rom";
     k->vendor_id = PCI_VENDOR_ID_INTEL;
--
1.7.4.4
Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it
On Wed, Jul 25, 2012 at 5:08 PM, Paolo Bonzini pbonz...@redhat.com wrote:
> Il 25/07/2012 05:31, Liu Ping Fan ha scritto:
>> From: Liu Ping Fan pingf...@linux.vnet.ibm.com
>>
>> rwlock: qemu_device_tree_mutex
>>
>> rd side:
>>   --device_del (destruction of device will be postponed until unplug
>>     ack from guest)
>>   --pci hot-unplug
>>   --iteration (qdev_reset_all)
>> wr side:
>>   --device_add
>>
>> Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
>> ---
>>  hw/pci-hotplug.c  |    4 ++++
>>  hw/qdev-monitor.c |   17 +++++++++++++++-
>>  hw/qdev.c         |    2 ++
>>  3 files changed, 22 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/pci-hotplug.c b/hw/pci-hotplug.c
>> index e7fb780..b3b88c1 100644
>> --- a/hw/pci-hotplug.c
>> +++ b/hw/pci-hotplug.c
>> @@ -265,9 +265,11 @@ static int pci_device_hot_remove(Monitor *mon, const char *pci_addr)
>>          return -1;
>>      }
>>
>> +    qemu_rwlock_rdlock_devtree();
>
> This is not defined anywhere, is a piece missing in the patch?

Oh, yes, I missed that patch. In it, these rwlock calls are just
placeholders. I see there is already an attempt to implement rwlock for
QEMU:
  http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg00192.html
Is it time to introduce rwlock for QEMU?

Thanks,
pingfan

> Paolo
Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching
On Wed, Jul 25, 2012 at 5:18 PM, Paolo Bonzini pbonz...@redhat.com wrote:
> Il 25/07/2012 10:12, liu ping fan ha scritto:
>>>> +        qemu_rwlock_rdlock_devtree();
>>>>          section = phys_page_find(page >> TARGET_PAGE_BITS);
>>>> +        if (!(memory_region_is_ram(section->mr) ||
>>>> +            memory_region_is_romd(section->mr)) && !is_write) {
>>>> +            bk = get_backend(section->mr, addr);
>>>> +            object_ref(bk);
>>>> +        } else if (!memory_region_is_ram(section->mr) && is_write) {
>>>> +            bk = get_backend(section->mr, addr);
>>>> +            object_ref(bk);
>>>> +        }
>>>> +        qemu_rwlock_unlock_devtree();
>>>>
>>>>          if (is_write) {
>>>>              if (!memory_region_is_ram(section->mr)) {
>>>> @@ -3426,6 +3462,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>>>                  io_mem_write(section->mr, addr1, val, 1);
>>>>                  l = 1;
>>>>              }
>>>> +            object_unref(bk);
>>>
>>> Currently object_ref()/object_unref() are not atomic. Will you send
>>
>> We obey the rule: rdlock -> search -> ref_get, wrlock -> remove -> ref_put.
>> So can it cause problems if object_ref()/object_unref() are not atomic?
>
> Yes, two CPUs can perform object_ref at the same time. You can find a
> header file for atomic operations here:
>
>   https://github.com/bonzini/qemu/commit/atomics.patch

Got it, thanks

> Paolo
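Paolo's point above is that holding a read lock does not make the refcount update safe: two readers may increment concurrently, and a plain `++` is a non-atomic read-modify-write that can lose counts. A small illustration of the difference (names are illustrative, not QEMU's):

```c
#include <stdatomic.h>

/* Contrast between a racy plain increment and an atomic one. */
typedef struct Obj {
    int plain_ref;           /* ++ is load/add/store: racy across CPUs */
    atomic_int ref;          /* single atomic read-modify-write: safe  */
} Obj;

static void obj_ref_plain(Obj *o)
{
    o->plain_ref++;          /* two CPUs here can both read the same
                              * value and one increment is lost */
}

static void obj_ref(Obj *o)
{
    atomic_fetch_add(&o->ref, 1);
}

/* Returns true when the caller dropped the last reference and should
 * finalize (or enqueue to the reclaimer, as patch 14 does). */
static int obj_unref(Obj *o)
{
    return atomic_fetch_sub(&o->ref, 1) == 1;
}
```

This is exactly why the series needs the atomic helpers from patch 01: multiple readers under `rdlock` may call `object_ref()` at the same time even though removal is exclusive under `wrlock`.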
Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching
On Wed, Jul 25, 2012 at 8:27 PM, Avi Kivity a...@redhat.com wrote:
> On 07/25/2012 01:58 PM, Avi Kivity wrote:
>>>> while (len > 0) {
>>>>     page = addr & TARGET_PAGE_MASK;
>>>>     l = (page + TARGET_PAGE_SIZE) - addr;
>>>>     if (l > len)
>>>>         l = len;
>>>> +
>>>> +    qemu_rwlock_rdlock_devtree();
>>>>     section = phys_page_find(page >> TARGET_PAGE_BITS);
>>>
>>> Does the devtree lock also protect the data structures accessed by
>>> phys_page_find()?  Seems wrong.
>
> The right way is to object_ref() in core_region_add() and object_unref()
> in core_region_del().  We're guaranteed that mr->object is alive during
> _add(), and DeviceClass::unmap() ensures that the extra ref doesn't
> block destruction.

OK, I see. I will try it that way. But when
memory_region_destroy() -> ... -> core_region_del() runs, should we
reset lp.ptr to phys_section_unassigned?  Otherwise, when using a removed
target_phys_addr_t, we would still get a pointer to an invalid
MemoryRegion.

Thanks,
pingfan

> --
> error compiling committee.c: too many arguments to function
Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it
On Thu, Jul 26, 2012 at 9:00 PM, Avi Kivity <a...@redhat.com> wrote:
> On 07/26/2012 03:56 PM, liu ping fan wrote:
>> On Wed, Jul 25, 2012 at 5:08 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>> On 25/07/2012 05:31, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>>>
>>>> rwlock: qemu_device_tree_mutex
>>>>
>>>> rd side:
>>>>   --device_del (destruction of the device will be postponed until
>>>>     unplug ack from guest)
>>>>   --pci hot-unplug
>>>>   --iteration (qdev_reset_all)
>>>> wr side:
>>>>   --device_add
>>>
>>> This is not defined anywhere; is a piece missing in the patch?
>>
>> Oh, yes, I missed that patch. In it, these rwlocks are just place
>> holders. I see there is already an attempt to implement a rwlock for
>> qemu: http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg00192.html
>> Is it time to introduce a rwlock for qemu?
>
> From the description above, I don't see why it can't be a mutex.

Searching in the device tree (or MemoryRegion view) can often happen in
parallel, especially in the mmio-dispatch code path.

Thanx, pingfan
Re: [PATCH 1/5] qom: adopt rwlock to protect accessing dev from removing it
On Thu, Jul 26, 2012 at 9:15 PM, Avi Kivity <a...@redhat.com> wrote:
> On 07/26/2012 04:14 PM, liu ping fan wrote:
>>> From the description above, I don't see why it can't be a mutex.
>>
>> Searching in the device tree (or MemoryRegion view) can often happen in
>> parallel, especially in the mmio-dispatch code path.
>
> In mmio dispatch we have a pointer to the object; we don't need to
> search anything. Is device tree search a hot path?

I think we need a lock to protect the search (phys_page_find()) from the
deleter (DeviceClass::unmap), so a rwlock?
Re: [PATCH 2/5] exec.c: use refcnt to protect device during dispatching
On Wed, Jul 25, 2012 at 3:43 PM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Wed, Jul 25, 2012 at 4:31 AM, Liu Ping Fan <qemul...@gmail.com> wrote:
>> @@ -3396,13 +3420,25 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>      uint32_t val;
>>      target_phys_addr_t page;
>>      MemoryRegionSection *section;
>> +    Object *bk;
>>
>>      while (len > 0) {
>>          page = addr & TARGET_PAGE_MASK;
>>          l = (page + TARGET_PAGE_SIZE) - addr;
>>          if (l > len)
>>              l = len;
>> +
>> +        qemu_rwlock_rdlock_devtree();
>>          section = phys_page_find(page >> TARGET_PAGE_BITS);
>> +        if (!(memory_region_is_ram(section->mr) ||
>> +              memory_region_is_romd(section->mr)) && !is_write) {
>> +            bk = get_backend(section->mr, addr);
>> +            object_ref(bk);
>> +        } else if (!memory_region_is_ram(section->mr) && is_write) {
>> +            bk = get_backend(section->mr, addr);
>> +            object_ref(bk);
>> +        }
>> +        qemu_rwlock_unlock_devtree();
>>
>>          if (is_write) {
>>              if (!memory_region_is_ram(section->mr)) {
>> @@ -3426,6 +3462,7 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>>                  io_mem_write(section->mr, addr1, val, 1);
>>                  l = 1;
>>              }
>> +            object_unref(bk);
>
> Currently object_ref()/object_unref() are not atomic. Will you send
> another patch to perform atomic increment/decrement, or how will
> per-device synchronization work?

We obey the rule: rdlock -> search -> ref_get, wrlock -> remove -> ref_put.
So can it cause problems if object_ref()/object_unref() are not atomic?

Thanx, pingfan

> Stefan
Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop
On Wed, Jul 25, 2012 at 3:37 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 25/07/2012 09:03, Stefan Hajnoczi wrote:
>>> From: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>>
>>> iohandler/bh/timer may use DeviceState when its refcnt=0,
>
> It's not clear to me how. The only reference to devices from an
> iohandler/bh/timer can be in the opaque. Now, if you have an
> iohandler/bh/timer whose opaque is a DeviceState, you should bump the
> refcount before setting it up, and unref after tearing it down.

Yes, I admit refcnt is a good solution, but I think it means that we
would have to fix this for each device's bh. And this way seems lazy.

Thanx, pingfan

> See for example how hw/scsi-disk.c bumps the refcount of a request
> before starting asynchronous I/O and calls unref at the end of the
> asynchronous I/O callback.
>
> Paolo
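[Editor's note] The rule Paolo states, take a reference when handing a device to a bh and drop it when the bh is done, can be sketched with toy types (these are not QEMU's BH or Object APIs; `bh_schedule`/`bh_run` are invented for illustration):

```c
#include <assert.h>

/* Toy device: only the refcount matters for this sketch. */
typedef struct Device {
    int ref;
} Device;

/* Toy bottom half: a callback plus its opaque pointer. */
typedef struct BH {
    void (*cb)(void *opaque);
    void *opaque;
} BH;

/* Setting up the bh takes a reference on behalf of the callback, so the
 * device cannot be finalized while the bh is still pending. */
static void bh_schedule(BH *bh, void (*cb)(void *), Device *dev)
{
    dev->ref++;          /* object_ref(dev) */
    bh->cb = cb;
    bh->opaque = dev;
}

/* Running the bh drops that reference once the callback has returned;
 * if this was the last ref, finalization would happen here. */
static void bh_run(BH *bh)
{
    bh->cb(bh->opaque);
    ((Device *)bh->opaque)->ref--;   /* object_unref(dev) */
}

static void dev_cb(void *opaque)
{
    /* The device is guaranteed alive for the whole callback. */
    (void)opaque;
}
```

This is the same pattern Paolo points to in hw/scsi-disk.c: ref before starting the asynchronous work, unref in (or after) the completion callback.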
Re: [PATCH 4/5] qom: delay DeviceState's reclaim to main-loop
On Wed, Jul 25, 2012 at 3:03 PM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Wed, Jul 25, 2012 at 4:31 AM, Liu Ping Fan <qemul...@gmail.com> wrote:
>> From: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>>
>> iohandler/bh/timer may use DeviceState when its refcnt=0; postpone the
>> reclaim until they are done with it.
>>
>> Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
>> ---
>>  include/qemu/object.h |  2 +-
>>  main-loop.c           |  4
>>  main-loop.h           |  2 ++
>>  qemu-tool.c           |  4
>>  qom/Makefile.objs     |  2 +-
>>  qom/object.c          |  7 ++-
>>  qom/reclaimer.c       | 41 +
>>  7 files changed, 59 insertions(+), 3 deletions(-)
>>  create mode 100644 qom/reclaimer.c
>>
>> diff --git a/include/qemu/object.h b/include/qemu/object.h
>> index 8b17776..b233ee4 100644
>> --- a/include/qemu/object.h
>> +++ b/include/qemu/object.h
>> @@ -958,5 +958,5 @@ int object_child_foreach(Object *obj, int (*fn)(Object *child, void *opaque),
>>   */
>>  Object *container_get(Object *root, const char *path);
>> -
>> +void qemu_reclaimer_enqueue(Object *obj);
>>  #endif
>> diff --git a/main-loop.c b/main-loop.c
>> index eb3b6e6..f9cecc5 100644
>> --- a/main-loop.c
>> +++ b/main-loop.c
>> @@ -505,5 +505,9 @@ int main_loop_wait(int nonblocking)
>>          them. */
>>      qemu_bh_poll();
>>
>> +    /* refs to devices from iohandler/bh/timer do not obey the rules, so
>> +     * delay reclaiming until now.
>> +     */
>> +    qemu_device_reclaimer();
>>      return ret;
>>  }
>> diff --git a/main-loop.h b/main-loop.h
>> index cedddf5..1a59a6d 100644
>> --- a/main-loop.h
>> +++ b/main-loop.h
>> @@ -367,4 +367,6 @@ void qemu_bh_schedule_idle(QEMUBH *bh);
>>  int qemu_bh_poll(void);
>>  void qemu_bh_update_timeout(uint32_t *timeout);
>>
>> +void qemu_device_reclaimer(void);
>> +
>>  #endif
>> diff --git a/qemu-tool.c b/qemu-tool.c
>> index 318c5fc..34d959b 100644
>> --- a/qemu-tool.c
>> +++ b/qemu-tool.c
>> @@ -75,6 +75,10 @@ void qemu_mutex_unlock_iothread(void)
>>  {
>>  }
>>
>> +void qemu_device_reclaimer(void)
>> +{
>> +}
>> +
>>  int use_icount;
>>
>>  void qemu_clock_warp(QEMUClock *clock)
>> diff --git a/qom/Makefile.objs b/qom/Makefile.objs
>> index 5ef060a..a579261 100644
>> --- a/qom/Makefile.objs
>> +++ b/qom/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -qom-obj-y = object.o container.o qom-qobject.o
>> +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
>>  qom-obj-twice-y = cpu.o
>>  common-obj-y = $(qom-obj-twice-y)
>>  user-obj-y = $(qom-obj-twice-y)
>> diff --git a/qom/object.c b/qom/object.c
>> index 00bb3b0..227d966 100644
>> --- a/qom/object.c
>> +++ b/qom/object.c
>> @@ -649,7 +649,12 @@ void object_unref(Object *obj)
>>      /* parent always holds a reference to its children */
>>      if (obj->ref == 0) {
>> -        object_finalize(obj);
>> +        /* fixme, maybe introduce obj->finalize to make this more elegant */
>> +        if (object_dynamic_cast(obj, "device") != NULL) {
>
> hw/qdev.h:#define TYPE_DEVICE "device"
>
> This should be object_dynamic_cast(obj, TYPE_DEVICE).

Yes, thanks.

> Stefan
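[Editor's note] The patch above defers finalization of zero-ref devices to the main loop: `object_unref()` enqueues the device instead of finalizing it, and `qemu_device_reclaimer()` drains the queue once per `main_loop_wait()` iteration, after `qemu_bh_poll()`. The queue mechanics can be sketched as follows (simplified; `free()` stands in for `object_finalize()` and the function names differ from the patch's):

```c
#include <assert.h>
#include <stdlib.h>

/* One pending-reclaim entry; the patch's qom/reclaimer.c would hold
 * something similar. */
typedef struct Reclaim {
    struct Reclaim *next;
    void *obj;
} Reclaim;

static Reclaim *pending_head;

/* Called instead of finalizing when a device's refcount hits 0: the
 * object is queued rather than freed immediately, because an
 * iohandler/bh/timer may still touch it during this iteration. */
static void reclaimer_enqueue(void *obj)
{
    Reclaim *r = malloc(sizeof(*r));
    r->obj = obj;
    r->next = pending_head;
    pending_head = r;
}

/* Called once per main-loop iteration, after all bhs have run: every
 * queued object is now safe to release. Returns how many were freed. */
static int reclaimer_drain(void)
{
    int n = 0;
    while (pending_head) {
        Reclaim *r = pending_head;
        pending_head = r->next;
        free(r->obj);   /* object_finalize(obj) in the real patch */
        free(r);
        n++;
    }
    return n;
}
```

This buys safety at the cost Paolo objects to in the sibling thread: correctness depends on every dangling user finishing before the drain point, rather than on each user holding its own reference.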