Re: [PATCH] irq_remapping: move structs outside #ifdef
On Fri, 18 Sep 2015, Paolo Bonzini wrote:
> This is friendlier to clients of the code, who are going to prepare
> vcpu_data structs unconditionally, even if CONFIG_IRQ_REMAP is not
> defined.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>

> ---
> Please ack, I'd like to include it in the 4.4 pull request.
>
>  arch/x86/include/asm/irq_remapping.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index 046c7fb1ca43..a210eba2727c 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -33,6 +33,11 @@ enum irq_remap_cap {
>  	IRQ_POSTING_CAP = 0,
>  };
>
> +struct vcpu_data {
> +	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
> +	u32 vector;		/* Guest vector of the interrupt */
> +};
> +
>  #ifdef CONFIG_IRQ_REMAP
>
>  extern bool irq_remapping_cap(enum irq_remap_cap cap);
> @@ -58,11 +63,6 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
>  	return x86_vector_domain;
>  }
>
> -struct vcpu_data {
> -	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
> -	u32 vector;		/* Guest vector of the interrupt */
> -};
> -
>  #else /* CONFIG_IRQ_REMAP */
>
>  static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
> --
> 1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 3/4] irqchip: GIC: Convert to EOImode == 1
On Tue, 25 Aug 2015, Marc Zyngier wrote:
> +static struct static_key supports_deactivate = STATIC_KEY_INIT_TRUE;
> +
>  #ifndef MAX_GIC_NR
>  #define MAX_GIC_NR	1
>  #endif
> @@ -137,6 +140,14 @@ static inline unsigned int gic_irq(struct irq_data *d)
>  	return d->hwirq;
>  }
>
> +static inline bool primary_gic_irq(struct irq_data *d)
> +{
> +	if (MAX_GIC_NR > 1)
> +		return irq_data_get_irq_chip_data(d) == &gic_data[0];
> +
> +	return true;
> +}
> +
>  /*
>   * Routines to acknowledge, disable and enable interrupts
>   */
> @@ -164,7 +175,14 @@ static void gic_unmask_irq(struct irq_data *d)
>  static void gic_eoi_irq(struct irq_data *d)
>  {
> -	writel_relaxed(gic_irq(d), gic_cpu_base(d) + GIC_CPU_EOI);
> +	u32 deact_offset = GIC_CPU_EOI;
> +
> +	if (static_key_true(&supports_deactivate)) {
> +		if (primary_gic_irq(d))
> +			deact_offset = GIC_CPU_DEACTIVATE;

I really wonder for the whole series whether you really want all that
static key dance and extra conditionals in the callbacks instead of
just using separate irq chips for the different interrupts.

Thanks,

	tglx
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Fri, 7 Aug 2015, Peter Zijlstra wrote:
> On Fri, Aug 07, 2015 at 12:57:38PM +0200, Peter Zijlstra wrote:
> > > +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >
> > this one has no users; the __ suggests that it is the locked edition.
>
> Maybe it is for the completions…

Yeah, who knows, I certainly do not anymore ;-)

> On that, we cannot convert completions to swait. Because swait
> wake_all must not happen from IRQ context, and complete_all()
> typically is used from just that.

Well, completions are not the ones which have a gazillion of waiters
and we got quite some performance back from using swait in completions
on RT.

Thanks,

	tglx
Re: [PATCH 4/6] irqchip: GIC: Use chip_data instead of handler_data for cascaded interrupts
On Thu, 9 Jul 2015, Marc Zyngier wrote:
> When used as a primary interrupt controller, the GIC driver uses
> irq_data->chip_data to extract controller-specific information.
>
> When used as a secondary interrupt controller, it uses handler_data
> instead. As this difference is relatively pointless and only creates
> confusion, change the secondary path to match what the rest of the
> driver does.
>
> Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
> ---
>  drivers/irqchip/irq-gic.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
> index e264675..3c7f3a4 100644
> --- a/drivers/irqchip/irq-gic.c
> +++ b/drivers/irqchip/irq-gic.c
> @@ -298,7 +298,7 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
>  static void gic_handle_cascade_irq(unsigned int irq, struct irq_desc *desc)
>  {
> -	struct gic_chip_data *chip_data = irq_get_handler_data(irq);
> +	struct gic_chip_data *chip_data = irq_get_chip_data(irq);
>  	struct irq_chip *chip = irq_get_chip(irq);

You should make that

	chip_data = irq_desc_get_chip_data(desc);
	chip = irq_desc_get_chip(desc);

That avoids two pointless lookups of irq_desc.

Thanks,

	tglx
Re: [PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()
On Sat, 20 Jun 2015, Andy Lutomirski wrote:
> On Sat, Jun 20, 2015 at 7:14 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Sat, 20 Jun 2015, walter harms wrote:
> > > Acked-by: walter harms <wha...@bfs.de>
> > >
> > > Am 17.06.2015 02:35, schrieb Andy Lutomirski:
> > > > This is only used if BAYCOM_DEBUG is defined.
> >
> > So why don't we just replace that by ktime_get() and get rid of the
> > x86'ism in that driver.
>
> I don't have the hardware, and I don't see any good reason to make an
> rdtsc cleanup depend on a more complicated driver change. On the other
> hand, if the maintainers want to clean it up, I think it would be a
> great idea.
>
> This really seems to be debugging code, though. A normal kernel won't
> even compile it.

Right, but there is no reason that we have rdtsc outside of arch/x86
at all.

Thanks,

	tglx
Re: [PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()
On Sat, 20 Jun 2015, walter harms wrote:
> Acked-by: walter harms <wha...@bfs.de>
>
> Am 17.06.2015 02:35, schrieb Andy Lutomirski:
> > This is only used if BAYCOM_DEBUG is defined.

So why don't we just replace that by ktime_get() and get rid of the
x86'ism in that driver.

Thanks,

	tglx
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On Tue, 16 Jun 2015, Juergen Gross wrote:
> AFAIK there are no outstanding questions for more than one month now.
> I'd appreciate some feedback or accepting these patches.

They are against dead code, which will be gone soon. We switched over
to queued locks.

Thanks,

	tglx
Re: [PATCH v13 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM
On Wed, 29 Oct 2014, Waiman Long wrote:
>                 AIM7 XFS Disk Test (no overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   2542373   7.08        98.95      5.44
>   PV qspinlock    2549575   7.06        98.63      5.40
>   unfairlock      2616279   6.91        97.05      5.42
>
>                 AIM7 XFS Disk Test (200% overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   644468    27.93       415.22     6.33
>   PV qspinlock    645624    27.88       419.84     0.39

That number is made up by what?

>   unfairlock      695518    25.88       377.40     4.09
>
>                 AIM7 EXT4 Disk Test (no overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   1995565   9.02        103.67     5.76
>   PV qspinlock    2011173   8.95        102.15     5.40
>   unfairlock      2066590   8.71        98.13      5.46
>
>                 AIM7 EXT4 Disk Test (200% overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   478341    37.63       495.81     30.78
>   PV qspinlock    474058    37.97       475.74     30.95
>   unfairlock      560224    32.13       398.43     26.27
>
> For the AIM7 disk workload, both PV ticketlock and qspinlock have
> about the same performance. The unfairlock performs slightly better
> than the PV lock.

Slightly?

Taking the PV locks, which are basically the same for the existing
ticket locks and your new fangled qlocks, as a reference, then the so
called 'unfair locks', which are just the native locks w/o the PV
nonsense, are fundamentally better, up to a whopping 18% in the
ext4/200% overcommit case. See below.

>                 EBIZZY-m Test (no overcommit)
>   kernel          Rec/s   Real Time   Sys Time   Usr Time
>   -----           -----   ---------   --------   --------
>   PV ticketlock   3255    10.00       60.65      3.62
>   PV qspinlock    3318    10.00       54.27      3.60
>   unfairlock      2833    10.00       26.66      3.09
>
>                 EBIZZY-m Test (200% overcommit)
>   kernel          Rec/s   Real Time   Sys Time   Usr Time
>   -----           -----   ---------   --------   --------
>   PV ticketlock   841     10.00       71.03      2.37
>   PV qspinlock    834     10.00       68.27      2.39
>   unfairlock      865     10.00       27.08      1.51
>
>                 futextest (no overcommit)
>   kernel          kops/s
>   -----           ------
>   PV ticketlock   11523
>   PV qspinlock    12328
>   unfairlock      9478
>
>                 futextest (200% overcommit)
>   kernel          kops/s
>   -----           ------
>   PV ticketlock   7276
>   PV qspinlock    7095
>   unfairlock      5614
>
> The ebizzy and futextest have much higher spinlock contention than
> the AIM7 disk workload. In this case, the unfairlock performs worse
> than both the PV ticketlock and qspinlock. The performance of the
> 2 PV locks are comparable.

While I can see that the PV lock stuff performs 13% better for the
ebizzy no overcommit case, what about the very interesting numbers for
the same test with 200% overcommit?

The regular lock has a slightly better performance, but significantly
less sys/usr time. How do you explain that?

'Lies, damned lies and statistics' comes to my mind.

Thanks,

	tglx
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, 25 Nov 2014, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> > The problem:
> >
> > On -RT, an emulated LAPIC timer instance has the following path:
> >
> > 1) hard interrupt
> > 2) ksoftirqd is scheduled
> > 3) ksoftirqd wakes up vcpu thread
> > 4) vcpu thread is scheduled
> >
> > This extra context switch introduces unnecessary latency in the
> > LAPIC path for a KVM guest.
> >
> > The solution:
> >
> > Allow waking up vcpu thread from hardirq context, thus avoiding
> > the need for ksoftirqd to be scheduled.
> >
> > Normal waitqueues make use of spinlocks, which on -RT are sleepable
> > locks. Therefore, waking up a waitqueue waiter involves locking a
> > sleeping lock, which is not allowed from hard interrupt context.
>
> What are the consequences for the realtime kernel of making waitqueue
> traversal non-preemptible?
>
> Is waitqueue traversal something that ever takes so long we need to
> care about it being non-preemptible?

__wake_up
    lock_irq_save()
    __wake_up_common()
        for_each_entry(curr)
            curr->func()
    unlock_irq_save()

There are two issues here:

1) Long waiter chains. We probably could work around that in some
   creative way.

2) The non default callbacks. Default is default_wake_function. The
   non default callbacks, which are used by nfsd and other stuff, are
   calling the world and some more and take preemptible locks and do
   other fancy stuff which falls flat on its nose when called under a
   raw lock on RT.

So I created the swait replacement which has a smaller memory
footprint, less featuritis and a raw lock because it only wakes up,
no custom callbacks. They still suffer from #1, but most use cases on
RT are single waiter scenarios anyway.

Hope that helps.

Thanks,

	tglx
Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
On Tue, 25 Nov 2014, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> > Since lapic timer handler only wakes up a simple waitqueue, it can
> > be executed from hardirq context. Reduces average cyclictest
> > latency by 3us.
>
> Can this patch be merged in the KVM tree, and go upstream via Paolo?

Not really, as it has RT dependencies.

Thanks,

	tglx
Re: [PATCH] KVM: x86: export get_xsave_addr
On Mon, 24 Nov 2014, Paolo Bonzini wrote:
> get_xsave_addr is the API to access XSAVE states, and KVM would like
> to use it. Export it.
>
> Cc: x...@kernel.org
> Cc: H. Peter Anvin <h...@linux.intel.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
> Peter, can you please ACK this for inclusion in the KVM tree?

Are you content with my acked-by as well?

Acked-by: Thomas Gleixner <t...@linutronix.de>
Re: several messages
On Mon, 10 Nov 2014, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.

Can you please talk to Jiang and synchronize your work with his
refactoring of the x86 interrupt handling subsystem. I want this stuff
cleaned up first before we add new stuff to it.

Thanks,

	tglx
Re: [PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On Mon, 22 Sep 2014, Paolo Bonzini wrote:
> On x86_64, kernel text mappings are mapped read-only with
> CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL
> instructions to VMMCALL as required on AMD processors.
>
> The failure mode is currently a divide-by-zero exception, which
> obviously is a KVM bug that has to be fixed. However, picking the
> right instruction between VMCALL and VMMCALL will be faster and will
> help if you cannot upgrade the hypervisor.
>
> -/* This instruction is vmcall.  On non-VT architectures, it will generate a
> - * trap that we will then rewrite to the appropriate instruction.
> +#ifdef CONFIG_DEBUG_RODATA
> +#define KVM_HYPERCALL \
> +        ALTERNATIVE(".byte 0x0f,0x01,0xc1", ".byte 0x0f,0x01,0xd9", X86_FEATURE_VMMCALL)

If we can do it via a feature bit and alternatives, then why do you
want to patch it manually if CONFIG_DEBUG_RODATA=n? Just because more
#ifdeffery makes the code more readable?

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Il 04/09/2014 22:58, Thomas Gleixner ha scritto:
> > This is simply wrong.
>
> It is.
>
> > Now I have no idea why you think it needs to add xtime_sec.
> >
> > If the result is wrong, then we need to figure out which one of the
> > supplied values is wrong and not blindly add xtime_sec just because
> > that makes it magically correct.
> >
> > Can you please provide a proper background why you think that adding
> > xtime_sec is a good idea?
>
> It's not a good idea indeed. I didn't fully digest the 3.16->3.17
> timekeeping changes and messed up this patch. However, there is a bug
> in the "base_mono + offs_boot" formula, given that:
>
> - bisection leads to the merge commit of John's timers branch
>
> - bisecting within John's timers branch, with a KVM commit on top to
>   make the code much easier to trigger, leads to commit cbcf2dd3b3d4
>   ("x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based",
>   2014-07-16).
>
> - I backported your patch to 3.16, using wall_to_monotonic +
>   total_sleep_time + xtime_sec (wtm+xtime_sec as in pre-cbcf2dd3b3d4
>   code, total_sleep_time from 3.16 monotonic_to_bootbased) and it
>   works
>
> - In v2 of the patch I fixed the bug by changing the formula
>   "base_mono + offs_boot" to "offs_boot - offs_real" (and then adding
>   xtime_sec separately as in the 3.16 backport), but the two formulas
>   "base_mono + offs_boot" and "offs_boot - offs_real + xtime_sec"
>   ought to be identical.
So lets look at the differences here:

	3.16					3.17

inject_sleeptime(delta)			inject_sleeptime(delta)
    xtime += delta;			    xtime += delta;
    wall_to_mono -= delta;		    wall_to_mono -= delta;
    offs_real = -wall_to_mono;		    offs_real = -wall_to_mono;
    sleeptime += delta;			    offs_boot += delta;

getboottime()
    tmp = wall_to_mono + sleeptime;
    boottime = -tmp;

So:
    boottime = -wall_to_mono - sleeptime;

Because of the above:
    offs_real = -wall_to_mono;
    offs_boot = sleeptime;

The result is:
    boottime = offs_real - offs_boot;	    boottime = offs_real - offs_boot;

monotonic_to_bootbased(mono)		monotonic_to_bootbased(mono)
    return mono + sleeptime;

Because of the above:
    offs_boot = sleeptime;

The result is:
    return mono + offs_boot;		    return mono + offs_boot;

Now on the KVM side we have:

update_pvclock_gtod()			update_pvclock_gtod()
    mono_base = xtime + wall_to_mono;	    boot_base = mono_base + offs_boot;

and

do_monotonic()				do_monotonic_boot()
    mono_now = mono_base + delta_ns;	    boot_now = boot_base + delta_ns;

kvm_get_time_and_clockread()
    mono_now = do_monotonic()
    boot_now = mono_to_boot(mono_now);

So that means on both sides the same:

    boot_now = mono_base + offs_boot + delta_ns;

So that means the code is correct. Now where is the bug?

Well hidden and still so obvious that it's even visible through the
brown paperbag I'm wearing ...
Thanks,

	tglx

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fb4a9c2cf8d9..ec1791fae965 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -442,11 +442,12 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
-	update_vsyscall(tk);
-	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
 	tk_update_ktime_data(tk);
 
+	update_vsyscall(tk);
+	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
+
 	if (action & TK_MIRROR)
 		memcpy(&shadow_timekeeper, &tk_core.timekeeper,
 		       sizeof(tk_core.timekeeper));
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 17:14, Thomas Gleixner ha scritto:
> > So that means the code is correct. Now where is the bug?
>
> In kernel/time/timekeeping.c? We know that we should have
>
>	base_mono = wall_to_monotonic + xtime_sec
>
> Instead it is
>
>	base_mono = wall_to_monotonic + xtime_sec - seconds from boot time
>
> which is... zero. Given this is the only use of base_mono in a
> notifier, I wonder if it is as simple as this (which I don't have
> time to test right now):
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fb4a9c2cf8d9..f6807a85b8c9 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -443,9 +443,9 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
>  		ntp_clear();
>  	}
>  	update_vsyscall(tk);
> -	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
>
>  	tk_update_ktime_data(tk);
>
> +	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

Why are you moving the update between vsyscall and pvclock update as I
did in my patch? We really need to update everything before calling
somewhere.

And yes, it is that simple. I instrumented the stuff and it's correct
now.

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 20:33, Thomas Gleixner ha scritto:
> > >  	update_vsyscall(tk);
> > > -	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
> > >
> > >  	tk_update_ktime_data(tk);
> > >
> > > +	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
> >
> > Why are you moving the update between vsyscall and pvclock update
> > as I did in my patch? We really need to update everything before
> > calling somewhere.
>
> Do you mean the call should be moved not just after
> tk_update_ktime_data (which sets base_mono), but further down after
> update_fast_timekeeper(tk)?

No, it needs to be above update_vsyscall(). Here is the patch again
which I sent before. [https://lkml.org/lkml/2014/9/5/395]

Thanks,

	tglx

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fb4a9c2cf8d9..ec1791fae965 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -442,11 +442,12 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
-	update_vsyscall(tk);
-	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
 	tk_update_ktime_data(tk);
 
+	update_vsyscall(tk);
+	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
+
 	if (action & TK_MIRROR)
 		memcpy(&shadow_timekeeper, &tk_core.timekeeper,
 		       sizeof(tk_core.timekeeper));
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Linus Torvalds wrote:
> On Fri, Sep 5, 2014 at 3:16 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> >   are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
>
> Nothing new there. Forgot to push out, or perhaps to use "-f" to
> overwrite the previous tag of the same name?

And even if there would be something, please do not pull the top most
commit b11ba8c62be3eb ("KVM: x86: fix kvmclock breakage from timers
branch merge"). That one is blatantly wrong and just hacks badly
around a brown paperbag bug in the core timekeeping code, which I
introduced in the last overhaul.

It's debugged and understood. Fix is posted and if confirmed by the
KVM folks it will go to you before rc4.

Thanks,

	tglx
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 22:58, Thomas Gleixner ha scritto:
> > > Nothing new there. Forgot to push out, or perhaps to use "-f" to
> > > overwrite the previous tag of the same name?
>
> It's there now. Probably a --dry-run too much (I have
> push = +refs/tags/for-linus:refs/tags/for-linus in the remote
> configuration).
>
> > And even if there would be something, please do not pull the top
> > most commit b11ba8c62be3eb ("KVM: x86: fix kvmclock breakage from
> > timers branch merge"). That one is blatantly wrong and just hacks
> > badly around a brown paperbag bug in the core timekeeping code,
> > which I introduced in the last overhaul.
>
> It's not wrong, it's just different. The commit message says clearly

Right, it's different. Because you paper at the receiving end over a
core bug, and that's wrong to begin with.

> that besides acting as a workaround, I find the patched code easier to
> understand, and I clearly stated the same in the tag message.

Well, we might have different opinions about "easier to understand".

I did go to great lengths to disentangle the monotonic boot time, which
you are interested in, from xtime, because the latter does not make any
sense outside of the core timekeeping code. Aside of that I optimized
the whole thing to avoid conversions, loops and hoops.

So you just add another multiply and add to make it more
understandable. Sigh.

Thanks,

	tglx
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 23:24, Thomas Gleixner ha scritto:
> > > that besides acting as a workaround, I find the patched code
> > > easier to understand, and I clearly stated the same in the tag
> > > message.
> >
> > Well, we might have different opinions about "easier to understand".
> >
> > I did go to great lengths to disentangle the monotonic boot time,
> > which you are interested in, from xtime, because the latter does not
> > make any sense outside of the core timekeeping code. Aside of that I
> > optimized the whole thing to avoid conversions, loops and hoops.
> >
> > So you just add another multiply and add to make it more
> > understandable.
>
> Fair enough, I've dropped the patch. Though I agree that the struct
> member name choice is not the most brilliant one, so feel free to fix
> that. Thanks for helping out with the core timekeeping fix.

I'm still banging my head against the wall ...
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Commit cbcf2dd3b3d4 ("x86: kvm: Make kvm_get_time_and_clockread()
> nanoseconds based", 2014-07-16) forgot to add tk->xtime_sec, thus
> breaking kvmclock on

Errm. How is boottime related to xtime_sec?

> hosts that have a reliable TSC. Add it back; and since the field
> boot_ns is not anymore related to the host boot-based clock, rename
> boot_ns->nsec_base and the existing nsec_base->snsec_base.

This is simply wrong. The original code before that change did:

	vdata->monotonic_time_sec = tk->xtime_sec
				  + tk->wall_to_monotonic.tv_sec;
	vdata->monotonic_time_snsec = tk->xtime_nsec
				  + (tk->wall_to_monotonic.tv_nsec
				     << tk->shift);

So this is the momentary monotonic base time. And the readout function
did:

	ts->tv_nsec = 0;
	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ts->tv_sec = gtod->monotonic_time_sec;
		ns = gtod->monotonic_time_snsec;
		ns += vgettsc(cycle_now);
		ns >>= gtod->clock.shift;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	timespec_add_ns(ts, ns);

So this does:

	now = monotonic_base + delta_nsec

And the caller converted it to boot time with:

	monotonic_to_bootbased(ts);

So the time calculation does:

	now = monotonic_base + delta_nsec + mono_to_boot

Because:

	monotonic_base + mono_to_boot = boot_time_base

the calculation can be written as:

	now = boot_time_base + delta_nsec

The new code does:

	boot_ns = ktime_to_ns(ktime_add(tk->base_mono, tk->offs_boot));

So that's:

	boot_time_base = monotonic_base + mono_to_boot;

	vdata->boot_ns		= boot_ns;
	vdata->nsec_base	= tk->tkr.xtime_nsec;

And the readout does:

	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ns = gtod->nsec_base;
		ns += vgettsc(cycle_now);
		ns >>= gtod->clock.shift;
		ns += gtod->boot_ns;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	*t = ns;

Which is:

	boot_time_base + delta_nsec

Now I have no idea why you think it needs to add xtime_sec.

If the result is wrong, then we need to figure out which one of the
supplied values is wrong, and not blindly add xtime_sec just because
that makes it magically correct.

Can you please provide a proper background why you think that adding
xtime_sec is a good idea?

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Commit cbcf2dd3b3d4 ("x86: kvm: Make kvm_get_time_and_clockread()
> nanoseconds based", 2014-07-16) used the wrong formula for boot_ns,
> thus breaking kvmclock on hosts that have a reliable TSC.
>
> To find the right formula, let's first backport the switch to
> nanoseconds to 3.16-era timekeeping logic. The full patch (which
> works) is at https://lkml.org/lkml/2014/9/4/462. The key line here is
>
>	boot_ns = timespec_to_ns(tk->total_sleep_time) +
>		  timespec_to_ns(tk->wall_to_monotonic) +
>		  tk->xtime_sec * (u64)NSEC_PER_SEC;
>
> Because the above patch works, the conclusion is that the above
> formula is not the same as commit cbcf2dd3b3d4's
>
>	boot_ns = ktime_to_ns(ktime_add(tk->tkr.base_mono, tk->offs_boot));
>
> As to what is the right one, commit 02cba1598a2a ("timekeeping:
> Simplify getboottime()", 2014-07-16) provides a hint:
>
>	offs_real = -wall_to_monotonic
>	offs_boot = total_sleep_time
>	offs_real - offs_boot = -wall_to_monotonic - total_sleep_time
>
> that is
>
>	offs_boot - offs_real = wall_to_monotonic + total_sleep_time
>
> which is what this patch uses, adding xtime_sec separately.
>
> The boot_ns moniker is not too clear, so rename boot_ns to nsec_base
> and the existing nsec_base to snsec_base.
>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: John Stultz <john.stu...@linaro.org>
> Reported-by: Chris J Arges <chris.j.ar...@canonical.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
> Thomas/John, the problem with the above explanation is that
> tk_update_ktime_data has base_mono = xtime_sec + wtm, and from there
> base_mono + offs_boot = xtime_sec + wtm + total_sleep_time. Except
> that doesn't work, so something must be wrong in
> tk_update_ktime_data's comment.

Right. I'm staring into it and we need to fix the core code issue and
not the usage site.

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Il 04/09/2014 22:58, Thomas Gleixner ha scritto:
> > This is simply wrong.
>
> It is.
>
> > Now I have no idea why you think it needs to add xtime_sec.
> >
> > If the result is wrong, then we need to figure out which one of the
> > supplied values is wrong and not blindly add xtime_sec just because
> > that makes it magically correct.
> >
> > Can you please provide a proper background why you think that adding
> > xtime_sec is a good idea?
>
> It's not a good idea indeed. I didn't fully digest the 3.16->3.17
> timekeeping changes and messed up this patch. However, there is a bug
> in the "base_mono + offs_boot" formula, given that:
>
> - bisection leads to the merge commit of John's timers branch
>
> - bisecting within John's timers branch, with a KVM commit on top to
>   make the code much easier to trigger, leads to commit cbcf2dd3b3d4
>   ("x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based",
>   2014-07-16).
>
> - I backported your patch to 3.16, using wall_to_monotonic +
>   total_sleep_time + xtime_sec (wtm+xtime_sec as in pre-cbcf2dd3b3d4
>   code, total_sleep_time from 3.16 monotonic_to_bootbased) and it
>   works
>
> - In v2 of the patch I fixed the bug by changing the formula
>   "base_mono + offs_boot" to "offs_boot - offs_real" (and then adding
>   xtime_sec separately as in the 3.16 backport), but the two formulas
>   "base_mono + offs_boot" and "offs_boot - offs_real + xtime_sec"
>   ought to be identical.
>
> I find "offs_boot - offs_real + xtime" more readable than the
> alternative "base_mono + offs_boot + xtime_nsec", so the fix doubles
> as a cleanup for me and I'm fine with it. But something must be wrong
> in the timekeeping code.

I think I have a vague idea what happened, but I'm way too tired now
to write it up fully. I'll do that tomorrow morning with brain awake.

Thanks,

	tglx
[patch V2 43/64] x86: kvm: Use ktime_get_boot_ns()
Use the new nanoseconds based interface and get rid of the timespec
conversion dance.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Gleb Natapov <g...@kernel.org>
Cc: kvm@vger.kernel.org
---
 arch/x86/kvm/x86.c |    6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

Index: tip/arch/x86/kvm/x86.c
===================================================================
--- tip.orig/arch/x86/kvm/x86.c
+++ tip/arch/x86/kvm/x86.c
@@ -1109,11 +1109,7 @@ static void kvm_get_time_scale(uint32_t
 
 static inline u64 get_kernel_ns(void)
 {
-	struct timespec ts;
-
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
-	return timespec_to_ns(&ts);
+	return ktime_get_boot_ns();
 }
 
 #ifdef CONFIG_X86_64
[patch V2 44/64] x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based
Convert the relevant base data right away to nanoseconds instead of
doing the conversion on every readout. Reduces text size by 160 bytes.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Gleb Natapov <g...@kernel.org>
Cc: kvm@vger.kernel.org
---
 arch/x86/kvm/x86.c | 44 ++++++++++++++------------------------------
 1 file changed, 14 insertions(+), 30 deletions(-)

Index: tip/arch/x86/kvm/x86.c
===================================================================
--- tip.orig/arch/x86/kvm/x86.c
+++ tip/arch/x86/kvm/x86.c
@@ -984,9 +984,8 @@ struct pvclock_gtod_data {
 		u32	shift;
 	} clock;
 
-	/* open coded 'struct timespec' */
-	u64		monotonic_time_snsec;
-	time_t		monotonic_time_sec;
+	u64		boot_ns;
+	u64		nsec_base;
 };
 
 static struct pvclock_gtod_data pvclock_gtod_data;
@@ -994,6 +993,9 @@ static struct pvclock_gtod_data pvclock_
 static void update_pvclock_gtod(struct timekeeper *tk)
 {
 	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+	u64 boot_ns;
+
+	boot_ns = ktime_to_ns(ktime_add(tk->base_mono, tk->offs_boot));
 
 	write_seqcount_begin(&vdata->seq);
 
@@ -1004,17 +1006,8 @@ static void update_pvclock_gtod(struct t
 	vdata->clock.mult		= tk->mult;
 	vdata->clock.shift		= tk->shift;
 
-	vdata->monotonic_time_sec	= tk->xtime_sec
-					+ tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_snsec	= tk->xtime_nsec
-					+ (tk->wall_to_monotonic.tv_nsec
-						<< tk->shift);
-	while (vdata->monotonic_time_snsec >=
-					(((u64)NSEC_PER_SEC) << tk->shift)) {
-		vdata->monotonic_time_snsec -=
-					((u64)NSEC_PER_SEC) << tk->shift;
-		vdata->monotonic_time_sec++;
-	}
+	vdata->boot_ns			= boot_ns;
+	vdata->nsec_base		= tk->xtime_nsec;
 
 	write_seqcount_end(&vdata->seq);
 }
@@ -1371,23 +1364,22 @@ static inline u64 vgettsc(cycle_t *cycle
 	return v * gtod->clock.mult;
 }
 
-static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+static int do_monotonic_boot(s64 *t, cycle_t *cycle_now)
 {
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
-	u64 ns;
 	int mode;
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	u64 ns;
 
-	ts->tv_nsec = 0;
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		mode = gtod->clock.vclock_mode;
-		ts->tv_sec = gtod->monotonic_time_sec;
-		ns = gtod->monotonic_time_snsec;
+		ns = gtod->nsec_base;
 		ns += vgettsc(cycle_now);
 		ns >>= gtod->clock.shift;
+		ns += gtod->boot_ns;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-	timespec_add_ns(ts, ns);
+	*t = ns;
 
 	return mode;
 }
@@ -1395,19 +1387,11 @@ static int do_monotonic(struct timespec
 /* returns true if host is using tsc clocksource */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
 {
-	struct timespec ts;
-
 	/* checked again under seqlock below */
 	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
 		return false;
 
-	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
-		return false;
-
-	monotonic_to_bootbased(&ts);
-	*kernel_ns = timespec_to_ns(&ts);
-
-	return true;
+	return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
 }
 
 #endif
Re: [RFC] Implement Batched (group) ticket lock
On Wed, 28 May 2014, Linus Torvalds wrote:
> If somebody has a P4 still, that's likely the worst case by far.

I do, but I'm only using it during winter and only if the ia64 machine
does not provide sufficient heating. So you have to wait at least half
a year until I'm able to test it.
Re: Preemptable Ticket Spinlock
On Thu, 30 May 2013, Raghavendra K T wrote:
> Here is the branch with pvspinlock V9 version in github, rebased to
> 3.10-rc:
>
>   https://github.com/ktraghavendra/linux/tree/pvspinlock_v9
>
> planning to post a formal email in a separate thread with a link to
> this branch (instead of spamming with 19 patches)

19 patches is not really spam if you compare it to the total number of
mails per day on LKML.

The git tree is nice for people who want to test stuff easily, but if
you want people to review and comment on patches, then please use mail.

Thanks,

	tglx
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Sat, 14 Jul 2012, Jan Kiszka wrote:
> On 2012-07-14 04:25, Thomas Gleixner wrote:
> This patch here is a workaround to unbreak devices assignment in 3.5
> after the IRQ layer changes without regressing noticeable /wrt
> overhead.

Yeah, "workaround" and "regression" are the proper marketing buzzwords
to excuse mindless hackery.

It took me a minute to figure out that there is no reason at all to
use a threaded interrupt handler for MSI and MSIX.

If you folks are so concerned about performance and overhead then
someone familiar with that code should have done the obvious change
long before the oneshot modifications took place.

I can't be bothered to do a performance test myself, but I bet it's
making an order of magnitude more of a difference than the oh so
noticeable few cycles in irq_finalize_oneshot().

Thanks,

	tglx

Index: linux-2.6/virt/kvm/assigned-dev.c
===================================================================
--- linux-2.6.orig/virt/kvm/assigned-dev.c
+++ linux-2.6/virt/kvm/assigned-dev.c
@@ -105,7 +105,7 @@ static irqreturn_t kvm_assigned_dev_thre
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msi_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 
@@ -117,7 +117,7 @@ static irqreturn_t kvm_assigned_dev_thre
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msix_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 	int index = find_index_from_host_irq(assigned_dev, irq);
@@ -334,11 +334,6 @@ static int assigned_device_enable_host_i
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msi(struct kvm *kvm,
 					   struct kvm_assigned_dev_kernel *dev)
 {
@@ -351,9 +346,8 @@ static int assigned_device_enable_host_m
 	}
 
 	dev->host_irq = dev->dev->irq;
-	if (request_threaded_irq(dev->host_irq, kvm_assigned_dev_msi,
-				 kvm_assigned_dev_thread_msi, 0,
-				 dev->irq_name, dev)) {
+	if (request_irq(dev->host_irq, kvm_assigned_dev_msi_handler, 0,
+			dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -363,11 +357,6 @@ static int assigned_device_enable_host_m
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msix(struct kvm *kvm,
 					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -383,10 +372,9 @@ static int assigned_device_enable_host_m
 		return r;
 
 	for (i = 0; i < dev->entries_nr; i++) {
-		r = request_threaded_irq(dev->host_msix_entries[i].vector,
-					 kvm_assigned_dev_msix,
-					 kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, dev);
+		r = request_irq(dev->host_msix_entries[i].vector,
+				kvm_assigned_dev_msix_handler,
+				0, dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Sat, 14 Jul 2012, Jan Kiszka wrote:
> On 2012-07-14 13:16, Thomas Gleixner wrote:
> > On Sat, 14 Jul 2012, Jan Kiszka wrote:
> > > On 2012-07-14 04:25, Thomas Gleixner wrote:
> > > This patch here is a workaround to unbreak devices assignment in
> > > 3.5 after the IRQ layer changes without regressing noticeable
> > > /wrt overhead.
> >
> > Yeah, "workaround" and "regression" are the proper marketing
> > buzzwords to excuse mindless hackery.
> >
> > It took me a minute to figure out that there is no reason at all to
> > use a threaded interrupt handler for MSI and MSIX.
>
> Thomas, we also explained to you in the cited thread that your simple
> approach for this doesn't work as is. We will have a proper solution
> soon, but it takes a bit more than a minute - at least us.

And I explained to you in that very thread that the proper solution to
avoid the overhead of finalize_oneshot is exactly the patch I sent to
Linus yesterday:

  "The only way we can avoid that, is that we get a hint from the
   underlying irq chip / handler setup with an extra flag to tell the
   core, that it's safe to avoid the ONESHOT/finalize magic."

So now it took a full month of ignorance to come up with the
mindboggling solution of working around the core change with a private
hack instead of sitting down and doing what was said to be the correct
solution.

And that's what seriously annoys me. Instead of doing it yourself or at
least politely poking me to get it done, stuff just gets hacked into
submission and sold as the performance regression saviour.

Of course you are free to ignore my advice, but that does not mean that
I take bullshit from you.

Thanks,

	tglx
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Linus Torvalds wrote:
> On Fri, Jul 13, 2012 at 8:45 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> At the same time, I do wonder if maybe MSI + IRQF_ONESHOT couldn't be
> improved. The fact that the KVM people think that the extra overhead
> of IRQF_ONESHOT is a bad thing for MSI interrupts makes me wonder if
> maybe this wouldn't be an area the irq layer couldn't be improved on.
>
> Maybe the MSI+IRQF_ONESHOT case could be improved. Because MSI is
> kind of fundamentally one-shot, since it's a message-based irq
> scheme. So maybe the extra overhead is unnecessary in general, not
> just in this particular KVM case. Hmm? Thomas, see the commentary of
> a76beb14123a ("KVM: Fix device assignment threaded irq handler").

Groan. We already discussed to let the irq chip (in this case MSI)
tell the core that it does not need the extra oneshot handling. That
way the code which requests a threaded irq with the NULL primary
handler works on both MSI and normal interrupts. Untested patch below.

Thanks,

	tglx

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3109,6 +3109,7 @@ static struct irq_chip msi_chip = {
 	.irq_set_affinity	= msi_set_affinity,
 #endif
 	.irq_retrigger		= ioapic_retrigger_irq,
+	.flags			= IRQCHIP_ONESHOT_SAFE,
 };
 
 static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -351,6 +351,7 @@ enum {
 	IRQCHIP_MASK_ON_SUSPEND		= (1 << 2),
 	IRQCHIP_ONOFFLINE_ENABLED	= (1 << 3),
 	IRQCHIP_SKIP_SET_WAKE		= (1 << 4),
+	IRQCHIP_ONESHOT_SAFE		= (1 << 5),
 };
 
 /* This include will go away once we isolated irq_desc usage to core code */
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -1004,35 +1004,48 @@ __setup_irq(unsigned int irq, struct irq
 	 */
 	if (new->flags & IRQF_ONESHOT) {
 		/*
-		 * Unlikely to have 32 resp 64 irqs sharing one line,
-		 * but who knows.
+		 * Drivers are often written to work w/o knowledge
+		 * about the underlying irq chip implementation, so a
+		 * request for a threaded irq without a primary hard
+		 * irq context handler requires the ONESHOT flag to be
+		 * set. Some irq chips like MSI based interrupts are
+		 * per se one shot safe. Check the chip flags, so we
+		 * can avoid the unmask dance at the end of the
+		 * threaded handler for those.
 		 */
-		if (thread_mask == ~0UL) {
-			ret = -EBUSY;
-			goto out_mask;
+		if (desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE) {
+			new->flags &= ~IRQF_ONESHOT;
+		} else {
+			/*
+			 * Unlikely to have 32 resp 64 irqs sharing one line,
+			 * but who knows.
+			 */
+			if (thread_mask == ~0UL) {
+				ret = -EBUSY;
+				goto out_mask;
+			}
+			/*
+			 * The thread_mask for the action is or'ed to
+			 * desc->thread_active to indicate that the
+			 * IRQF_ONESHOT thread handler has been woken, but not
+			 * yet finished. The bit is cleared when a thread
+			 * completes. When all threads of a shared interrupt
+			 * line have completed desc->threads_active becomes
+			 * zero and the interrupt line is unmasked. See
+			 * handle.c:irq_wake_thread() for further information.
+			 *
+			 * If no thread is woken by primary (hard irq context)
+			 * interrupt handlers, then desc->threads_active is
+			 * also checked for zero to unmask the irq line in the
+			 * affected hard irq flow handlers
+			 * (handle_[fasteoi|level]_irq).
+			 *
+			 * The new action gets the first zero bit of
+			 * thread_mask assigned. See the loop above which or's
+			 * all existing action->thread_mask bits.
+			 */
+			new->thread_mask = 1 << ffz(thread_mask);
 		}
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Linus Torvalds wrote:
> On Fri, Jul 13, 2012 at 11:28 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > We already discussed to let the irq chip (in this case MSI) tell
> > the core that it does not need the extra oneshot handling. That way
> > the code which requests a threaded irq with the NULL primary
> > handler works on both MSI and normal interrupts.
>
> So I don't think your patch is quite right. If you want to clear the
> IRQF_ONESHOT for MSI irq's (and other ones where the interrupt
> controller is fundamentally ONESHOT), I think you should do it a few
> lines higher up - *before* you check the "does the IRQF_ONESHOT mask
> match other shared interrupts?"
>
> Now, irq sharing presumably doesn't happen with MSI, but there's
> nothing fundamentally wrong with message-based irq schemes that have
> shared interrupt handlers. I think. Hmm?

Shared irqs are not supported by MSI, but yes, the check should be done
way up. Makes it less ugly as well :)

Thanks,

	tglx

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3109,6 +3109,7 @@ static struct irq_chip msi_chip = {
 	.irq_set_affinity	= msi_set_affinity,
 #endif
 	.irq_retrigger		= ioapic_retrigger_irq,
+	.flags			= IRQCHIP_ONESHOT_SAFE,
 };
 
 static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -351,6 +351,7 @@ enum {
 	IRQCHIP_MASK_ON_SUSPEND		= (1 << 2),
 	IRQCHIP_ONOFFLINE_ENABLED	= (1 << 3),
 	IRQCHIP_SKIP_SET_WAKE		= (1 << 4),
+	IRQCHIP_ONESHOT_SAFE		= (1 << 5),
 };
 
 /* This include will go away once we isolated irq_desc usage to core code */
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -960,6 +960,18 @@ __setup_irq(unsigned int irq, struct irq
 	}
 
 	/*
+	 * Drivers are often written to work w/o knowledge about the
+	 * underlying irq chip implementation, so a request for a
+	 * threaded irq without a primary hard irq context handler
+	 * requires the ONESHOT flag to be set. Some irq chips like
+	 * MSI based interrupts are per se one shot safe. Check the
+	 * chip flags, so we can avoid the unmask dance at the end of
+	 * the threaded handler for those.
+	 */
+	if (desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)
+		new->flags &= ~IRQF_ONESHOT;
+
+	/*
 	 * The following block of code has to be executed atomically
 	 */
 	raw_spin_lock_irqsave(&desc->lock, flags);
@@ -1033,7 +1045,8 @@ __setup_irq(unsigned int irq, struct irq
 		 */
 		new->thread_mask = 1 << ffz(thread_mask);
 
-	} else if (new->handler == irq_default_primary_handler) {
+	} else if (new->handler == irq_default_primary_handler &&
+		   !(desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)) {
 		/*
 		 * The interrupt was requested with handler = NULL, so
 		 * we use the default primary handler for it. But it
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Thomas Gleixner wrote:
> On Fri, 13 Jul 2012, Linus Torvalds wrote:
> > On Fri, Jul 13, 2012 at 8:45 AM, Linus Torvalds
> > <torva...@linux-foundation.org> wrote:
> > At the same time, I do wonder if maybe MSI + IRQF_ONESHOT couldn't
> > be improved. The fact that the KVM people think that the extra
> > overhead of IRQF_ONESHOT is a bad thing for MSI interrupts makes me
> > wonder if maybe this wouldn't be an area the irq layer couldn't be
> > improved on.
> >
> > Maybe the MSI+IRQF_ONESHOT case could be improved. Because MSI is
> > kind of fundamentally one-shot, since it's a message-based irq
> > scheme. So maybe the extra overhead is unnecessary in general, not
> > just in this particular KVM case. Hmm? Thomas, see the commentary
> > of a76beb14123a ("KVM: Fix device assignment threaded irq handler").
>
> Groan. We already discussed to let the irq chip (in this case MSI)
> tell the core that it does not need the extra oneshot handling. That
> way the code which requests a threaded irq with the NULL primary
> handler works on both MSI and normal interrupts.

That's the kind of stuff which makes me go berserk, and just for the
record: the most complaints I get for going berserk are coming from the
virt folks.

I really can't understand why the virt folks think they are special and
do not have to talk to core maintainers. Back then when I was doing the
big irq cleanup, virt crap stood out by far with the most
idiotic^Wcreative workarounds. Of course nobody complained about me
being moronic enough to come up with generic solutions for their
problems.

Though especially that commit including its changelog proves once again
the ignorance and disinterest of the virt crowd in solving problems
which are not only relevant to themselves.

I wish you'd just refused to pull that nonsense and instead made them
talk to those folks who had a proper solution in mind already. In fact
we could have solved that properly weeks ago, if only people would have
raised the issue.

I'm tired of that kind of crap, really.

	Thomas
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Sun, 3 Jun 2012, Avi Kivity wrote:
> On 06/01/2012 09:26 PM, Jan Kiszka wrote:
> > > you suggesting we need a request_edge_threaded_only_irq() API?
> > > Thanks,
> >
> > I'm just wondering if that restriction for threaded IRQs is really
> > necessary for all use cases we have. Threaded MSIs do not appear to
> > me like have to be handled that conservatively, but maybe I'm
> > missing some detail.
>
> btw, I'm hoping we can unthread assigned MSIs. If the delivery is
> unicast, we can precalculate everything and all the handler has to do
> is set the IRR, KVM_REQ_EVENT, and kick the vcpu. All of these can be
> done from interrupt context with just RCU locking.

There is really no need to run MSI/MSI-X interrupts threaded for KVM.
I'm running the patch below for quite some time and it works like a
charm.

Thanks,

	tglx

Index: linux-2.6/virt/kvm/assigned-dev.c
===================================================================
--- linux-2.6.orig/virt/kvm/assigned-dev.c
+++ linux-2.6/virt/kvm/assigned-dev.c
@@ -105,7 +105,7 @@ static irqreturn_t kvm_assigned_dev_thre
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msi_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 
@@ -117,7 +117,7 @@ static irqreturn_t kvm_assigned_dev_thre
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msix_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 	int index = find_index_from_host_irq(assigned_dev, irq);
@@ -346,9 +346,8 @@ static int assigned_device_enable_host_m
 	}
 
 	dev->host_irq = dev->dev->irq;
-	if (request_threaded_irq(dev->host_irq, NULL,
-				 kvm_assigned_dev_thread_msi, 0,
-				 dev->irq_name, dev)) {
+	if (request_irq(dev->host_irq, kvm_assigned_dev_msi_handler, 0,
+			dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -373,9 +372,9 @@ static int assigned_device_enable_host_m
 		return r;
 
 	for (i = 0; i < dev->entries_nr; i++) {
-		r = request_threaded_irq(dev->host_msix_entries[i].vector,
-					 NULL, kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, dev);
+		r = request_irq(dev->host_msix_entries[i].vector,
+				kvm_assigned_dev_msix_handler, 0,
+				dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Mon, 4 Jun 2012, Jan Kiszka wrote:
> On 2012-06-04 13:21, Thomas Gleixner wrote:
> > So this shortcut requires some checks before being applied to a
> > specific MSI/MSI-X vector.
>
> Taking KVM aside, my general question remains if threaded MSI
> handlers of all devices really need to apply IRQF_ONESHOT though they
> should have no use for it.

In theory no, but we had more than one incident, where threaded irqs
w/o a primary handler and w/o IRQF_ONESHOT lead to full system
starvation. Linus requested this sanity check and I think it's sane and
required.

In fact it's a non issue for MSI. MSI uses handle_edge_irq which does
not mask the interrupt. IRQF_ONESHOT is a noop for that flow handler.

Thanks,

	tglx
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Mon, 4 Jun 2012, Jan Kiszka wrote:
> On 2012-06-04 15:07, Thomas Gleixner wrote:
> > On Mon, 4 Jun 2012, Jan Kiszka wrote:
> > > On 2012-06-04 13:21, Thomas Gleixner wrote:
> > > > So this shortcut requires some checks before being applied to a
> > > > specific MSI/MSI-X vector.
> > >
> > > Taking KVM aside, my general question remains if threaded MSI
> > > handlers of all devices really need to apply IRQF_ONESHOT though
> > > they should have no use for it.
> >
> > In theory no, but we had more than one incident, where threaded
> > irqs w/o a primary handler and w/o IRQF_ONESHOT lead to full system
> > starvation. Linus requested this sanity check and I think it's sane
> > and required.
>
> OK.
>
> > In fact it's a non issue for MSI. MSI uses handle_edge_irq which
> > does not mask the interrupt. IRQF_ONESHOT is a noop for that flow
> > handler.
>
> Isn't irq_finalize_oneshot processed for all flows?

Right, forgot about that.

The only way we can avoid that, is that we get a hint from the
underlying irq chip / handler setup with an extra flag to tell the
core, that it's safe to avoid the ONESHOT/finalize magic.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
> On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > +	if (address == msi_start + PCI_MSI_DATA_32)
> > +		handle_cfg_write_msi(pci_dev, assigned_dev);
>
> Why didn't we just use range_covers_byte(address, len,
> pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does this
> handle the enable bit?

The problem with the current implementation is that it only changes the
routing if the msi entry goes from masked to unmasked state.

Linux does not mask the entries on affinity changes and never did,
neither for MSI nor for MSI-X. I know it's probably not according to
the spec, but we can't fix that retroactively.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
> On Thu, 2012-05-24 at 18:53 -0300, Jan Kiszka wrote:
> > On 2012-05-24 18:39, Thomas Gleixner wrote:
> > > On Thu, 24 May 2012, Alex Williamson wrote:
> > > > On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > > > > +	if (address == msi_start + PCI_MSI_DATA_32)
> > > > > +		handle_cfg_write_msi(pci_dev, assigned_dev);
> > > >
> > > > Why didn't we just use range_covers_byte(address, len,
> > > > pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does
> > > > this handle the enable bit?
> > >
> > > The problem with the current implementation is that it only
> > > changes the routing if the msi entry goes from masked to unmasked
> > > state.
> > >
> > > Linux does not mask the entries on affinity changes and never
> > > did, neither for MSI nor for MSI-X. I know it's probably not
> > > according to the spec, but we can't fix that retroactively.
> >
> > For MSI, this is allowed. For MSI-X, this would clearly be a Linux
> > bug, waiting for hardware to dislike this spec violation. However,
> > if this is the current behavior of such a prominent guest, I guess
> > we have to stop optimizing the QEMU MSI-X code that it only updates
> > routings on mask changes. Possibly other OSes get this wrong too...
> > Thanks, for the clarification. Should go into the changelog.
>
> Hmm, if Linux didn't mask MSI-X before updating vectors it'd not only
> be a spec violation, but my testing of the recent changes to fix
> MSI-X vector updates for exactly this would have failed...
>
>	} else if (msix_masked(&orig) && !msix_masked(&entry)) {
>		... update vector ...
>
> So I'm not entirely sure I believe that. Thanks,

What happens is:

A write to /proc/irq/$N/smp_affinity calls into irq_set_affinity(),
which does:

	if (irq_can_move_pcntxt(data)) {
		ret = chip->irq_set_affinity(data, mask, false);
	} else {
		irqd_set_move_pending(data);
		irq_copy_pending(desc, mask);
	}

MSI and MSI-X fall into the !irq_can_move_pcntxt() code path unless the
irq is remapped, which is not the case in a guest. That means that we
merely copy the new mask and set the move pending bit.

MSI/MSI-X use the edge handler, so on the next incoming interrupt we do
irq_desc->chip->irq_ack(), which ends up in ack_apic_edge(), which does:

	static void ack_apic_edge(struct irq_data *data)
	{
		irq_complete_move(data->chip_data);
		irq_move_irq(data);
		ack_APIC_irq();
	}

irq_move_irq() is the interesting function. And that does
irq_desc->chip->irq_mask() before calling the irq_set_affinity()
function which actually changes the masks.

->irq_mask() ends up in mask_msi_irq(). Now that calls
msi_set_mask_bit() and for MSI-X that actually masks the irq. So ignore
my MSI-X comment.

But for MSI this ends up in msi_mask_irq() which does:

	if (!desc->msi_attrib.maskbit)
		return 0;

So in case desc->msi_attrib.maskbit is 0 we do not write anything out
and then the masked/unmasked logic in qemu fails.

Sorry, that I did not decode that down to this level before, but I was
in a hurry and assumed correctly that qemu is doing something wrong.
Not being familiar with that code did not help either :)

So the proper fix is that qemu tells the guest that the mask bit is
supported and catches the mask bit toggling before writing it out to
the hardware for those devices which do not support it.

We'll have another look.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Fri, 25 May 2012, Michael S. Tsirkin wrote:
> On Thu, May 24, 2012 at 06:53:15PM -0300, Jan Kiszka wrote:
> > On 2012-05-24 18:39, Thomas Gleixner wrote:
> > > On Thu, 24 May 2012, Alex Williamson wrote:
> > > > On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > > > > +	if (address == msi_start + PCI_MSI_DATA_32)
> > > > > +		handle_cfg_write_msi(pci_dev, assigned_dev);
> > > >
> > > > Why didn't we just use range_covers_byte(address, len,
> > > > pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does
> > > > this handle the enable bit?
> > >
> > > The problem with the current implementation is that it only
> > > changes the routing if the msi entry goes from masked to unmasked
> > > state.
> > >
> > > Linux does not mask the entries on affinity changes and never
> > > did, neither for MSI nor for MSI-X. I know it's probably not
> > > according to the spec, but we can't fix that retroactively.
> >
> > For MSI, this is allowed. For MSI-X, this would clearly be a Linux
> > bug, waiting for hardware to dislike this spec violation. However,
> > if this is the current behavior of such a prominent guest, I guess
> > we have to stop optimizing the QEMU MSI-X code that it only updates
> > routings on mask changes. Possibly other OSes get this wrong too...
>
> Very strange, a clear spec violation. I'll have to dig in the source
> to verify this.

Stop digging. MSI-X is correct.
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Fri, 25 May 2012, Thomas Gleixner wrote:
On Fri, 25 May 2012, Michael S. Tsirkin wrote:
On Thu, May 24, 2012 at 06:53:15PM -0300, Jan Kiszka wrote:
On 2012-05-24 18:39, Thomas Gleixner wrote:
On Thu, 24 May 2012, Alex Williamson wrote:
On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:

	+if (address == msi_start + PCI_MSI_DATA_32)
	+	handle_cfg_write_msi(pci_dev, assigned_dev);

Why didn't we just use range_covers_byte(address, len, pci_dev->msi_cap + PCI_MSI_DATA_32) to start with?

But how does this handle the enable bit? The problem with the current implementation is that it only changes the routing if the msi entry goes from masked to unmasked state. Linux does not mask the entries on affinity changes and never did, neither for MSI nor for MSI-X. I know it's probably not according to the spec, but we can't fix that retroactively.

For MSI, this is allowed. For MSI-X, this would clearly be a Linux bug, waiting for hardware to dislike this spec violation. However, if this is the current behavior of such a prominent guest, I guess we have to stop optimizing the QEMU MSI-X code that it only updates routings on mask changes. Possibly other OSes get this wrong too...

Very strange, a clear spec violation. I'll have to dig in the source to verify this.

Stop digging. MSI-X is correct.

This was based off an older version of qemu-kvm, where the routing for MSI-X was broken for other reasons. But that seems to be fixed now. I use the age excuse :)

Thanks, tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
On Fri, 2012-05-25 at 01:01 +0200, Thomas Gleixner wrote:

So the proper fix is that qemu tells the guest that mask bit is supported and catches the mask bit toggling before writing it out to the hardware for those devices which do not support it.

We can't necessarily do that, we have to work with the config space we're given. Using the smallest possible MSI capability always works. Adding mask bits may not fit in with the existing capabilities of the physical device. Thanks,

I see what you mean. A random device driver of a random guest OS might rely on that information. Unlikely, but ...

So we need some logic to circumvent the masked/unmasked logic in case that property is not set, right ?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Avi Kivity wrote:
On 05/22/2012 08:26 PM, Thomas Gleixner wrote:
On Tue, 22 May 2012, Avi Kivity wrote:
On 05/22/2012 12:04 AM, Thomas Gleixner wrote:

The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

The APIC page is in fact mapped to the hardware (not the guest, but vmx microcode does access it). Only one register, the TPR, is ever used. It's possible to re-layout the data structure so that the TPR stays in the same place while everything else becomes contiguous, but we'll have to do it again if the hardware starts mapping more registers.

I would avoid that by having a compressed version which reflects the SW state and the mapped one which allows the vmx microcode to fiddle with the TPR. If you need more registers in the HW page then you don't have to worry about the layout and just have a proper accessor for that.

That works, but replaces one problem with another: now we have two sources for the same data, and need to juggle between them depending on register number (either synchronizing in both directions, or special casing); so you're simplifying one thing at the expense of the other. If the microcode starts accessing more registers, then having two layouts becomes even uglier.

Fair enough :)
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Avi Kivity wrote:
On 05/23/2012 05:48 PM, Ingo Molnar wrote:

This is silly. Most of the time the kernel is advanced by incremental patches. Sometimes it is advanced by minor or major refactoring. It is never advanced by personal attacks on contributors.

Thomas wasn't so much doing a personal attack, it was pointing out stupidity and then it was mocking the repeated stupidity.

He very politely explained his point of view (with which I agree),

I guess we disagree on what is polite or not. Mocking, for example, isn't part of it in my book.

I really avoid flaming as far as it goes, but I consider that ignoring a thorough patch review and replying only on the very obvious problem is a massive form of impoliteness.

Replying on a still polite reminder with a sloppy "I just took what's there and implemented the optimization which I was tasked with" is even more of an offense. It clearly shows that there is no interest in making the whole thing better and just aims for quick and dirty "here are my benchmark results" success. This is what actually triggered me to switch into grumpy mode.

I'm really tired of that attitude. It's the root cause for the steadily increasing mess in the kernel. It forces people who actually care to waste an endless amount of time to clean up what has been just slapped into the code base quick and dirty. And you expect those people to stay calm and polite if they get ignored and ridiculed with sloppy replies? You may be a saint, but I'm definitely too old to cope with that.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, H. Peter Anvin wrote:
On 05/23/2012 11:37 AM, Thomas Gleixner wrote:

That works, but replaces one problem with another: now we have two sources for the same data, and need to juggle between them depending on register number (either synchronizing in both directions, or special casing); so you're simplifying one thing at the expense of the other. If the microcode starts accessing more registers, then having two layouts becomes even uglier.

Fair enough :)

Yes, the µcode accessing this data structure directly probably falls under the category of a legitimate need to stick to the hardware format.

Thought more about that. We have a clear distinction between HW accessed data and software accessed data. If I look at TPR then it is special cased already and it does:

	case APIC_TASKPRI:
		report_tpr_access(apic, false|true);
		/* fall thru */

And the fall through is using the general accessor for all not special cased registers. So all you have to do is

	case APIC_TASKPRI:
		report_tpr_access(apic, false|true);
	+	return access_mapped_reg(...);

Instead of the fall through. So there is no synchronizing back and forth problem simply because you already have a special case for that register.

I know you'll argue that the tpr reporting is a special hack for windows guests, at least that's what the changelog tells. But even if we have a few more registers accessed by hardware and if they do not require a special casing, I really doubt that the overhead of special casing those few regs will be worse than not having the obvious optimization in place.

And looking deeper it's a total non issue. The apic mapping is 4k. The register stride is strictly 0x10. That makes a total of 256 possible registers. So now you have two possibilities:

1) Create a 256 bit == 32 byte bitfield to select the one or the other representation. The overhead of checking the bit is not going to be observable.

2) Create a 256 function pointer array == 2k resp. 1k (64 / 32bit). That's not a lot of memory even if you have to maintain two separate variants for read and write, but it allows you to get rid of the already horribly compiled switch case in apic_read/write and you'll get the optional stuff like report_tpr_access() w/o extra conditionals just for free.

An extra goodie is that you can catch any access to a non existing register which you now just silently ignore. And that allows you to react on any future hardware oddities without adding a single runtime conditional.

This is strictly x86 and x86 is way better at dealing with indirect calls than with the mess gcc creates compiling those switch case constructs. So I'd go for that and rather spend the memory and the time in setting up the function pointers on init/ioctl than dealing with the inconsistency of HW/SW representation with magic hacks.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Michael S. Tsirkin wrote:
On Wed, May 23, 2012 at 10:10:27PM +0200, Thomas Gleixner wrote:

Replying on a still polite reminder with a sloppy "I just took what's there and implemented the optimization which I was tasked with" is even more of an offense.

Ow. That 'not my fault' line was a joke.

Which I obviously missed to get after spending quite some time to grok the whole thing. I still don't get it. Must be a cultural or an age thing or both.
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Avi Kivity wrote:
On 05/22/2012 12:04 AM, Thomas Gleixner wrote:

The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

The APIC page is in fact mapped to the hardware (not the guest, but vmx microcode does access it). Only one register, the TPR, is ever used. It's possible to re-layout the data structure so that the TPR stays in the same place while everything else becomes contiguous, but we'll have to do it again if the hardware starts mapping more registers.

I would avoid that by having a compressed version which reflects the SW state and the mapped one which allows the vmx microcode to fiddle with the TPR. If you need more registers in the HW page then you don't have to worry about the layout and just have a proper accessor for that.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Mon, 21 May 2012, Michael S. Tsirkin wrote:

We perform ISR lookups twice: during interrupt injection and on EOI. Typical workloads only have a single bit set there. So we can avoid ISR scans by

1. counting bits as we set/clear them in ISR
2. if count is 1, caching the vector number
3. if count != 1, invalidating the cache

The real purpose of this is enabling PV EOI which needs to quickly validate the vector. But non PV guests also benefit.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 arch/x86/kvm/lapic.c | 51 -
 arch/x86/kvm/lapic.h |  2 +
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..232950a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -107,6 +107,16 @@ static inline void apic_clear_vector(int vec, void *bitmap)
 	clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }

+static inline int __apic_test_and_set_vector(int vec, void *bitmap)
+{
+	return __test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
+{
+	return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
 static inline int apic_hw_enabled(struct kvm_lapic *apic)
 {
 	return (apic)->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
@@ -210,6 +220,16 @@ static int find_highest_vector(void *bitmap)
 	return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
 }

+static u8 count_vectors(void *bitmap)
+{
+	u32 *word = bitmap;
+	int word_offset;
+	u8 count = 0;
+	for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+		count += hweight32(word[word_offset << 2]);
+	return count;
+}

Why do you need to reimplement bitmap_weight()?

Because your bitmap is not a bitmap, but a set of 32bit registers which have an offset of 4 words each. I really had to twist my brain around this function and the test_and_clr/set ones above just because the name bitmap is so horribly misleading. Add an extra bonus for using void pointers.

Yes, I know, the existing code is full of this, but that's not an excuse to add more of it.

This emulation is just silly. Nothing in an emulation where the access to the emulated hardware is implemented with vmexits is requiring a 1:1 memory layout. It's completely irrelevant whether the hardware has an ISR regs offset of 0x10 or not. Nothing prevents the emulation to use a consecutive bitmap for the vector bits instead of reimplementing a boatload of bitmap operations. The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

And if you look deeper, then you'll notice that _ALL_ APIC registers are on a 16 byte boundary (thanks Peter for pointing that out). So it's even more silly to have a 1:1 representation instead of implementing the default emulated apic_read/write functions to access (offset >> 2).

And of course, that design decision causes lookups to be slow. It's way faster to scan a consecutive bitmap than iterating through eight 32bit words with an offset of 0x10, especially on a 64 bit host.

 static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
 {
 	apic->irr_pending = true;
@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

+	if (likely(apic->isr_count == 1))
+		apic->isr_cache = vec;
+	else
+		apic->isr_cache = -1;
+}
+
+static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
+{
+	if (__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
+		--apic->isr_count;
+	ASSERT(apic->isr_count >= 0);

Ditto.

 	result = find_highest_vector(apic->regs + APIC_ISR);
 	ASSERT(result == -1 || result >= 16);

And obviously none of the folks who added this gem bothered to define DEBUG either.

So please instead of working around horrid design decisions and adding more mess to the existing one, can we please clean up the stuff first and then decide whether it's worth to add the extra magic?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems
Re: [PATCH] kvm: optimize ISR lookups
Michael,

On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :)

Interesting. The other changes you added are not your fault? So you provided a patch which you didn't author completely. You merely added the bogus ASSERTs, right ?

Can you please explain why there is only your SOB on that patch?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :) Seriously, if Avi/Marcelo want to rewrite the ISR emulation

Interesting POV, really. Did you ever notice that the kernel is a collaborative effort and not controlled by Avi/Marcelo?

Did you ever notice that arch/x86/kvm is part of arch/x86? And if you did, did you notice that there are maintainers responsible for that code base?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :) Seriously, if Avi/Marcelo want to rewrite the ISR emulation I can help. If they want to keep it 1:1 with hardware then what I wrote seems the only way.

Seriously. You submitted a code monkey patch, solving a problem by curing the symptom, but not the root cause. And then you hide behind Avi and Marcelo?

Did you ever think about the real problem of that lapic emulation? Let's assume you did and it occurred to you that the whole thing is wrong and backwards, then you still insist on adding more bullshit to that file instead of optimizing and fixing it in the first place?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:
On Mon, 21 May 2012, Michael S. Tsirkin wrote:

+static u8 count_vectors(void *bitmap)
+{
+	u32 *word = bitmap;
+	int word_offset;
+	u8 count = 0;
+	for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+		count += hweight32(word[word_offset << 2]);
+	return count;
+}

Why do you need to reimplement bitmap_weight()? Because your bitmap is not a bitmap,

It's an existing bitmap. It's not my bitmap :)

Dammit. You added a function:

+static u8 count_vectors(void *bitmap)

So it's YOUR bitmap. It's YOUR fault that you copied mindlessly the existing crap. And you just copied it to push performance without even thinking about the underlying problems. And now you expect me to accept this just because someone else missed to use his brain when implementing the existing nonsense ?

Yes, I know, the existing code is full of this, but that's not an excuse to add more of it.

There's no other way to use existing code. If current code is reworked to use bitmap.h then my patch can use it too.

Then sit down and rework the existing code instead of insisting on something which makes the code worse than it is already.

This emulation is just silly. Nothing in an emulation where the access to the emulated hardware is implemented with vmexits is requiring a 1:1 memory layout. It's completely irrelevant whether the hardware has an ISR regs offset of 0x10 or not. Nothing prevents the emulation to use a consecutive bitmap for the vector bits instead of reimplementing a boatload of bitmap operations. The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

I think the reason is __apic_read which now simply copies the registers out to guest, this code will become less straight-forward if it's not 1:1.

Oh yes. You didn't even read the few lines below, which explain what's wrong and why the existing code is less straight forward than optimized code.

And if you look deeper, then you'll notice that _ALL_ APIC registers are on a 16 byte boundary (thanks Peter for pointing that out). So it's even more silly to have a 1:1 representation instead of implementing the default emulated apic_read/write functions to access (offset >> 2). And of course, that design decision causes lookups to be slow.

Yes, it might be one of the reasons why my patch helps so much: it adds a cache in front of this data structure.

Obviously you didn't even try to answer my observations/questions, you are just justifying your hackery.

So what you propose is in fact to rework apic registers at least for ISR,IRR,TMR to use a bitmap.

This is the final confirmation that you never tried to understand my reply. Hint: s/at least.*//

I am fine with this suggestion but would like some feedback from kvm maintainers on where they want to go before I spend time on that.

Are the kvm maintainers controlling your common sense or what ?

Thanks, tglx
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012, Ingo Molnar wrote:

* Avi Kivity a...@redhat.com wrote:

PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit.

I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.

I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM makes good use of it then that's a different matter.

Aside of that, it's kinda strange that a dude named Nikunj is referenced in the argument chain, but I can't find him on the CC list.

Thanks, tglx
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On Sun, 1 Apr 2012, Avi Kivity wrote:
On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
On Fri, 30 Mar 2012, H. Peter Anvin wrote:

What is the current status of this patchset? I haven't looked at it too closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach or not. If I look at it from the non virtualized kernel side then this is ass backwards. We know already that we are holding a spinlock which might cause other (v)cpus going into eternal spin. The non virtualized kernel solves this by disabling preemption and therefore getting out of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT kernels are observing where non raw spinlocks are turned into sleeping spinlocks and therefore can cause throughput issues for non RT workloads. Though the virtualized situation is even worse. Any preempted guest section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but not solve it. With massive overcommit there is always a way to trigger worst case scenarios unless you are educating the scheduler to cope with that.

So if we need to fiddle with the scheduler and frankly that's the only way to get a real gain (the numbers, which are achieved by this patches, are not that impressive) then the question arises whether we should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check that in the guest just by looking at some global state. So if a guest exits due to an external event it's easy to inspect the state of that guest and avoid to schedule away when it was interrupted in a spinlock held section. That guest/host shared state needs to be modified to indicate the guest to invoke an exit when the last nested lock has been released.

Interesting idea (I think it has been raised before btw, don't recall by who).

Someone posted a pointer to that old thread.

One thing about it is that it can give many false positives. Consider a fine-grained spinlock that is being accessed by many threads. That is, the lock is taken and released with high frequency, but there is no contention, because each vcpu is accessing a different instance. So the host scheduler will needlessly delay preemption of vcpus that happen to be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there, but per cpu stuff might be worth annotating to avoid this.

A second issue may happen with a lock that is taken and released with high frequency, with a high hold percentage. The host scheduler may always sample the guest in a held state, leading it to conclude that it's exceeding its timeout when in fact the lock is held for a short time only.

Well, no. You can avoid that.

    host                        VCPU
                                spin_lock()
                                 modify_global_state()
    exit
    check_global_state()
    mark_global_state()
    reschedule vcpu
                                spin_unlock()
                                 check_global_state()
                                  trigger_exit()

So when an exit occurs in the locked section, then the host can mark the global state to tell the guest to issue a trap on unlock.

Of course this needs to be time bound, so a rogue guest cannot monopolize the cpu forever, but that's the least to worry about problem simply because a guest which does not get out of a spinlocked region within a certain amount of time is borked and eligible to killing anyway.

Hopefully not killing! Just because a guest doesn't scale well, or even if it's deadlocked, doesn't mean it should be killed. Just preempt it. :)

Thoughts ?

It's certainly interesting. Maybe a combination is worthwhile - prevent lockholder preemption for a short period of time AND put waiters to sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock ordering support to avoid that guests are completely starved.

Thanks, tglx
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On Fri, 30 Mar 2012, H. Peter Anvin wrote:

What is the current status of this patchset? I haven't looked at it too closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach or not. If I look at it from the non virtualized kernel side then this is ass backwards. We know already that we are holding a spinlock which might cause other (v)cpus going into eternal spin. The non virtualized kernel solves this by disabling preemption and therefore getting out of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT kernels are observing where non raw spinlocks are turned into sleeping spinlocks and therefore can cause throughput issues for non RT workloads. Though the virtualized situation is even worse. Any preempted guest section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but not solve it. With massive overcommit there is always a way to trigger worst case scenarios unless you are educating the scheduler to cope with that.

So if we need to fiddle with the scheduler and frankly that's the only way to get a real gain (the numbers, which are achieved by this patches, are not that impressive) then the question arises whether we should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check that in the guest just by looking at some global state. So if a guest exits due to an external event it's easy to inspect the state of that guest and avoid to schedule away when it was interrupted in a spinlock held section. That guest/host shared state needs to be modified to indicate the guest to invoke an exit when the last nested lock has been released.

Of course this needs to be time bound, so a rogue guest cannot monopolize the cpu forever, but that's the least to worry about problem simply because a guest which does not get out of a spinlocked region within a certain amount of time is borked and eligible to killing anyway.

Thoughts ?

Thanks, tglx
Re: x86: kvmclock: abstract save/restore sched_clock_state
On Tue, 7 Feb 2012, Marcelo Tosatti wrote:

Upon resume from hibernation, CPU 0's hvclock area contains the old values for system_time and tsc_timestamp. It is necessary for the hypervisor to update these values with up-to-date ones before the CPU uses them.

Abstract TSC's save/restore sched_clock_state functions and use restore_state to write to KVM_SYSTEM_TIME MSR, forcing an update. Fixes suspend-to-disk with kvmclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 15d9915..c91e8b9 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -61,7 +61,7 @@ extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 extern int notsc_setup(char *);

-extern void save_sched_clock_state(void);
-extern void restore_sched_clock_state(void);
+extern void tsc_save_sched_clock_state(void);
+extern void tsc_restore_sched_clock_state(void);

 #endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5d0afac..baaca8d 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -162,6 +162,8 @@ struct x86_cpuinit_ops {
  * @is_untracked_pat_range	exclude from PAT logic
  * @nmi_init			enable NMI on cpus
  * @i8042_detect		pre-detect if i8042 controller exists
+ * @save_sched_clock_state:	save state for sched_clock() on suspend
+ * @restore_sched_clock_state:	restore state for sched_clock() on resume
  */
 struct x86_platform_ops {
 	unsigned long (*calibrate_tsc)(void);
@@ -173,6 +175,8 @@ struct x86_platform_ops {
 	void (*nmi_init)(void);
 	unsigned char (*get_nmi_reason)(void);
 	int (*i8042_detect)(void);
+	void (*save_sched_clock_state)(void);
+	void (*restore_sched_clock_state)(void);
 };

 struct pci_dev;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ca4e735..57e6b78 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -136,6 +136,15 @@ int kvm_register_clock(char *txt)
 	return ret;
 }

+void kvm_save_sched_clock_state(void)

static ?

+{
+}
+
+void kvm_restore_sched_clock_state(void)

Ditto

+{
+	kvm_register_clock("primary cpu clock, resume");
+}
+

Otherwise:

Reviewed-by: Thomas Gleixner t...@linutronix.de
Re: [PATCH] Introduce x86_cpuinit.early_percpu_clock_init hook
On Tue, 7 Feb 2012, Igor Mammedov wrote:

> When a kvm guest uses kvmclock, it may hang on vcpu hot-plug. This is
> caused by an overflow in pvclock_get_nsec_offset,
>
>     u64 delta = tsc - shadow->tsc_timestamp;
>
> which in turn is caused by undefined values from the percpu hv_clock
> that hasn't been initialized yet. The uninitialized clock on the cpu
> being booted is accessed from
>
>    start_secondary
>     -> smp_callin
>       -> smp_store_cpu_info
>         -> identify_secondary_cpu
>           -> mtrr_ap_init
>             -> mtrr_restore
>               -> stop_machine_from_inactive_cpu
>                 -> queue_stop_cpus_work
>                 ...
>                   -> sched_clock
>                     -> kvm_clock_read
>
> which is well before the x86_cpuinit.setup_percpu_clockev call in
> start_secondary, where the percpu clock is initialized.
>
> This patch introduces a hook that allows to setup/initialize the
> per_cpu clock early and avoid an overflow due to reading
>  - undefined values
>  - old values if the cpu was offlined and then onlined again
>
> Another possible early user of this clock source is ftrace, which
> accesses it to get timestamps for ring buffer entries. So if
> mtrr_ap_init is moved from identify_secondary_cpu to past
> x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace may cause
> the same overflow/hang on cpu hot-plug anyway.
>
> More complete description of the problem:
>   https://lkml.org/lkml/2012/2/2/101
>
> Credits to Marcelo Tosatti <mtosa...@redhat.com> for the hook idea.
>
> Signed-off-by: Igor Mammedov <imamm...@redhat.com>
> ---
>  arch/x86/include/asm/x86_init.h |    2 ++
>  arch/x86/kernel/kvmclock.c      |    4 +---
>  arch/x86/kernel/smpboot.c       |    1 +
>  arch/x86/kernel/x86_init.c      |    1 +
>  4 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index 517d476..5d0afac 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -145,9 +145,11 @@ struct x86_init_ops {
>  /**
>   * struct x86_cpuinit_ops - platform specific cpu hotplug setups
>   * @setup_percpu_clockev:	set up the per cpu clock event device
> + * @early_percpu_clock_init:	early init of the per cpu clock event device

You initialize the per cpu clock, not the per cpu clock event
device. The latter is still initialized via setup_percpu_clockev().

Otherwise

   Acked-by: Thomas Gleixner <t...@linutronix.de>
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 16.12.2010 21:26, Jan Kiszka wrote:
> > Am 16.12.2010 14:13, Thomas Gleixner wrote:
> > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> > > > +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> > > > +		/*
> > > > +		 * Signal the old handler that it has to switch to shareable
> > > > +		 * handling mode. Disable the line to avoid any conflict with
> > > > +		 * a real IRQ.
> > > > +		 */
> > > > +		disable_irq(irq);
> > >
> > > This is weird, really. I thought you wanted to avoid waiting for
> > > the threaded handler to finish if it's on the fly. So this should
> > > be disable_irq_nosync() or did you change your mind ?
> >
> > No, I did not. I wanted to avoid that we set MAKE_SHAREABLE while
> > there might be another IRQ in flight. The handler that is called due
> > to a real IRQ might misinterpret MAKE_SHAREABLE as "there is no real
> > event" and perform the wrong steps (at least the current
> > implementation for KVM would).
>
> Actually, the requirement we have to fulfill here is to avoid that the
> hardirq handler sees !SHARED while the threaded one reads SHARED. To
> achieve this without disabling the line, I'm still searching for a way
> to couple the sharing state of associated hard and threaded handler
> runs - but I think there is no reliable association, is there?

Unfortunately not. So the only way to solve that is disabling the
interrupt which makes sure that all handlers have completed.

OTOH, if we have to disable anyway, then we could simply keep it
disabled across the installation of a new handler. That would make the
notification business go away, wouldn't it ?

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 11:23, Thomas Gleixner wrote:
> > OTOH, if we have to disable anyway, then we could simply keep it
> > disabled across the installation of a new handler. That would make
> > the notification business go away, wouldn't it ?
>
> No, the notification is still necessary in case the registered handler
> keeps the line off after returning from both hard and threaded
> handler.

And how should that happen? If it is in oneshot mode then the line is
reenabled when the thread handler returns.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 11:41, Thomas Gleixner wrote:
> > On Fri, 17 Dec 2010, Jan Kiszka wrote:
> > > Am 17.12.2010 11:23, Thomas Gleixner wrote:
> > > > OTOH, if we have to disable anyway, then we could simply keep it
> > > > disabled across the installation of a new handler. That would
> > > > make the notification business go away, wouldn't it ?
> > >
> > > No, the notification is still necessary in case the registered
> > > handler keeps the line off after returning from both hard and
> > > threaded handler.
> >
> > And how should that happen? If it is in oneshot mode then the line
> > is reenabled when the thread handler returns.
>
> disable_irq_nosync is called by the handler before returning. And it's
> the handler's job to revert this, properly synchronizing it
> internally.

disable_irq_nosync() is really the worst thing to do. That's simply
not going to work without a lot of fuglyness. What about the
following:

primary_handler()
{
	if (!shared)
		return IRQ_WAKE_THREAD;

	spin_lock(&dev->irq_lock);
	if (from_my_device() || dev->irq_thread_waiting) {
		mask_dev();
		dev->masked = true;
		ret = IRQ_WAKE_THREAD;
	} else
		ret = IRQ_NONE;
	spin_unlock(&dev->irq_lock);
	return ret;
}

check_timeout()
{
	if (dev->irq_active && wait_longer())
		return IRQ_WAKE_THREAD;
	return IRQ_HANDLED;
}

unmask_dev_if_necessary()
{
	if (dev->masked && dev->irq_active)
		unmask_dev();
}

threaded_handler()
{
	if (!dev->irq_thread_waiting) {
		spin_lock_irq(&dev->irq_lock);
		wake_user = do_magic_stuff_with_the_dev();
		dev->irq_thread_waiting = wake_user;
		spin_unlock(&dev->irq_lock);
		if (wake_user)
			wake_up(user);
	}

	if (!dev->irq_thread_waiting) {
		spin_lock_irq(&dev->irq_lock);
		unmask_dev_if_necessary();
		spin_unlock(&dev->irq_lock);
		return IRQ_HANDLED;
	}

	/*
	 * Wait for user space to complete. Timeout is to
	 * avoid starvation of the irq line when
	 * something goes wrong
	 */
	wait_for_completion_timeout(&dev->compl, SENSIBLE_TIMEOUT);

	spin_lock_irq(&dev->irq_lock);
	if (timedout) {
		mask_dev();
		dev->masked = true;
		/*
		 * Leave dev->irq_thread_waiting untouched and let
		 * the core code reschedule us when check_timeout
		 * decides it's worth to wait. In any case we leave
		 * the device masked at the device level, so we don't
		 * cause an interrupt storm.
		 */
		ret = check_timeout();
	} else {
		unmask_dev_if_necessary();
		dev->irq_thread_waiting = false;
		ret = IRQ_HANDLED;
	}
	spin_unlock(&dev->irq_lock);
	return ret;
}

userspace_complete()
{
	complete(&dev->irq_compl);
}

Your approach with disable_irq_nosync() is completely flawed, simply
because you try to pretend that your interrupt handler is done, while
it is not done at all. The threaded interrupt handler is done when
user space completes. Everything else is just hacking around the
problem and creates all those nasty transitional problems. The above
code does not have them at all.

The threaded handler does not care at all about the dev_id shared
state encoding and the state transitions. It merely cares about the
device internal status dev->masked. Everything else is handled by the
genirq code and the little status check in the primary handler.

Neither does the user space completion care about it. It just
completes and is completely oblivious of the irq line state/mode. And
really, the user space part should not care at all. It can set some
status before calling complete(), but that's driver specific stuff and
has nothing to do with the irq handling magic.

It requires a few not too intrusive modifications to the genirq code:

 - Rescheduling the thread handler on IRQ_WAKE_THREAD
 - Changing the oneshot finalizing a bit
 - Adding the status manipulations for request/free_irq

Now you might argue that the timeout is ugly, but I don't think it's
ugly at all. You need it anyway in case user space failed completely.
And coming back after 100ms to let the genirq code handle a
disable_irq() or synchronize_irq() is a reasonable request, it's the
error/corner case after all. If there is code which installs/removes
an interrupt handler on the same line every 5ms, then this code
becomes rightfully blocked out for 100ms or such.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 16:25, Thomas Gleixner wrote:
> > Your approach with disable_irq_nosync() is completely flawed, simply
> > because you try to pretend that your interrupt handler is done,
> > while it is not done at all. The threaded interrupt handler is done
> > when user space completes. Everything else is just hacking around
> > the problem and creates all those nasty transitional problems.
>
> disable_irq_nosync is the pattern currently used in KVM, it's nothing
> new in fact.

That does not make it less flawed :)

> The approach looks interesting but requires separate code for
> non-PCI-2.3 devices, i.e. when we have no means to mask at device
> level.

Why? You can have the same code, you just can't request COND_ONESHOT
handlers for it, so it needs unshared ONESHOT or it won't work at all,
no matter what approach you chose. No device level mask, no sharing,
it's that simple.

> Further drawbacks - unless I missed something on first glance:
>  - prevents any future optimizations that would work without IRQ
>    thread ping-pong (ie. once we allow guest IRQ injection from
>    hardirq context for selected but typical setups)
>  - two additional, though light-weight, context switches on each
>    interrupt completion

The drawback of these two points is way less than the horror which you
need to introduce to do the whole async handler disable, userspace
enable dance. Robust and simple solutions are really preferred over
complex and fragile horror which has a questionable runtime benefit.

>  - continuous polling if user space decides to leave the interrupt
>    unhandled (e.g. because the virtual IRQ line is masked)

That should be a solvable problem.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> +		/*
> +		 * Signal the old handler that it has to switch to shareable
> +		 * handling mode. Disable the line to avoid any conflict with
> +		 * a real IRQ.
> +		 */
> +		disable_irq(irq);

This is weird, really. I thought you wanted to avoid waiting for the
threaded handler to finish if it's on the fly. So this should be
disable_irq_nosync() or did you change your mind ?

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 09:05, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 22:46, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > From: Jan Kiszka <jan.kis...@siemens.com>
> > > > >
> > > > >  	chip_bus_lock(desc);
> > > > >  	retval = __setup_irq(irq, desc, action);
> > > > >  	chip_bus_sync_unlock(desc);
> > > > >
> > > > > -	if (retval)
> > > > > +	if (retval) {
> > > > > +		if (desc->action && !desc->action->next)
> > > > > +			desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > >
> > > > This is redundant. IRQS_SHARED gets set in a code path where all
> > > > checks are done already.
> > >
> > > Nope, it's also set before entry of __setup_irq in case we call an
> > > IRQF_ADAPTIVE handler. We need to set it that early as we may race
> > > with IRQ events for the already registered handler happening
> > > between the sharing notification and the actual registration of
> > > the second handler.
> >
> > Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do
> > the notification.
>
> For notification, yes. But we need SHARED once we reenable the line
> after the notification.

Darn. Will think more about it.
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > >  	/* Make sure it's not being used on another CPU: */
> > >  	synchronize_irq(irq);
> > >
> > > +	if (single_handler)
> > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > +
> >
> > What's the reason to clear this flag outside of the desc->lock held
> > region.
>
> We need to synchronize the irq first before clearing the flag. The
> problematic scenario behind this: An IRQ started in shared mode, thus
> the line was unmasked after the hardirq. Now we clear IRQS_SHARED
> before calling into the threaded handler. And that handler may now
> think that the line is still masked as IRQS_SHARED is set.

That should read "not set" I guess. Hmm, needs more thoughts :(

> > I need this status for other purposes as well, where I definitely
> > need serialization.
>
> Well, two options: wrap all bit manipulations with desc->lock
> acquisition/release or turn drv_status into an atomic. I don't know
> what your plans with drv_status are, so...

Some bits for irq migration and other stuff, which allows us to avoid
fiddling with irqdesc in the drivers.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > > > >  	/* Make sure it's not being used on another CPU: */
> > > > >  	synchronize_irq(irq);
> > > > >
> > > > > +	if (single_handler)
> > > > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > > > +
> > > >
> > > > What's the reason to clear this flag outside of the desc->lock
> > > > held region.
> > >
> > > We need to synchronize the irq first before clearing the flag. The
> > > problematic scenario behind this: An IRQ started in shared mode,
> > > thus the line was unmasked after the hardirq. Now we clear
> > > IRQS_SHARED before calling into the threaded handler. And that
> > > handler may now think that the line is still masked as IRQS_SHARED
> > > is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.

Right.

As a side note, the current implementation requires that you look up
irq_data.drv_status for every invocation of the handler or have a
reference to irq_data.drv_status somewhere locally, which I don't like
either. I have an evil and nasty idea how to avoid that, need to look
how ugly it gets. Worst case we need to go back to that notification
thing which I really wanted to avoid in the first place. Though I like
the register_mutex idea which came out of this a lot, as it allows us
to reduce desc->lock held and interrupt disabled regions quite nicely.

/me goes back to stare at the code

> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.

Bah, my brain became pain resistant when I started hacking that code.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > > > >  	/* Make sure it's not being used on another CPU: */
> > > > >  	synchronize_irq(irq);
> > > > >
> > > > > +	if (single_handler)
> > > > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > > > +
> > > >
> > > > What's the reason to clear this flag outside of the desc->lock
> > > > held region.
> > >
> > > We need to synchronize the irq first before clearing the flag. The
> > > problematic scenario behind this: An IRQ started in shared mode,
> > > thus the line was unmasked after the hardirq. Now we clear
> > > IRQS_SHARED before calling into the threaded handler. And that
> > > handler may now think that the line is still masked as IRQS_SHARED
> > > is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.
>
> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.

Talking about headache. Your solution above does not prevent that
scenario:

    CPU 0                       CPU 1

    synchronize_irq();
                                hard irq comes in
                                sees shared and unmasks
    clear IRQS_SHARED
                                thread handler runs
                                and sees !SHARED

Same scenario, just moved by a few lines :)

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 16:41, Thomas Gleixner wrote:
> > Talking about headache. Your solution above does not prevent that
> > scenario:
> >
> >     CPU 0                       CPU 1
> >
> >     synchronize_irq();
> >                                 hard irq comes in
> >                                 sees shared and unmasks
>
> Nope, IRQ_ONESHOT is already cleared at that point.

Errm ? It's set. Could you please stop to increase my confusion ? :)
Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +/**
> + * get_irq_status - read interrupt line status word
> + * @irq:	Interrupt line of the status word
> + *
> + * This returns the current content of the status word associated with
> + * the given interrupt line. See IRQS_* flags for details.
> + */
> +unsigned long get_irq_status(unsigned int irq)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +
> +	return desc ? desc->irq_data.drv_status : 0;
> +}
> +EXPORT_SYMBOL_GPL(get_irq_status);

We should document that this is a snapshot and in no way serialized
against modifications of drv_status. I'll fix up the kernel doc.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>  	/* Make sure it's not being used on another CPU: */
>  	synchronize_irq(irq);
>
> +	if (single_handler)
> +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> +

What's the reason to clear this flag outside of the desc->lock held
region. I need this status for other purposes as well, where I
definitely need serialization.

> +	mutex_lock(&register_lock);
> +
> +	old_action = desc->action;
> +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> +		/*
> +		 * Signal the old handler that it has to switch to shareable
> +		 * handling mode. Disable the line to avoid any conflict with
> +		 * a real IRQ.
> +		 */
> +		disable_irq(irq);
> +		local_irq_disable();
> +
> +		desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;

Unserialized access as well. Will think about it.

> +		old_action->handler(irq, old_action->dev_id);
> +		desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
>  	chip_bus_lock(desc);
>  	retval = __setup_irq(irq, desc, action);
>  	chip_bus_sync_unlock(desc);
>
> -	if (retval)
> +	if (retval) {
> +		if (desc->action && !desc->action->next)
> +			desc->irq_data.drv_status &= ~IRQS_SHARED;

This is redundant. IRQS_SHARED gets set in a code path where all
checks are done already. To make that more obvious we can set it right
before

	raw_spin_unlock_irqrestore(&desc->lock, flags);

conditionally on (shared). That way we can also move the kfree out of
the mutex locked section.

Thanks,

	tglx
Re: [PATCH v3 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> This addresses the review comments of the previous round:
>  - renamed irq_data::status to drv_status
>  - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
>  - fixed signature of get_irq_status (irq is now unsigned int)
>  - converted register_lock into a global one
>  - fixed critical white space breakage (that I just left in to check
>    if anyone is actually reading the code, of course...)

Just for the record, you either missed or introduced some new white
space noise :)
Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> Am 12.12.2010 18:29, Thomas Gleixner wrote:
> > Also we should name it different than status, drv_status perhaps, to
> > avoid confusion with the irq_desc status.
>
> OK, will address both in a succeeding round (just waiting for
> potential further comments).

No further comments from my side ATM.

Thanks,

	tglx
Re: [PATCH v2 2/4] genirq: Inform handler about line sharing state
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> This enables interrupt handlers to retrieve the current line sharing
> state via the new interrupt status word so that they can adapt to it.
>
> The switch from shared to exclusive is generally uncritical and can
> thus be performed on demand. However, preparing a line for shared mode
> may require preparational steps of the currently registered handler.
> It can therefore request an ahead-of-time notification via
> IRQF_ADAPTIVE. The notification consists of an exceptional handler
> invocation with IRQS_MAKE_SHAREABLE set in the status word.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  include/linux/interrupt.h |   10 ++
>  include/linux/irqdesc.h   |    2 ++
>  kernel/irq/irqdesc.c      |    2 ++
>  kernel/irq/manage.c       |   44 +---
>  4 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 16cdbbf..c6323a2 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -55,6 +55,7 @@
>   *		Used by threaded interrupts which need to keep the
>   *		irq line disabled until the threaded handler has been run.
>   * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
> + * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
>   *
>   */
>  #define IRQF_DISABLED		0x0020
> @@ -67,6 +68,7 @@
>  #define IRQF_IRQPOLL		0x1000
>  #define IRQF_ONESHOT		0x2000
>  #define IRQF_NO_SUSPEND		0x4000
> +#define IRQF_ADAPTIVE		0x8000
>
>  #define IRQF_TIMER	(__IRQF_TIMER | IRQF_NO_SUSPEND)
>
> @@ -126,6 +128,14 @@ struct irqaction {
>
>  extern irqreturn_t no_action(int cpl, void *dev_id);
>
> +/*
> + * Driver-readable IRQ line status flags:
> + * IRQS_SHARED - line is shared between multiple handlers
> + * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable
> + */
> +#define IRQS_SHARED		0x0001
> +#define IRQS_MAKE_SHAREABLE	0x0002
> +
>  extern unsigned long get_irq_status(unsigned long irq);
>
>  #ifdef CONFIG_GENERIC_HARDIRQS
> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> index 979c68c..c490e83 100644
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -29,6 +29,7 @@ struct timer_rand_state;
>   * @wait_for_threads:	wait queue for sync_irq to wait for threaded handlers
>   * @dir:		/proc/irq/ procfs entry
>   * @name:		flow handler name for /proc/interrupts output
> + * @register_lock:	protects registration & release, for unshared->shared

I think we can make that a global mutex. request/free_irq are not
hotpath operations which require a mutex per irq descriptor.

Thanks,

	tglx
Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> This associates a status word with every IRQ descriptor. Drivers can
> obtain its content via get_irq_status(irq). First use case will be
> propagating the interrupt sharing state.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  include/linux/interrupt.h |    2 ++
>  include/linux/irq.h       |    2 ++
>  kernel/irq/manage.c       |   15 +++
>  3 files changed, 19 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 79d0c4f..16cdbbf 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -126,6 +126,8 @@ struct irqaction {
>
>  extern irqreturn_t no_action(int cpl, void *dev_id);
>
> +extern unsigned long get_irq_status(unsigned long irq);
> +
>  #ifdef CONFIG_GENERIC_HARDIRQS
>
>  extern int __must_check
>  request_threaded_irq(unsigned int irq, irq_handler_t handler,
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index abde252..5554203 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -96,6 +96,7 @@ struct msi_desc;
>   *			methods, to allow shared chip implementations
>   * @msi_desc:		MSI descriptor
>   * @affinity:		IRQ affinity on SMP
> + * @status:		driver-readable status flags (IRQS_*)
>   *
>   * The fields here need to overlay the ones in irq_desc until we
>   * cleaned up the direct references and switched everything over to
> @@ -108,6 +109,7 @@ struct irq_data {
>  	void			*handler_data;
>  	void			*chip_data;
>  	struct msi_desc		*msi_desc;
> +	unsigned long		status;

That breaks the current code which has irq_data embedded and shadowed
in irq_desc for migratory reasons until all users are fixed up and the
GENERIC_HARDIRQS_NO_DEPRECATED sections are cleaned up. I know it's
ugly, but that was the only way not to rewrite the world in one go. :)

Just move it below affinity. Also we should name it different than
status, drv_status perhaps, to avoid confusion with the irq_desc
status.

>  #ifdef CONFIG_SMP
>  	cpumask_var_t		affinity;
>  #endif

Thanks,

	tglx
Re: [PATCH 0/5] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On Sat, 4 Dec 2010, Jan Kiszka wrote:
> Besides 3 cleanup patches, this series consists of two major changes.
> The first introduces an interrupt sharing notifier to the genirq
> subsystem. It fires when an interrupt line is about to be used by more
> than one driver or the last but one user called free_irq. The second
> major change makes use of this interface in KVM's PCI pass-through
> subsystem. KVM has to keep the interrupt source disabled while calling
> into the guest to handle the event. This can be done at device or line
> level. The former is required to share the interrupt line, the latter
> is an order of magnitude faster (see patch 3 for details).
>
> Beside pass-through support of KVM, further users of the IRQ notifier
> could become VFIO (not yet mainline) and uio_pci_generic which have to
> resolve the same conflict.

Hmm. You basically want to have the following functionality:

If the interrupt is shared, then you want to keep the current
behaviour:

    disable at line level (IRQF_ONESHOT)
    run handler thread (PCI level masking)
    reenable at line level in irq_finalize_oneshot()
    reenable at PCI level when guest is done

If the interrupt is not shared:

    disable at line level (IRQF_ONESHOT)
    run handler thread (no PCI level masking)
    no reenable at line level
    reenable at line level when guest is done

I think the whole notifier approach including the extra irq handlers
plus the requirement to call disable_irq_nosync() from the non shared
handler is overkill. We can be more clever.

The genirq code knows whether you have one or more handlers
registered. So we can add IRQF_ONESHOT_UNMASK_SHARED and add a status
field to irq_data (which I was going to do anyway for other reasons).
In that status field you get a bit which says IRQ_MASK_DEVICE.

So with IRQF_ONESHOT_UNMASK_SHARED == 0 we keep the current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 1 we keep
the current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 0 then
irq_finalize_oneshot() simply marks the interrupt disabled (equivalent
to disable_irq_nosync()) and returns.

Now in your primary irq handler you simply check the IRQ_MASK_DEVICE
status flag and decide whether you need to mask at PCI level or not.
Your threaded handler gets the same information via IRQ_MASK_DEVICE so
it can issue the appropriate user space notification depending on that
flag.

This works fully transparent across adding and removing handlers. On
request_irq/free_irq we update the IRQ_MASK_DEVICE flag with the
following logic:

    nr_actions   IRQF_ONESHOT_UNMASK_SHARED   IRQ_MASK_DEVICE
     1            0                            1
     1            1                            0
    >1            don't care                   1

If interrupts are in flight across request/free then this change takes
effect when the next interrupt comes in.

No notifiers, no disable_irq_nosync() magic, less and simpler code.

Thoughts ?

	tglx
Re: [PATCH 0/5] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
Jan, On Sat, 4 Dec 2010, Jan Kiszka wrote: Am 04.12.2010 11:37, Thomas Gleixner wrote: On Sat, 4 Dec 2010, Jan Kiszka wrote: If interrupt is shared, then you want to keep the current behaviour: disable at line level (IRQF_ONESHOT) run handler thread (PCI level masking) reenable at line level in irq_finalize_oneshot() reenable at PCI level when guest is done If the interrupt is shared, we must mask at PCI level inside the primary handler as we also have to support non-threaded users of the same line. So we always have a transition line-level - device-level masking in a primary handler. Sorry that left out the hard irq part. Of course it needs to do the PCI level masking in the primary one. reduce the latency. So both threaded and non-threaded cases should be addressable by any approach. The oneshot magic should work on non threaded cases as well. Needs some modifications, but nothing complex. If interrupts are in flight accross request/free then this change takes effect when the next interrupt comes in. For registration, that might be too late. We need to synchronize on in-flight interrupts for that line and then ensure that it gets enabled independently of the registered user. That user may have applied outdated information, thus would block the line for too long if user space decides to do something else. No, that's really just a corner case when going from one to two handlers and I don't think it matters much. If you setup a new driver it's not really important whether that first thing comes in a few ms later. Also there is a pretty simple solution for this: The core code knows, that there is an ONESHOT interrupt in flight, so it simply can call the primary handler of that device with the appropriate flag set (maybe an additional one to indicate the transition) and let that deal with it. Needs some thought vs. locking and races, but that shouldn't be hard. 
Pulling the effect of disable_irq_nosync into genirq also for threaded handling would remove the need for re-registering handlers. That's good. But we need to think about the hand-over scenarios again. Maybe solvable, but non-trivial.

See above.

Thanks,

	tglx
Re: [PATCH 0/5] KVM&genirq: Enable adaptive IRQ sharing for passed-through devices
On Sat, 4 Dec 2010, Jan Kiszka wrote: Am 04.12.2010 15:41, Thomas Gleixner wrote:

Also there is a pretty simple solution for this: The core code knows, that there is an ONESHOT interrupt in flight, so it simply can call

It doesn't synchronize the tail part against the masking in the handler(s), that's driver business.

Right, but the core knows from the irq state that the line is masked and due to the ONESHOT or whatever feature it knows that it needs to call the handler. The other way round, shared -> exclusive, does not matter at all.

the primary handler of that device with the appropriate flag set (maybe an additional one to indicate the transition) and let that deal with it. Needs some thought vs. locking and races, but that shouldn't be hard.

Yes, I thought about this kind of transition (re-invoking the existing handler) already. We do need notification of the switch (at least for exclusive -> shared) as only the driver can migrate the masking for in-flight interrupts.

Right. It needs to set the device level mask in that case. As the interrupt handler already has the code to do that it's the most obvious function to call.

Thanks,

	tglx
Re: [patch] x86: kvm: Convert i8254/i8259 locks to raw_spinlocks
On Tue, 23 Feb 2010, Jan Kiszka wrote: Thomas Gleixner wrote:

The i8254/i8259 locks need to be real spinlocks on preempt-rt. Convert them to raw_spinlock. No change for !RT kernels.

Doesn't fly for -rt anymore: pic_irq_update runs under this raw lock and calls kvm_vcpu_kick which tries to wake_up some thread -> scheduling while atomic.

Hmm, a wakeup itself is fine. Is that code waking a wait queue ?

This used to work up to 956f97cf. -rt for 2.6.31 is not yet affected, but 2.6.33 should be broken (haven't checked, using kvm-kmod over 2.6.31 ATM). I can provide a patch that restores the deferred kicking if it's acceptable for upstream.

Well, at least it would be nice to have one for -rt.

Thanks,

	tglx
[patch] x86: kvm: Convert i8254/i8259 locks to raw_spinlocks
The i8254/i8259 locks need to be real spinlocks on preempt-rt. Convert them to raw_spinlock. No change for !RT kernels.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/kvm/i8254.c | 10 +-
 arch/x86/kvm/i8254.h |  2 +-
 arch/x86/kvm/i8259.c | 31 ---
 arch/x86/kvm/irq.h   |  2 +-
 arch/x86/kvm/x86.c   |  8
 5 files changed, 27 insertions(+), 26 deletions(-)

Index: linux-2.6-tip/arch/x86/kvm/i8254.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8254.c
+++ linux-2.6-tip/arch/x86/kvm/i8254.c
@@ -242,11 +242,11 @@ static void kvm_pit_ack_irq(struct kvm_i
 {
 	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
 						 irq_ack_notifier);
-	spin_lock(&ps->inject_lock);
+	raw_spin_lock(&ps->inject_lock);
 	if (atomic_dec_return(&ps->pit_timer.pending) < 0)
 		atomic_inc(&ps->pit_timer.pending);
 	ps->irq_ack = 1;
-	spin_unlock(&ps->inject_lock);
+	raw_spin_unlock(&ps->inject_lock);
 }

 void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
@@ -624,7 +624,7 @@ struct kvm_pit *kvm_create_pit(struct kv
 	mutex_init(&pit->pit_state.lock);
 	mutex_lock(&pit->pit_state.lock);
-	spin_lock_init(&pit->pit_state.inject_lock);
+	raw_spin_lock_init(&pit->pit_state.inject_lock);

 	kvm->arch.vpit = pit;
 	pit->kvm = kvm;
@@ -723,12 +723,12 @@ void kvm_inject_pit_timer_irqs(struct kv
 		/* Try to inject pending interrupts when
 		 * last one has been acked.
 		 */
-		spin_lock(&ps->inject_lock);
+		raw_spin_lock(&ps->inject_lock);
 		if (atomic_read(&ps->pit_timer.pending) && ps->irq_ack) {
 			ps->irq_ack = 0;
 			inject = 1;
 		}
-		spin_unlock(&ps->inject_lock);
+		raw_spin_unlock(&ps->inject_lock);
 		if (inject)
 			__inject_pit_timer_intr(kvm);
 	}

Index: linux-2.6-tip/arch/x86/kvm/i8254.h
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8254.h
+++ linux-2.6-tip/arch/x86/kvm/i8254.h
@@ -27,7 +27,7 @@ struct kvm_kpit_state {
 	u32    speaker_data_on;
 	struct mutex lock;
 	struct kvm_pit *pit;
-	spinlock_t inject_lock;
+	raw_spinlock_t inject_lock;
 	unsigned long irq_ack;
 	struct kvm_irq_ack_notifier irq_ack_notifier;
 };

Index: linux-2.6-tip/arch/x86/kvm/i8259.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8259.c
+++ linux-2.6-tip/arch/x86/kvm/i8259.c
@@ -44,18 +44,19 @@ static void pic_clear_isr(struct kvm_kpi
 	 * Other interrupt may be delivered to PIC while lock is dropped but
 	 * it should be safe since PIC state is already updated at this stage.
 	 */
-	spin_unlock(&s->pics_state->lock);
+	raw_spin_unlock(&s->pics_state->lock);
 	kvm_notify_acked_irq(s->pics_state->kvm, SELECT_PIC(irq), irq);
-	spin_lock(&s->pics_state->lock);
+	raw_spin_lock(&s->pics_state->lock);
 }

 void kvm_pic_clear_isr_ack(struct kvm *kvm)
 {
 	struct kvm_pic *s = pic_irqchip(kvm);
-	spin_lock(&s->lock);
+
+	raw_spin_lock(&s->lock);
 	s->pics[0].isr_ack = 0xff;
 	s->pics[1].isr_ack = 0xff;
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);
 }

 /*
@@ -156,9 +157,9 @@ static void pic_update_irq(struct kvm_pi

 void kvm_pic_update_irq(struct kvm_pic *s)
 {
-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	pic_update_irq(s);
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);
 }

 int kvm_pic_set_irq(void *opaque, int irq, int level)
@@ -166,14 +167,14 @@ int kvm_pic_set_irq(void *opaque, int ir
 	struct kvm_pic *s = opaque;
 	int ret = -1;

-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	if (irq >= 0 && irq < PIC_NUM_PINS) {
 		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
 		pic_update_irq(s);
 		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
 				      s->pics[irq >> 3].imr, ret == 0);
 	}
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);

 	return ret;
 }
@@ -203,7 +204,7 @@ int kvm_pic_read_irq(struct kvm *kvm)
 	int irq, irq2, intno;
 	struct kvm_pic *s = pic_irqchip(kvm);

-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	irq = pic_get_irq(&s->pics[0]);
 	if (irq >= 0) {
 		pic_intack(&s->pics[0], irq);
@@ -228,7 +229,7 @@ int kvm_pic_read_irq(struct kvm *kvm)
 		intno = s->pics[0].irq_base + irq;
 	}
 	pic_update_irq(s);
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);

 	return intno;
 }
@@ -442,7 +443,7 @@ static int picdev_write
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Mon, 30 Nov 2009, Tejun Heo wrote:

Hello,

On 11/28/2009 09:12 PM, Avi Kivity wrote: Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.

It's not a surprise that this breaks the existing code which does the smp function call.

Yes, kvm expects preempt notifiers to be run with irqs enabled. Copying patch author.

Hmmm... then, it's broken both ways. The previous code may get preempted after scheduling but before the notifier is run (which breaks the semantics of the callback horribly), the current code

No, it _CANNOT_ be preempted at that point:

	schedule()
	{
		preempt_disable();
		switch_to();
		preempt_enable();
	}

doesn't satisfy kvm's requirement. Another thing is that in the previous implementation the context is different between the 'in' and 'out' callbacks, which is subtle and nasty. Can kvm be converted to not do smp calls directly?

For the time being, maybe it's best to back out the fix given that the only architecture which may be affected by the original bug is ia64 which is the only one with both kvm and the unlocked context switch.

Do you have a pointer to the original bug report ?

Thanks,

	tglx
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Fri, 27 Nov 2009, Peter Zijlstra wrote: On Fri, 2009-11-27 at 16:03 +0100, Jiri Slaby wrote: On 11/25/2009 01:47 AM, a...@linux-foundation.org wrote: The mm-of-the-moment snapshot 2009-11-24-16-47 has been uploaded to

Hi,

when executing qemu-kvm I often get the following warning and a hard lockup.

WARNING: at kernel/smp.c:292 smp_call_function_single+0xbd/0x140()
Hardware name: To Be Filled By O.E.M.
Modules linked in: kvm_intel kvm fuse ath5k ath
Pid: 3265, comm: qemu-kvm Not tainted 2.6.32-rc8-mm1_64 #912
Call Trace:
 [81039678] warn_slowpath_common+0x78/0xb0
 [a007fd50] ? __vcpu_clear+0x0/0xd0 [kvm_intel]
 [810396bf] warn_slowpath_null+0xf/0x20
 [8106410d] smp_call_function_single+0xbd/0x140
 [a0080af6] vmx_vcpu_load+0x46/0x170 [kvm_intel]
 [a004dd94] kvm_arch_vcpu_load+0x24/0x60 [kvm]
 [a0047a8d] kvm_sched_in+0xd/0x10 [kvm]
 [8102de37] finish_task_switch+0x67/0xc0
 [814699f8] schedule+0x2f8/0x9c0

It is a regression against 2009-11-13-19-59. Any ideas?

Looks like kvm is trying to send an IPI from the preempt notifiers, which are called with IRQs disabled, not a sane thing to do. If they really want that, they'll have to use a pre-allocated struct call_single_data and use __smp_call_function_single.

Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.

It's not a surprise that this breaks the existing code which does the smp function call.
Thanks,

	tglx
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Fri, 27 Nov 2009, Thomas Gleixner wrote: On Fri, 27 Nov 2009, Peter Zijlstra wrote: On Fri, 2009-11-27 at 16:03 +0100, Jiri Slaby wrote: On 11/25/2009 01:47 AM, a...@linux-foundation.org wrote: The mm-of-the-moment snapshot 2009-11-24-16-47 has been uploaded to

Hi,

when executing qemu-kvm I often get the following warning and a hard lockup.

WARNING: at kernel/smp.c:292 smp_call_function_single+0xbd/0x140()
Hardware name: To Be Filled By O.E.M.
Modules linked in: kvm_intel kvm fuse ath5k ath
Pid: 3265, comm: qemu-kvm Not tainted 2.6.32-rc8-mm1_64 #912
Call Trace:
 [81039678] warn_slowpath_common+0x78/0xb0
 [a007fd50] ? __vcpu_clear+0x0/0xd0 [kvm_intel]
 [810396bf] warn_slowpath_null+0xf/0x20
 [8106410d] smp_call_function_single+0xbd/0x140
 [a0080af6] vmx_vcpu_load+0x46/0x170 [kvm_intel]
 [a004dd94] kvm_arch_vcpu_load+0x24/0x60 [kvm]
 [a0047a8d] kvm_sched_in+0xd/0x10 [kvm]
 [8102de37] finish_task_switch+0x67/0xc0
 [814699f8] schedule+0x2f8/0x9c0

It is a regression against 2009-11-13-19-59. Any ideas?

Looks like kvm is trying to send an IPI from the preempt notifiers, which are called with IRQs disabled, not a sane thing to do. If they really want that, they'll have to use a pre-allocated struct call_single_data and use __smp_call_function_single.

Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers.

This is patently wrong btw.

	schedule()
	{
	need_resched:
		preempt_disable();
		task_switch();
		preempt_enable_no_resched();
		if (need_resched())
			goto need_resched;
	}

So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.
It's not a surprise that this breaks the existing code which does the smp function call.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Wed, 7 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:

What about getting rid of the retry loop, instead? So something like:

	- run hrtimer callbacks (once)
	- while (tick_program_event(expires))
		expires = ktime_add_ns(expires, dev->min_delta_ns)

This way there's no static tuning involved.

And what does that buy us ? We get a timer interrupt right away, so it's not that much different from the retry loop. See below.

Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out the timer which is going to expire next to reprogram the hardware. Now when we program that expiry time we find out that the timer is already expired. So instead of programming the hardware to fire an interrupt in the very near future - which you would do with your loop above - we stay in the interrupt handler and expire the timer and any other by now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do this by spreading the timer interrupts out so that the system can do something else than expiring timers.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote:

Yesterday I was lucky enough to actually watch what's going on when the delay actually happens. I run a desktop environment on a kvm virtual machine here. The server is on a diskless terminal, and the rest, incl. the window manager etc, is started from a VM. And yesterday, during normal system load (nothing extra, and not idle either, and all the other guests were running under normal load too), I had a stall of everything on this X session for about 2..3, maybe 5 seconds. It felt like a completely stuck machine. Nothing was moving on the screen, no reaction to the keyboard etc. And after several seconds it returned to normal. With the familiar message in dmesg -- increasing hrtimer etc, to the next 50%. (Without a patch from Marcelo at this time it should increase min_delta to a large number).

To summarize: there's something, well, more interesting going on here. In addition to the scheduling issues that cause timers to be calculated on the wrong CPU etc as

Care to elaborate ?

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote: Thomas Gleixner wrote: On Thu, 8 Oct 2009, Michael Tokarev wrote:

Yesterday I was lucky enough to actually watch what's going on when the delay actually happens. I run a desktop environment on a kvm virtual machine here. The server is on a diskless terminal, and the rest, incl. the window manager etc, is started from a VM. And yesterday, during normal system load (nothing extra, and not idle either, and all the other guests were running under normal load too), I had a stall of everything on this X session for about 2..3, maybe 5 seconds. It felt like a completely stuck machine. Nothing was moving on the screen, no reaction to the keyboard etc. And after several seconds it returned to normal. With the familiar message in dmesg -- increasing hrtimer etc, to the next 50%. (Without a patch from Marcelo at this time it should increase min_delta to a large number).

To summarize: there's something, well, more interesting going on here. In addition to the scheduling issues that cause timers to be calculated on the wrong CPU etc as

Care to elaborate ?

Such huge delays (in terms of seconds, not ms or ns) - I don't understand how such delays can be explained by scheduling to the different cpu etc. That's what I mean. I know very little about all this low-level stuff so I may be completely out of context, but such an explanation does not look right to me, simple as that. By scheduling mistakes we can get mistakes in the range of millisecs, but not secs.

I'm really missing the big picture here. What means "causes timers to be calculated on the wrong CPU etc" ? And what do you consider a scheduling mistake ?

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote: Thomas Gleixner wrote:

I'm really missing the big picture here. What means "causes timers to be calculated on the wrong CPU etc" ? And what do you consider a scheduling mistake ?

From the initial diagnostics by Marcelo:

It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially to virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor). So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between.

From this I conclude that the huge min_delta is due to some other task(s) on the host being run while this guest is in the hrtimer callback. But I fail to see why that process on the host takes SO MUCH time, to warrant a resulting min_delta of 0.5s, or to cause delays of 3..5 seconds in the guest. It's ok to have delays in the range of several extra milliseconds, but *seconds* is too much.

Note again that neither host nor guest is under high load when this jump happens. Also note that there are no high-priority processes running on the host, all are of the same priority level, including all the guests. Note also that so far I only see it on SMP guests, never on UP guests. And only on guests with kvm_clock, not with the acpi_pm clocksource.

What I'm trying to say is that it looks like there's something else wrong here in the guest code. Huge stalls, huge delays while in the hrtimer callback (I think it happens always when such a delay is happening, it's just noticed by the hrtimer code) -- that's the root cause of all this; the (probably) wrong logic in hrtimer calibration just shows the results of something that's wrong elsewhere.

Ah, ok. That makes sense. The hrtimer interrupt hang check detects that the CPU was stolen for whatever reasons.
I'm wondering why this happens several times in a row - it takes at least 4 iterations until it decides to make the interval larger.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote: On Wed, 7 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:

What about getting rid of the retry loop, instead? So something like:

	- run hrtimer callbacks (once)
	- while (tick_program_event(expires))
		expires = ktime_add_ns(expires, dev->min_delta_ns)

This way there's no static tuning involved.

And what does that buy us ? We get a timer interrupt right away, so it's not that much different from the retry loop. See below.

Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out the timer which is going to expire next to reprogram the hardware. Now when we program that expiry time we find out that the timer is already expired. So instead of programming the hardware to fire an interrupt in the very near future which you would do with your loop above we stay in the interrupt handler and expire the timer and any other by now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do this by spreading the timer interrupts out so that the system can do something else than expiring timers.

OK, makes sense. So why not program only the next tick using the heuristic, without touching min_delta_ns?

That makes a certain amount of sense, albeit I really hate that heuristics crap, especially when we know that we run as a guest. We better add a function pointer to the clock event device struct which defaults to force it to be slow for real hardware and can be overridden by paravirt guests.

Also it's not clear to me why the problem only happens with kvm_clock and not with acpi_pm timer emulation (according to the reporter) and is restricted to SMP guests.
 retry:
 	/* 5 retries is enough to notice a hang */
-	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+	if (!(++nr_retries % 5)) {
+		ktime_t try_time = ktime_sub(ktime_get(), now);
+
+		do {
+			for (i = 0; i < 3; i++)
+				expires_next = ktime_add(expires_next, try_time);
+		} while (tick_program_event(expires_next, 0));

This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc, ...) where we can find out how often this happens.

Thanks,

	tglx
Re: [patch 0/3] Patches for KVM RT
On Thu, 9 Apr 2009, Carsten Emde wrote:

Jan, Here are some patches that are necessary to get KVM running with the -rt4 patchset. Thanks a lot. Unfortunately, there is still a last one at kernel/smp.c:288

	/* Can deadlock when called with interrupts disabled */
	WARN_ON_ONCE(irqs_disabled() && !oops_in_progress);

Do we get another fix?

I think I have seen that before. Just remembered that I fixed that with Avi last year. Patch got dropped in the 26->29 move.

Thanks,

	tglx

From: Thomas Gleixner <t...@linutronix.de>
Date: Mon, 14 Jan 2008 14:02:44 +0200
Subject: CFS: enable irqs in fire_sched_in_preempt_notifier

KVM expects the notifier call with irqs enabled. It's necessary due to a possible IPI call. Make the preempt-rt version behave the same way as mainline.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 kernel/sched.c | 9 +
 1 file changed, 9 insertions(+)

Index: linux-2.6.24.7-rt27/kernel/sched.c
===================================================================
--- linux-2.6.24.7-rt27.orig/kernel/sched.c	2009-02-08 00:01:16.0 -0500
+++ linux-2.6.24.7-rt27/kernel/sched.c	2009-02-08 00:01:22.0 -0500
@@ -1821,8 +1821,17 @@ static void fire_sched_in_preempt_notifi
 	struct preempt_notifier *notifier;
 	struct hlist_node *node;

+	if (hlist_empty(&curr->preempt_notifiers))
+		return;
+
+	/*
+	 * The KVM sched in notifier expects to be called with
+	 * interrupts enabled.
+	 */
+	local_irq_enable();
 	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
 		notifier->ops->sched_in(notifier, raw_smp_processor_id());
+	local_irq_disable();
 }

 static void
Re: [patch 01/12] expose ACPI pmtimer to userspace (/dev/pmtimer)
On Wed, 4 Jun 2008, Avi Kivity wrote: Anthony Liguori wrote: Thomas Gleixner wrote:

Can we please keep that code inside of drivers/clocksource/acpi_pm.c without creating a new disconnected file in drivers/char ? Btw, depending on the use case we might as well have a sysfs entry for that.

I think sysfs would actually make a lot of sense for this.

It's read many thousands of times per second. You don't want a read()/sprintf()/atoi() sequence every time.

Eek, according to Andrea it's only used for migration purposes.

Thanks,

	tglx
Re: BUG: using smp_processor_id() in preemptible [00000000] code: pm-suspend/17334
On Tue, 3 Jun 2008, Jiri Kosina wrote:

Ahh yes - you are right, I had completely forgotten about that old post - I thought that my posts usually get fixed sooner :) So yes - this is actually the same bug which is still not fixed in the latest kernel - the machine is running a qemu guest (which also seems somehow slower to me now)

OK, so it looks like KVM could be wrongly enabling IRQs/preemption on the resume path. The original bug report is at http://lkml.org/lkml/2008/4/7/130

sysdev_resume() is supposed to run with interrupts disabled, at least it was that way when the timekeeping_resume code was written.

Thanks,

	tglx
Re: [patch 01/12] expose ACPI pmtimer to userspace (/dev/pmtimer)
On Sun, 1 Jun 2008, Marcelo Tosatti wrote: On Sun, Jun 01, 2008 at 06:34:27PM +0200, Thomas Gleixner wrote:

A sysfs entry sounds fine and much simpler.

Should probably be a generic clocksource interface (so userspace can read any available clocksource) rather than acpi_pm specific.

Agreed.

	return clocksource_acpi_pm.read == acpi_pm_read;

So we don't need reliable_pmtimer at all.

For KVM's use case, we'd rather not allow direct pmtimer access if the host has an unreliable (buggy) chipset.

Well,

	return clocksource_acpi_pm.read == acpi_pm_read;

is supposed to do that, just without an additional variable reliable_pmtimer :)

But then, I doubt any of those older affected chipsets have HW virtualization support, so it shouldn't be an issue.

It's exactly one old crappy chipset, which definitely has no HW virt support and therefore we just can use read_pmtmr() w/o checking for reliable or not.

Thanks,

	tglx