Re: [PATCH] irq_remapping: move structs outside #ifdef
On Fri, 18 Sep 2015, Paolo Bonzini wrote:
> This is friendlier to clients of the code, who are going to prepare
> vcpu_data structs unconditionally, even if CONFIG_IRQ_REMAP is not
> defined.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>

> ---
> Please ack, I'd like to include it in the 4.4 pull request.
>
>  arch/x86/include/asm/irq_remapping.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index 046c7fb1ca43..a210eba2727c 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -33,6 +33,11 @@ enum irq_remap_cap {
>  	IRQ_POSTING_CAP = 0,
>  };
>
> +struct vcpu_data {
> +	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
> +	u32 vector;		/* Guest vector of the interrupt */
> +};
> +
>  #ifdef CONFIG_IRQ_REMAP
>
>  extern bool irq_remapping_cap(enum irq_remap_cap cap);
> @@ -58,11 +63,6 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
>  	return x86_vector_domain;
>  }
>
> -struct vcpu_data {
> -	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
> -	u32 vector;		/* Guest vector of the interrupt */
> -};
> -
>  #else /* CONFIG_IRQ_REMAP */
>
>  static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
> --
> 1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 3/4] irqchip: GIC: Convert to EOImode == 1
On Tue, 25 Aug 2015, Marc Zyngier wrote:
> +static struct static_key supports_deactivate = STATIC_KEY_INIT_TRUE;
> +
>  #ifndef MAX_GIC_NR
>  #define MAX_GIC_NR	1
>  #endif
> @@ -137,6 +140,14 @@ static inline unsigned int gic_irq(struct irq_data *d)
>  	return d->hwirq;
>  }
>
> +static inline bool primary_gic_irq(struct irq_data *d)
> +{
> +	if (MAX_GIC_NR > 1)
> +		return irq_data_get_irq_chip_data(d) == &gic_data[0];
> +
> +	return true;
> +}
> +
>  /*
>   * Routines to acknowledge, disable and enable interrupts
>   */
> @@ -164,7 +175,14 @@ static void gic_unmask_irq(struct irq_data *d)
>  static void gic_eoi_irq(struct irq_data *d)
>  {
> -	writel_relaxed(gic_irq(d), gic_cpu_base(d) + GIC_CPU_EOI);
> +	u32 deact_offset = GIC_CPU_EOI;
> +
> +	if (static_key_true(&supports_deactivate)) {
> +		if (primary_gic_irq(d))
> +			deact_offset = GIC_CPU_DEACTIVATE;

I really wonder for the whole series whether you really want all that
static key dance and extra conditionals in the callbacks instead of
just using separate irq chips for the different interrupts.

Thanks,

	tglx
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Fri, 7 Aug 2015, Peter Zijlstra wrote:
> On Fri, Aug 07, 2015 at 12:57:38PM +0200, Peter Zijlstra wrote:
> > > +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >
> > this one has no users; the __ suggests that it is the locked edition.
>
> Maybe it is for the completions…

Yeah, who knows, I certainly do not anymore ;-)

> On that, we cannot convert completions to swait. Because swait
> wake_all must not happen from IRQ context, and complete_all()
> typically is used from just that.

Well, completions are not the ones which have a gazillion of waiters
and we got quite some performance back from using swait in completions
on RT.

Thanks,

	tglx
Re: [PATCH 4/6] irqchip: GIC: Use chip_data instead of handler_data for cascaded interrupts
On Thu, 9 Jul 2015, Marc Zyngier wrote:
> When used as a primary interrupt controller, the GIC driver uses
> irq_data->chip_data to extract controller-specific information.
>
> When used as a secondary interrupt controller, it uses handler_data
> instead. As this difference is relatively pointless and only creates
> confusion, change the secondary path to match what the rest of the
> driver does.
>
> Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
> ---
>  drivers/irqchip/irq-gic.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
> index e264675..3c7f3a4 100644
> --- a/drivers/irqchip/irq-gic.c
> +++ b/drivers/irqchip/irq-gic.c
> @@ -298,7 +298,7 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
>  static void gic_handle_cascade_irq(unsigned int irq, struct irq_desc *desc)
>  {
> -	struct gic_chip_data *chip_data = irq_get_handler_data(irq);
> +	struct gic_chip_data *chip_data = irq_get_chip_data(irq);
>  	struct irq_chip *chip = irq_get_chip(irq);

You should make that

	chip_data = irq_desc_get_chip_data(desc);
	chip = irq_desc_get_chip(desc);

That avoids two pointless lookups of irq_desc.

Thanks,

	tglx
Re: [PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()
On Sat, 20 Jun 2015, Andy Lutomirski wrote:
> On Sat, Jun 20, 2015 at 7:14 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Sat, 20 Jun 2015, walter harms wrote:
> > > Acked-by: walter harms <wha...@bfs.de>
> > >
> > > Am 17.06.2015 02:35, schrieb Andy Lutomirski:
> > > > This is only used if BAYCOM_DEBUG is defined.
> >
> > So why don't we just replace that by ktime_get() and get rid of the
> > x86'ism in that driver.
>
> I don't have the hardware, and I don't see any good reason to make an
> rdtsc cleanup depend on a more complicated driver change. On the other
> hand, if the maintainers want to clean it up, I think it would be a
> great idea.
>
> This really seems to be debugging code, though. A normal kernel won't
> even compile it.

Right, but there is no reason that we have rdtsc outside of arch/x86
at all.

Thanks,

	tglx
Re: [PATCH v3 08/18] baycom_epp: Replace rdtscl() with native_read_tsc()
On Sat, 20 Jun 2015, walter harms wrote:
> Acked-by: walter harms <wha...@bfs.de>
>
> Am 17.06.2015 02:35, schrieb Andy Lutomirski:
> > This is only used if BAYCOM_DEBUG is defined.

So why don't we just replace that by ktime_get() and get rid of the
x86'ism in that driver.

Thanks,

	tglx
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On Tue, 16 Jun 2015, Juergen Gross wrote:
> AFAIK there are no outstanding questions for more than one month now.
> I'd appreciate some feedback or accepting these patches.

They are against dead code, which will be gone soon. We switched over
to queued locks.

Thanks,

	tglx
Re: [PATCH v13 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM
On Wed, 29 Oct 2014, Waiman Long wrote:
>                 AIM7 XFS Disk Test (no overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   2542373   7.08        98.95      5.44
>   PV qspinlock    2549575   7.06        98.63      5.40
>   unfairlock      2616279   6.91        97.05      5.42
>
>                 AIM7 XFS Disk Test (200% overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   644468    27.93       415.22     6.33
>   PV qspinlock    645624    27.88       419.84     0.39

That number is made up by what?

>   unfairlock      695518    25.88       377.40     4.09
>
>                 AIM7 EXT4 Disk Test (no overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   1995565   9.02        103.67     5.76
>   PV qspinlock    2011173   8.95        102.15     5.40
>   unfairlock      2066590   8.71        98.13      5.46
>
>                 AIM7 EXT4 Disk Test (200% overcommit)
>   kernel          JPM       Real Time   Sys Time   Usr Time
>   -----           ---       ---------   --------   --------
>   PV ticketlock   478341    37.63       495.81     30.78
>   PV qspinlock    474058    37.97       475.74     30.95
>   unfairlock      560224    32.13       398.43     26.27
>
> For the AIM7 disk workload, both PV ticketlock and qspinlock have
> about the same performance. The unfairlock performs slightly better
> than the PV lock.

Slightly?

Taking the PV locks, which are basically the same for the existing
ticket locks and your new fangled qlocks, as a reference, then the so
called 'unfair locks', which are just the native locks w/o the PV
nonsense, are fundamentally better, up to a whopping 18% in the
ext4/200% overcommit case. See below.

>                 EBIZZY-m Test (no overcommit)
>   kernel          Rec/s   Real Time   Sys Time   Usr Time
>   -----           -----   ---------   --------   --------
>   PV ticketlock   3255    10.00       60.65      3.62
>   PV qspinlock    3318    10.00       54.27      3.60
>   unfairlock      2833    10.00       26.66      3.09
>
>                 EBIZZY-m Test (200% overcommit)
>   kernel          Rec/s   Real Time   Sys Time   Usr Time
>   -----           -----   ---------   --------   --------
>   PV ticketlock   841     10.00       71.03      2.37
>   PV qspinlock    834     10.00       68.27      2.39
>   unfairlock      865     10.00       27.08      1.51
>
>                 futextest (no overcommit)
>   kernel          kops/s
>   -----           ------
>   PV ticketlock   11523
>   PV qspinlock    12328
>   unfairlock      9478
>
>                 futextest (200% overcommit)
>   kernel          kops/s
>   -----           ------
>   PV ticketlock   7276
>   PV qspinlock    7095
>   unfairlock      5614
>
> The ebizzy and futextest have much higher spinlock contention than
> the AIM7 disk workload. In this case, the unfairlock performs worse
> than both the PV ticketlock and qspinlock. The performance of the
> 2 PV locks are comparable.

While I can see that the PV lock stuff performs 13% better for the
ebizzy no overcommit case, what about the very interesting numbers for
the same test with 200% overcommit?

The regular lock has a slightly better performance, but significantly
less sys/usr time. How do you explain that?

'Lies, damned lies and statistics' comes to my mind.

Thanks,

	tglx
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, 25 Nov 2014, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> > The problem:
> >
> > On -RT, an emulated LAPIC timer instance has the following path:
> >
> > 1) hard interrupt
> > 2) ksoftirqd is scheduled
> > 3) ksoftirqd wakes up vcpu thread
> > 4) vcpu thread is scheduled
> >
> > This extra context switch introduces unnecessary latency in the
> > LAPIC path for a KVM guest.
> >
> > The solution:
> >
> > Allow waking up vcpu thread from hardirq context, thus avoiding
> > the need for ksoftirqd to be scheduled.
> >
> > Normal waitqueues make use of spinlocks, which on -RT are sleepable
> > locks. Therefore, waking up a waitqueue waiter involves locking a
> > sleeping lock, which is not allowed from hard interrupt context.
>
> What are the consequences for the realtime kernel of making waitqueue
> traversal non-preemptible?
>
> Is waitqueue traversal something that ever takes so long we need to
> care about it being non-preemptible?

__wake_up
    lock_irq_save()
    __wake_up_common()
        for_each_entry(curr)
            curr->func()
    unlock_irq_save()

There are two issues here:

1) Long waiter chains. We probably could work around that in some
   creative way.

2) The non default callbacks. Default is default_wake_function. The
   non default callbacks, which are used by nfsd and other stuff, are
   calling the world and some more and take preemptible locks and do
   other fancy stuff which falls flat on its nose when called under a
   raw lock on RT.

So I created the swait replacement which has a smaller memory
footprint, less featuritis and a raw lock because it only wakes up,
no custom callbacks. They still suffer from #1, but most use cases on
RT are single waiter scenarios anyway.

Hope that helps.

Thanks,

	tglx
Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
On Tue, 25 Nov 2014, Rik van Riel wrote:
> On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
> > Since lapic timer handler only wakes up a simple waitqueue, it can
> > be executed from hardirq context. Reduces average cyclictest
> > latency by 3us.
>
> Can this patch be merged in the KVM tree, and go upstream via Paolo?

Not really, as it has RT dependencies.

Thanks,

	tglx
Re: [PATCH] KVM: x86: export get_xsave_addr
On Mon, 24 Nov 2014, Paolo Bonzini wrote:
> get_xsave_addr is the API to access XSAVE states, and KVM would like
> to use it. Export it.
>
> Cc: x...@kernel.org
> Cc: H. Peter Anvin <h...@linux.intel.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
> Peter, can you please ACK this for inclusion in the KVM tree?

Are you content with my acked-by as well?

Acked-by: Thomas Gleixner <t...@linutronix.de>
Re: several messages
On Mon, 10 Nov 2014, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.

Can you please talk to Jiang and synchronize your work with his
refactoring of the x86 interrupt handling subsystem. I want this stuff
cleaned up first before we add new stuff to it.

Thanks,

	tglx
Re: [PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On Mon, 22 Sep 2014, Paolo Bonzini wrote:
> On x86_64, kernel text mappings are mapped read-only with
> CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL
> instructions to VMMCALL as required on AMD processors.
>
> The failure mode is currently a divide-by-zero exception, which
> obviously is a KVM bug that has to be fixed. However, picking the
> right instruction between VMCALL and VMMCALL will be faster and will
> help if you cannot upgrade the hypervisor.
>
> -/* This instruction is vmcall.  On non-VT architectures, it will generate a
> - * trap that we will then rewrite to the appropriate instruction.
> +#ifdef CONFIG_DEBUG_RODATA
> +#define KVM_HYPERCALL \
> +        ALTERNATIVE(".byte 0x0f,0x01,0xc1", ".byte 0x0f,0x01,0xd9", X86_FEATURE_VMMCALL)

If we can do it via a feature bit and alternatives, then why do you
want to patch it manually if CONFIG_DEBUG_RODATA=n? Just because more
#ifdeffery makes the code more readable?

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Il 04/09/2014 22:58, Thomas Gleixner ha scritto:
> > This is simply wrong.
>
> It is.
>
> > Now I have no idea why you think it needs to add xtime_sec.
> >
> > If the result is wrong, then we need to figure out which one of the
> > supplied values is wrong and not blindly add xtime_sec just because
> > that makes it magically correct.
> >
> > Can you please provide a proper background why you think that adding
> > xtime_sec is a good idea?
>
> It's not a good idea indeed. I didn't fully digest the 3.16->3.17
> timekeeping changes and messed up this patch. However, there is a bug
> in the "base_mono + offs_boot" formula, given that:
>
> - bisection leads to the merge commit of John's timers branch
>
> - bisecting within John's timers branch, with a KVM commit on top to
>   make the code much easier to trigger, leads to commit cbcf2dd3b3d4
>   ("x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based",
>   2014-07-16).
>
> - I backported your patch to 3.16, using wall_to_monotonic +
>   total_sleep_time + xtime_sec (wtm+xtime_sec as in pre-cbcf2dd3b3d4
>   code, total_sleep_time from 3.16 monotonic_to_bootbased) and it
>   works
>
> - In v2 of the patch I fixed the bug by changing the formula
>   "base_mono + offs_boot" to "offs_boot - offs_real" (and then adding
>   xtime_sec separately as in the 3.16 backport), but the two formulas
>   "base_mono + offs_boot" and "offs_boot - offs_real + xtime_sec"
>   ought to be identical.
So lets look at the differences here:

	3.16					3.17

inject_sleeptime(delta)			inject_sleeptime(delta)
    xtime += delta;			    xtime += delta;
    wall_to_mono -= delta;		    wall_to_mono -= delta;
    offs_real = -wall_to_mono;		    offs_real = -wall_to_mono;
    sleeptime += delta;			    offs_boot += delta;

getboottime()
    tmp = wall_to_mono + sleeptime;
    boottime = -tmp;

So:
    boottime = -wall_to_mono - sleeptime;

Because of the above:
    offs_real = -wall_to_mono;
    offs_boot = sleeptime;

The result is:
    boottime = offs_real - offs_boot;	    boottime = offs_real - offs_boot;

monotonic_to_bootbased(mono)		monotonic_to_bootbased(mono)
    return mono + sleeptime;

Because of the above:
    offs_boot = sleeptime;

The result is:
    return mono + offs_boot;		    return mono + offs_boot;

Now on the KVM side we have:

update_pvclock_gtod()			update_pvclock_gtod()
    mono_base = xtime + wall_to_mono;	    boot_base = mono_base + offs_boot;

and

do_monotonic()				do_monotonic_boot()
    mono_now = mono_base + delta_ns;	    boot_now = boot_base + delta_ns;

kvm_get_time_and_clockread()
    mono_now = do_monotonic()
    boot_now = mono_to_boot(mono_now);

So that means on both sides the same:

    boot_now = mono_base + offs_boot + delta_ns;

So that means the code is correct. Now where is the bug?

Well hidden and still so obvious that it's even visible through the
brown paperbag I'm wearing ...
Thanks,

	tglx

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fb4a9c2cf8d9..ec1791fae965 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -442,11 +442,12 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
-	update_vsyscall(tk);
-	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
 	tk_update_ktime_data(tk);
 
+	update_vsyscall(tk);
+	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
+
 	if (action & TK_MIRROR)
 		memcpy(&shadow_timekeeper, &tk_core.timekeeper,
 		       sizeof(tk_core.timekeeper));
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 17:14, Thomas Gleixner ha scritto:
> > So that means the code is correct. Now where is the bug?
>
> In kernel/time/timekeeping.c? We know that we should have
>
>	base_mono = wall_to_monotonic + xtime_sec
>
> Instead it is
>
>	base_mono = wall_to_monotonic + xtime_sec - seconds from boot time
>
> which is... zero. Given this is the only use of base_mono in a
> notifier, I wonder if it is as simple as this (which I don't have
> time to test right now):
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index fb4a9c2cf8d9..f6807a85b8c9 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -443,9 +443,9 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
>  		ntp_clear();
>  	}
>  	update_vsyscall(tk);
> -	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
>
>  	tk_update_ktime_data(tk);
>
> +	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

Why are you moving the update between vsyscall and pvclock update as I
did in my patch? We really need to update everything before calling
somewhere.

And yes, it is that simple. I instrumented the stuff and it's correct
now.

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 20:33, Thomas Gleixner ha scritto:
> > >  	update_vsyscall(tk);
> > > -	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
> > >
> > >  	tk_update_ktime_data(tk);
> > >
> > > +	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
> >
> > Why are you moving the update between vsyscall and pvclock update
> > as I did in my patch? We really need to update everything before
> > calling somewhere.
>
> Do you mean the call should be moved not just after
> tk_update_ktime_data (which sets base_mono), but further down after
> update_fast_timekeeper(tk)?

No, it needs to be above update_vsyscall(). Here is the patch again
which I sent before. [https://lkml.org/lkml/2014/9/5/395]

Thanks,

	tglx

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index fb4a9c2cf8d9..ec1791fae965 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -442,11 +442,12 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
-	update_vsyscall(tk);
-	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
 
 	tk_update_ktime_data(tk);
 
+	update_vsyscall(tk);
+	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
+
 	if (action & TK_MIRROR)
 		memcpy(&shadow_timekeeper, &tk_core.timekeeper,
 		       sizeof(tk_core.timekeeper));
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Linus Torvalds wrote:
> On Fri, Sep 5, 2014 at 3:16 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> >   are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
>
> Nothing new there. Forgot to push out, or perhaps to use "-f" to
> overwrite the previous tag of the same name?

And even if there would be something, please do not pull the top most
commit b11ba8c62be3eb ("KVM: x86: fix kvmclock breakage from timers
branch merge"). That one is blatantly wrong and just hacks badly
around a brown paperbag bug in the core timekeeping code, which I
introduced in the last overhaul.

It's debugged and understood. Fix is posted and if confirmed by the
KVM folks it will go to you before rc4.

Thanks,

	tglx
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 22:58, Thomas Gleixner ha scritto:
> > > Nothing new there. Forgot to push out, or perhaps to use "-f" to
> > > overwrite the previous tag of the same name?
>
> It's there now. Probably a --dry-run too much (I have
> push = +refs/tags/for-linus:refs/tags/for-linus in the remote
> configuration).
>
> > And even if there would be something, please do not pull the top
> > most commit b11ba8c62be3eb ("KVM: x86: fix kvmclock breakage from
> > timers branch merge"). That one is blatantly wrong and just hacks
> > badly around a brown paperbag bug in the core timekeeping code,
> > which I introduced in the last overhaul.
>
> It's not wrong, it's just different. The commit message says clearly

Right, it's different. Because you paper at the receiving end over a
core bug, and that's wrong to begin with.

> that besides acting as a workaround, I find the patched code easier to
> understand, and I clearly stated the same in the tag message.

Well, we might have different opinions about "easier to understand".

I did go to great lengths to disentangle the monotonic boot time, which
you are interested in, from xtime, because the latter does not make any
sense outside of the core timekeeping code. Aside of that I optimized
the whole thing to avoid conversions, loops and hoops.

So you just add another multiply and add to make it more
understandable. Sigh.

Thanks,

	tglx
Re: [GIT PULL] KVM changes for 3.17-rc4
On Fri, 5 Sep 2014, Paolo Bonzini wrote:
> Il 05/09/2014 23:24, Thomas Gleixner ha scritto:
> > > that besides acting as a workaround, I find the patched code
> > > easier to understand, and I clearly stated the same in the tag
> > > message.
> >
> > Well, we might have different opinions about "easier to understand".
> >
> > I did go to great lengths to disentangle the monotonic boot time,
> > which you are interested in, from xtime, because the latter does not
> > make any sense outside of the core timekeeping code. Aside of that I
> > optimized the whole thing to avoid conversions, loops and hoops.
> >
> > So you just add another multiply and add to make it more
> > understandable.
>
> Fair enough, I've dropped the patch. Though I agree that the struct
> member name choice is not the most brilliant one, so feel free to fix
> that. Thanks for helping out with the core timekeeping fix.

I'm still banging my head against the wall ...
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Commit cbcf2dd3b3d4 ("x86: kvm: Make kvm_get_time_and_clockread()
> nanoseconds based", 2014-07-16) forgot to add tk->xtime_sec, thus
> breaking kvmclock on

Errm. How is boottime related to xtime_sec?

> hosts that have a reliable TSC. Add it back; and since the field
> boot_ns is not anymore related to the host boot-based clock, rename
> boot_ns->nsec_base and the existing nsec_base->snsec_base.

This is simply wrong. The original code before that change did:

	vdata->monotonic_time_sec = tk->xtime_sec
				  + tk->wall_to_monotonic.tv_sec;
	vdata->monotonic_time_snsec = tk->xtime_nsec
				  + (tk->wall_to_monotonic.tv_nsec
				     << tk->shift);

So this is the momentary monotonic base time. And the readout function
did:

	ts->tv_nsec = 0;
	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ts->tv_sec = gtod->monotonic_time_sec;
		ns = gtod->monotonic_time_snsec;
		ns += vgettsc(cycle_now);
		ns >>= gtod->clock.shift;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	timespec_add_ns(ts, ns);

So this does:

	now = monotonic_base + delta_nsec

And the caller converted it to boot time with:

	monotonic_to_bootbased(ts);

So the time calculation does:

	now = monotonic_base + delta_nsec + mono_to_boot

Because:

	monotonic_base + mono_to_boot = boot_time_base

the calculation can be written as:

	now = boot_time_base + delta_nsec

The new code does:

	boot_ns = ktime_to_ns(ktime_add(tk->base_mono, tk->offs_boot));

So that's:

	boot_time_base = monotonic_base + mono_to_boot;

	vdata->boot_ns		= boot_ns;
	vdata->nsec_base	= tk->tkr.xtime_nsec;

And the readout does:

	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ns = gtod->nsec_base;
		ns += vgettsc(cycle_now);
		ns >>= gtod->clock.shift;
		ns += gtod->boot_ns;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	*t = ns;

Which is:

	boot_time_base + delta_nsec

Now I have no idea why you think it needs to add xtime_sec.

If the result is wrong, then we need to figure out which one of the
supplied values is wrong, and not blindly add xtime_sec just because
that makes it magically correct.

Can you please provide a proper background why you think that adding
xtime_sec is a good idea?

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Commit cbcf2dd3b3d4 ("x86: kvm: Make kvm_get_time_and_clockread()
> nanoseconds based", 2014-07-16) used the wrong formula for boot_ns,
> thus breaking kvmclock on hosts that have a reliable TSC.
>
> To find the right formula, let's first backport the switch to
> nanoseconds to 3.16-era timekeeping logic. The full patch (which
> works) is at https://lkml.org/lkml/2014/9/4/462. The key line here is
>
>	boot_ns = timespec_to_ns(tk->total_sleep_time) +
>		  timespec_to_ns(tk->wall_to_monotonic) +
>		  tk->xtime_sec * (u64)NSEC_PER_SEC;
>
> Because the above patch works, the conclusion is that the above
> formula is not the same as commit cbcf2dd3b3d4's
>
>	boot_ns = ktime_to_ns(ktime_add(tk->tkr.base_mono, tk->offs_boot));
>
> As to what is the right one, commit 02cba1598a2a ("timekeeping:
> Simplify getboottime()", 2014-07-16) provides a hint:
>
>	offs_real = -wall_to_monotonic
>	offs_boot = total_sleep_time
>	offs_real - offs_boot = -wall_to_monotonic - total_sleep_time
>
> that is
>
>	offs_boot - offs_real = wall_to_monotonic + total_sleep_time
>
> which is what this patch uses, adding xtime_sec separately.
>
> The boot_ns moniker is not too clear, so rename boot_ns to nsec_base
> and the existing nsec_base to snsec_base.
>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: John Stultz <john.stu...@linaro.org>
> Reported-by: Chris J Arges <chris.j.ar...@canonical.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
> Thomas/John, the problem with the above explanation is that
> tk_update_ktime_data has base_mono = xtime_sec + wtm, and from there
> base_mono + offs_boot = xtime_sec + wtm + total_sleep_time. Except
> that doesn't work, so something must be wrong in
> tk_update_ktime_data's comment.

Right. I'm staring into it and we need to fix the core code issue and
not the usage site.

Thanks,

	tglx
Re: [PATCH] KVM: x86: fix kvmclock breakage from timers branch merge
On Thu, 4 Sep 2014, Paolo Bonzini wrote:
> Il 04/09/2014 22:58, Thomas Gleixner ha scritto:
> > This is simply wrong.
>
> It is.
>
> > Now I have no idea why you think it needs to add xtime_sec.
> >
> > If the result is wrong, then we need to figure out which one of the
> > supplied values is wrong and not blindly add xtime_sec just because
> > that makes it magically correct.
> >
> > Can you please provide a proper background why you think that adding
> > xtime_sec is a good idea?
>
> It's not a good idea indeed. I didn't fully digest the 3.16->3.17
> timekeeping changes and messed up this patch. However, there is a bug
> in the "base_mono + offs_boot" formula, given that:
>
> - bisection leads to the merge commit of John's timers branch
>
> - bisecting within John's timers branch, with a KVM commit on top to
>   make the code much easier to trigger, leads to commit cbcf2dd3b3d4
>   ("x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based",
>   2014-07-16).
>
> - I backported your patch to 3.16, using wall_to_monotonic +
>   total_sleep_time + xtime_sec (wtm+xtime_sec as in pre-cbcf2dd3b3d4
>   code, total_sleep_time from 3.16 monotonic_to_bootbased) and it
>   works
>
> - In v2 of the patch I fixed the bug by changing the formula
>   "base_mono + offs_boot" to "offs_boot - offs_real" (and then adding
>   xtime_sec separately as in the 3.16 backport), but the two formulas
>   "base_mono + offs_boot" and "offs_boot - offs_real + xtime_sec"
>   ought to be identical.
>
> I find "offs_boot - offs_real + xtime" more readable than the
> alternative "base_mono + offs_boot + xtime_nsec", so the fix doubles
> as a cleanup for me and I'm fine with it. But something must be wrong
> in the timekeeping code.

I think I have a vague idea what happened, but I'm way too tired now
to write it up fully. I'll do that tomorrow morning with brain awake.

Thanks,

	tglx
[patch V2 43/64] x86: kvm: Use ktime_get_boot_ns()
Use the new nanoseconds based interface and get rid of the timespec
conversion dance.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Gleb Natapov <g...@kernel.org>
Cc: kvm@vger.kernel.org
---
 arch/x86/kvm/x86.c |    6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

Index: tip/arch/x86/kvm/x86.c
===================================================================
--- tip.orig/arch/x86/kvm/x86.c
+++ tip/arch/x86/kvm/x86.c
@@ -1109,11 +1109,7 @@ static void kvm_get_time_scale(uint32_t
 
 static inline u64 get_kernel_ns(void)
 {
-	struct timespec ts;
-
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
-	return timespec_to_ns(&ts);
+	return ktime_get_boot_ns();
 }
 
 #ifdef CONFIG_X86_64
[patch V2 44/64] x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based
Convert the relevant base data right away to nanoseconds instead of
doing the conversion on every readout. Reduces text size by 160 bytes.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Gleb Natapov <g...@kernel.org>
Cc: kvm@vger.kernel.org
---
 arch/x86/kvm/x86.c | 44 ++++++++++++++------------------------------
 1 file changed, 14 insertions(+), 30 deletions(-)

Index: tip/arch/x86/kvm/x86.c
===================================================================
--- tip.orig/arch/x86/kvm/x86.c
+++ tip/arch/x86/kvm/x86.c
@@ -984,9 +984,8 @@ struct pvclock_gtod_data {
 		u32	shift;
 	} clock;
 
-	/* open coded 'struct timespec' */
-	u64		monotonic_time_snsec;
-	time_t		monotonic_time_sec;
+	u64		boot_ns;
+	u64		nsec_base;
 };
 
 static struct pvclock_gtod_data pvclock_gtod_data;
@@ -994,6 +993,9 @@ static struct pvclock_gtod_data pvclock_
 static void update_pvclock_gtod(struct timekeeper *tk)
 {
 	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+	u64 boot_ns;
+
+	boot_ns = ktime_to_ns(ktime_add(tk->base_mono, tk->offs_boot));
 
 	write_seqcount_begin(&vdata->seq);
 
@@ -1004,17 +1006,8 @@ static void update_pvclock_gtod(struct t
 	vdata->clock.mult		= tk->mult;
 	vdata->clock.shift		= tk->shift;
 
-	vdata->monotonic_time_sec	= tk->xtime_sec
-					+ tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_snsec	= tk->xtime_nsec
-					+ (tk->wall_to_monotonic.tv_nsec
-						<< tk->shift);
-	while (vdata->monotonic_time_snsec >=
-					(((u64)NSEC_PER_SEC) << tk->shift)) {
-		vdata->monotonic_time_snsec -=
-					((u64)NSEC_PER_SEC) << tk->shift;
-		vdata->monotonic_time_sec++;
-	}
+	vdata->boot_ns			= boot_ns;
+	vdata->nsec_base		= tk->xtime_nsec;
 
 	write_seqcount_end(&vdata->seq);
 }
@@ -1371,23 +1364,22 @@ static inline u64 vgettsc(cycle_t *cycle
 	return v * gtod->clock.mult;
 }
 
-static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+static int do_monotonic_boot(s64 *t, cycle_t *cycle_now)
 {
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
-	u64 ns;
 	int mode;
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	u64 ns;
 
-	ts->tv_nsec = 0;
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		mode = gtod->clock.vclock_mode;
-		ts->tv_sec = gtod->monotonic_time_sec;
-		ns = gtod->monotonic_time_snsec;
+		ns = gtod->nsec_base;
 		ns += vgettsc(cycle_now);
 		ns >>= gtod->clock.shift;
+		ns += gtod->boot_ns;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-	timespec_add_ns(ts, ns);
+	*t = ns;
 
 	return mode;
 }
@@ -1395,19 +1387,11 @@ static int do_monotonic(struct timespec
 /* returns true if host is using tsc clocksource */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
 {
-	struct timespec ts;
-
 	/* checked again under seqlock below */
 	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
 		return false;
 
-	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
-		return false;
-
-	monotonic_to_bootbased(&ts);
-	*kernel_ns = timespec_to_ns(&ts);
-
-	return true;
+	return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
 }
 
 #endif
Re: [RFC] Implement Batched (group) ticket lock
On Wed, 28 May 2014, Linus Torvalds wrote:
> If somebody has a P4 still, that's likely the worst case by far.

I do, but I'm only using it during winter and only if the ia64 machine
does not provide sufficient heating. So you have to wait at least half
a year until I'm able to test it.
Re: Preemptable Ticket Spinlock
On Thu, 30 May 2013, Raghavendra K T wrote:
> Here is the branch with pvspinlock V9 version in github, rebased to
> 3.10-rc:
>
>   https://github.com/ktraghavendra/linux/tree/pvspinlock_v9
>
> planning to post a formal email in a separate thread with a link to
> this branch (instead of spamming with 19 patches)

19 patches is not really spam if you compare it to the total number of
mails per day on LKML.

The git tree is nice for people who want to test stuff easily, but if
you want people to review and comment on patches, then please use mail.

Thanks,

	tglx
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Sat, 14 Jul 2012, Jan Kiszka wrote:
> On 2012-07-14 04:25, Thomas Gleixner wrote:
> This patch here is a workaround to unbreak devices assignment in 3.5
> after the IRQ layer changes without regressing noticeable /wrt
> overhead.

Yeah, "workaround" and "regression" are the proper marketing buzzwords
to excuse mindless hackery.

It took me a minute to figure out that there is no reason at all to
use a threaded interrupt handler for MSI and MSIX.

If you folks are so concerned about performance and overhead then
someone familiar with that code should have done the obvious change
long before the oneshot modifications took place.

I can't be bothered to do a performance test myself, but I bet it's
making an order of magnitude more of a difference than the oh so
noticeable few cycles in irq_finalize_oneshot().

Thanks,

	tglx

Index: linux-2.6/virt/kvm/assigned-dev.c
===================================================================
--- linux-2.6.orig/virt/kvm/assigned-dev.c
+++ linux-2.6/virt/kvm/assigned-dev.c
@@ -105,7 +105,7 @@ static irqreturn_t kvm_assigned_dev_thre
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msi_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 
@@ -117,7 +117,7 @@ static irqreturn_t kvm_assigned_dev_thre
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msix_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 	int index = find_index_from_host_irq(assigned_dev, irq);
@@ -334,11 +334,6 @@ static int assigned_device_enable_host_i
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msi(struct kvm *kvm,
 					   struct kvm_assigned_dev_kernel *dev)
 {
@@ -351,9 +346,8 @@ static int assigned_device_enable_host_m
 	}
 
 	dev->host_irq = dev->dev->irq;
-	if (request_threaded_irq(dev->host_irq, kvm_assigned_dev_msi,
-				 kvm_assigned_dev_thread_msi, 0,
-				 dev->irq_name, dev)) {
+	if (request_irq(dev->host_irq, kvm_assigned_dev_msi_handler, 0,
+			dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -363,11 +357,6 @@ static int assigned_device_enable_host_m
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msix(struct kvm *kvm,
 					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -383,10 +372,9 @@ static int assigned_device_enable_host_m
 		return r;
 
 	for (i = 0; i < dev->entries_nr; i++) {
-		r = request_threaded_irq(dev->host_msix_entries[i].vector,
-					 kvm_assigned_dev_msix,
-					 kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, dev);
+		r = request_irq(dev->host_msix_entries[i].vector,
+				kvm_assigned_dev_msix_handler,
+				0, dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Sat, 14 Jul 2012, Jan Kiszka wrote:
> On 2012-07-14 13:16, Thomas Gleixner wrote:
> > On Sat, 14 Jul 2012, Jan Kiszka wrote:
> > > On 2012-07-14 04:25, Thomas Gleixner wrote:
> > > This patch here is a workaround to unbreak devices assignment in
> > > 3.5 after the IRQ layer changes without regressing noticeable
> > > /wrt overhead.
> >
> > Yeah, "workaround" and "regression" are the proper marketing
> > buzzwords to excuse mindless hackery.
> >
> > It took me a minute to figure out that there is no reason at all to
> > use a threaded interrupt handler for MSI and MSIX.
>
> Thomas, we also explained to you in the cited thread that your simple
> approach for this doesn't work as is. We will have a proper solution
> soon, but it takes a bit more than a minute - at least us.

And I explained to you in that very thread that the proper solution to
avoid the overhead of finalize_oneshot is exactly the patch I sent to
Linus yesterday:

  "The only way we can avoid that, is that we get a hint from the
   underlying irq chip / handler setup with an extra flag to tell the
   core, that it's safe to avoid the ONESHOT/finalize magic."

So now it took a full month of ignorance to come up with the
mindboggling solution of working around the core change with a private
hack instead of sitting down and doing what was said to be the correct
solution.

And that's what seriously annoys me. Instead of doing it yourself or at
least politely poking me to get it done, stuff just gets hacked into
submission and sold as the performance regression saviour.

Of course you are free to ignore my advice, but that does not mean that
I take bullshit from you.

Thanks,

	tglx
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Linus Torvalds wrote:
> On Fri, Jul 13, 2012 at 8:45 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> At the same time, I do wonder if maybe MSI + IRQF_ONESHOT couldn't be
> improved. The fact that the KVM people think that the extra overhead
> of IRQF_ONESHOT is a bad thing for MSI interrupts makes me wonder if
> maybe this wouldn't be an area the irq layer couldn't be improved on.
>
> Maybe the MSI+IRQF_ONESHOT case could be improved. Because MSI is
> kind of fundamentally one-shot, since it's a message-based irq
> scheme. So maybe the extra overhead is unnecessary in general, not
> just in this particular KVM case. Hmm? Thomas, see the commentary of
> a76beb14123a ("KVM: Fix device assignment threaded irq handler").

Groan. We already discussed to let the irq chip (in this case MSI)
tell the core that it does not need the extra oneshot handling. That
way the code which requests a threaded irq with the NULL primary
handler works on both MSI and normal interrupts. Untested patch below.

Thanks,

	tglx

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3109,6 +3109,7 @@ static struct irq_chip msi_chip = {
 	.irq_set_affinity	= msi_set_affinity,
 #endif
 	.irq_retrigger		= ioapic_retrigger_irq,
+	.flags			= IRQCHIP_ONESHOT_SAFE,
 };
 
 static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -351,6 +351,7 @@ enum {
 	IRQCHIP_MASK_ON_SUSPEND		= (1 << 2),
 	IRQCHIP_ONOFFLINE_ENABLED	= (1 << 3),
 	IRQCHIP_SKIP_SET_WAKE		= (1 << 4),
+	IRQCHIP_ONESHOT_SAFE		= (1 << 5),
 };
 
 /* This include will go away once we isolated irq_desc usage to core code */
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -1004,35 +1004,48 @@ __setup_irq(unsigned int irq, struct irq
 	 */
 	if (new->flags & IRQF_ONESHOT) {
 		/*
-		 * Unlikely to have 32 resp 64 irqs sharing one line,
-		 * but who knows.
+		 * Drivers are often written to work w/o knowledge
+		 * about the underlying irq chip implementation, so a
+		 * request for a threaded irq without a primary hard
+		 * irq context handler requires the ONESHOT flag to be
+		 * set. Some irq chips like MSI based interrupts are
+		 * per se one shot safe. Check the chip flags, so we
+		 * can avoid the unmask dance at the end of the
+		 * threaded handler for those.
 		 */
-		if (thread_mask == ~0UL) {
-			ret = -EBUSY;
-			goto out_mask;
+		if (desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE) {
+			new->flags &= ~IRQF_ONESHOT;
+		} else {
+			/*
+			 * Unlikely to have 32 resp 64 irqs sharing one line,
+			 * but who knows.
+			 */
+			if (thread_mask == ~0UL) {
+				ret = -EBUSY;
+				goto out_mask;
+			}
+			/*
+			 * The thread_mask for the action is or'ed to
+			 * desc->thread_active to indicate that the
+			 * IRQF_ONESHOT thread handler has been woken, but not
+			 * yet finished. The bit is cleared when a thread
+			 * completes. When all threads of a shared interrupt
+			 * line have completed desc->threads_active becomes
+			 * zero and the interrupt line is unmasked. See
+			 * handle.c:irq_wake_thread() for further information.
+			 *
+			 * If no thread is woken by primary (hard irq context)
+			 * interrupt handlers, then desc->threads_active is
+			 * also checked for zero to unmask the irq line in the
+			 * affected hard irq flow handlers
+			 * (handle_[fasteoi|level]_irq).
+			 *
+			 * The new action gets the first zero bit of
+			 * thread_mask assigned. See the loop above which or's
+			 * all existing action->thread_mask bits.
+			 */
+			new->thread_mask = 1 << ffz(thread_mask);
 		}
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Linus Torvalds wrote:
> On Fri, Jul 13, 2012 at 11:28 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > We already discussed to let the irq chip (in this case MSI) tell
> > the core that it does not need the extra oneshot handling. That way
> > the code which requests a threaded irq with the NULL primary
> > handler works on both MSI and normal interrupts.
>
> So I don't think your patch is quite right. If you want to clear the
> IRQF_ONESHOT for MSI irq's (and other ones where the interrupt
> controller is fundamentally ONESHOT), I think you should do it a few
> lines higher up - *before* you check the "does the IRQF_ONESHOT mask
> match other shared interrupts?"
>
> Now, irq sharing presumably doesn't happen with MSI, but there's
> nothing fundamentally wrong with message-based irq schemes that have
> shared interrupt handlers. I think. Hmm?

Shared irqs are not supported by MSI, but yes, the check should be done
way up. Makes it less ugly as well :)

Thanks,

	tglx

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3109,6 +3109,7 @@ static struct irq_chip msi_chip = {
 	.irq_set_affinity	= msi_set_affinity,
 #endif
 	.irq_retrigger		= ioapic_retrigger_irq,
+	.flags			= IRQCHIP_ONESHOT_SAFE,
 };
 
 static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -351,6 +351,7 @@ enum {
 	IRQCHIP_MASK_ON_SUSPEND		= (1 << 2),
 	IRQCHIP_ONOFFLINE_ENABLED	= (1 << 3),
 	IRQCHIP_SKIP_SET_WAKE		= (1 << 4),
+	IRQCHIP_ONESHOT_SAFE		= (1 << 5),
 };
 
 /* This include will go away once we isolated irq_desc usage to core code */
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -960,6 +960,18 @@ __setup_irq(unsigned int irq, struct irq
 	}
 
 	/*
+	 * Drivers are often written to work w/o knowledge about the
+	 * underlying irq chip implementation, so a request for a
+	 * threaded irq without a primary hard irq context handler
+	 * requires the ONESHOT flag to be set. Some irq chips like
+	 * MSI based interrupts are per se one shot safe. Check the
+	 * chip flags, so we can avoid the unmask dance at the end of
+	 * the threaded handler for those.
+	 */
+	if (desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)
+		new->flags &= ~IRQF_ONESHOT;
+
+	/*
 	 * The following block of code has to be executed atomically
 	 */
 	raw_spin_lock_irqsave(&desc->lock, flags);
@@ -1033,7 +1045,8 @@ __setup_irq(unsigned int irq, struct irq
 		 */
 		new->thread_mask = 1 << ffz(thread_mask);
 
-	} else if (new->handler == irq_default_primary_handler) {
+	} else if (new->handler == irq_default_primary_handler &&
+		   !(desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)) {
 		/*
 		 * The interrupt was requested with handler = NULL, so
 		 * we use the default primary handler for it. But it
Re: [GIT PULL] KVM fixes for 3.5-rc6
On Fri, 13 Jul 2012, Thomas Gleixner wrote:
> On Fri, 13 Jul 2012, Linus Torvalds wrote:
> > On Fri, Jul 13, 2012 at 8:45 AM, Linus Torvalds
> > <torva...@linux-foundation.org> wrote:
> > At the same time, I do wonder if maybe MSI + IRQF_ONESHOT couldn't
> > be improved. The fact that the KVM people think that the extra
> > overhead of IRQF_ONESHOT is a bad thing for MSI interrupts makes me
> > wonder if maybe this wouldn't be an area the irq layer couldn't be
> > improved on.
> >
> > Maybe the MSI+IRQF_ONESHOT case could be improved. Because MSI is
> > kind of fundamentally one-shot, since it's a message-based irq
> > scheme. So maybe the extra overhead is unnecessary in general, not
> > just in this particular KVM case. Hmm? Thomas, see the commentary
> > of a76beb14123a ("KVM: Fix device assignment threaded irq handler").
>
> Groan. We already discussed to let the irq chip (in this case MSI)
> tell the core that it does not need the extra oneshot handling. That
> way the code which requests a threaded irq with the NULL primary
> handler works on both MSI and normal interrupts.

That's the kind of stuff which makes me go berserk, and just for the
record: the most complaints I get for going berserk are coming from the
virt folks.

I really can't understand why the virt folks think they are special and
do not have to talk to core maintainers. Back then when I was doing the
big irq cleanup, virt crap stood out by far with the most
idiotic^Wcreative workarounds. Of course nobody complained about me
being moronic enough to come up with generic solutions for their
problems.

Though especially that commit including its changelog proves once again
the ignorance and disinterest of the virt crowd in solving problems
which are not only relevant to themselves.

I wish you'd just refused to pull that nonsense and instead made them
talk to those folks who had a proper solution in mind already. In fact
we could have solved that properly weeks ago, if only people would have
raised the issue.

I'm tired of that kind of crap, really.

	Thomas
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Sun, 3 Jun 2012, Avi Kivity wrote:
> On 06/01/2012 09:26 PM, Jan Kiszka wrote:
> > > you suggesting we need a request_edge_threaded_only_irq() API?
> > > Thanks,
> >
> > I'm just wondering if that restriction for threaded IRQs is really
> > necessary for all use cases we have. Threaded MSIs do not appear to
> > me like have to be handled that conservatively, but maybe I'm
> > missing some detail.
>
> btw, I'm hoping we can unthread assigned MSIs. If the delivery is
> unicast, we can precalculate everything and all the handler has to do
> is set the IRR, KVM_REQ_EVENT, and kick the vcpu. All of these can be
> done from interrupt context with just RCU locking.

There is really no need to run MSI/MSI-X interrupts threaded for KVM.
I'm running the patch below for quite some time and it works like a
charm.

Thanks,

	tglx

Index: linux-2.6/virt/kvm/assigned-dev.c
===================================================================
--- linux-2.6.orig/virt/kvm/assigned-dev.c
+++ linux-2.6/virt/kvm/assigned-dev.c
@@ -105,7 +105,7 @@ static irqreturn_t kvm_assigned_dev_thre
 }
 
 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msi_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 
@@ -117,7 +117,7 @@ static irqreturn_t kvm_assigned_dev_thre
 #endif
 
 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+static irqreturn_t kvm_assigned_dev_msix_handler(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 	int index = find_index_from_host_irq(assigned_dev, irq);
@@ -346,9 +346,8 @@ static int assigned_device_enable_host_m
 	}
 
 	dev->host_irq = dev->dev->irq;
-	if (request_threaded_irq(dev->host_irq, NULL,
-				 kvm_assigned_dev_thread_msi, 0,
-				 dev->irq_name, dev)) {
+	if (request_irq(dev->host_irq, kvm_assigned_dev_msi_handler, 0,
+			dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -373,9 +372,9 @@ static int assigned_device_enable_host_m
 		return r;
 
 	for (i = 0; i < dev->entries_nr; i++) {
-		r = request_threaded_irq(dev->host_msix_entries[i].vector,
-					 NULL, kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, dev);
+		r = request_irq(dev->host_msix_entries[i].vector,
+				kvm_assigned_dev_msix_handler, 0,
+				dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Mon, 4 Jun 2012, Jan Kiszka wrote:
> On 2012-06-04 13:21, Thomas Gleixner wrote:
> > So this shortcut requires some checks before being applied to a
> > specific MSI/MSI-X vector.
>
> Taking KVM aside, my general question remains if threaded MSI
> handlers of all devices really need to apply IRQF_ONESHOT though they
> should have no use for it.

In theory no, but we had more than one incident, where threaded irqs
w/o a primary handler and w/o IRQF_ONESHOT lead to full system
starvation. Linus requested this sanity check and I think it's sane and
required.

In fact it's a non issue for MSI. MSI uses handle_edge_irq which does
not mask the interrupt. IRQF_ONESHOT is a noop for that flow handler.

Thanks,

	tglx
Re: [PATCH] KVM: Use IRQF_ONESHOT for assigned device MSI interrupts
On Mon, 4 Jun 2012, Jan Kiszka wrote:
> On 2012-06-04 15:07, Thomas Gleixner wrote:
> > On Mon, 4 Jun 2012, Jan Kiszka wrote:
> > > On 2012-06-04 13:21, Thomas Gleixner wrote:
> > > > So this shortcut requires some checks before being applied to a
> > > > specific MSI/MSI-X vector.
> > >
> > > Taking KVM aside, my general question remains if threaded MSI
> > > handlers of all devices really need to apply IRQF_ONESHOT though
> > > they should have no use for it.
> >
> > In theory no, but we had more than one incident, where threaded
> > irqs w/o a primary handler and w/o IRQF_ONESHOT lead to full system
> > starvation. Linus requested this sanity check and I think it's sane
> > and required.
>
> OK.
>
> > In fact it's a non issue for MSI. MSI uses handle_edge_irq which
> > does not mask the interrupt. IRQF_ONESHOT is a noop for that flow
> > handler.
>
> Isn't irq_finalize_oneshot processed for all flows?

Right, forgot about that.

The only way we can avoid that, is that we get a hint from the
underlying irq chip / handler setup with an extra flag to tell the
core, that it's safe to avoid the ONESHOT/finalize magic.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
> On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > +	if (address == msi_start + PCI_MSI_DATA_32)
> > +		handle_cfg_write_msi(pci_dev, assigned_dev);
>
> Why didn't we just use range_covers_byte(address, len,
> pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does this
> handle the enable bit?

The problem with the current implementation is that it only changes the
routing if the msi entry goes from masked to unmasked state.

Linux does not mask the entries on affinity changes and never did,
neither for MSI nor for MSI-X. I know it's probably not according to
the spec, but we can't fix that retroactively.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
> On Thu, 2012-05-24 at 18:53 -0300, Jan Kiszka wrote:
> > On 2012-05-24 18:39, Thomas Gleixner wrote:
> > > On Thu, 24 May 2012, Alex Williamson wrote:
> > > > On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > > > > +	if (address == msi_start + PCI_MSI_DATA_32)
> > > > > +		handle_cfg_write_msi(pci_dev, assigned_dev);
> > > >
> > > > Why didn't we just use range_covers_byte(address, len,
> > > > pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does
> > > > this handle the enable bit?
> > >
> > > The problem with the current implementation is that it only
> > > changes the routing if the msi entry goes from masked to unmasked
> > > state.
> > >
> > > Linux does not mask the entries on affinity changes and never
> > > did, neither for MSI nor for MSI-X. I know it's probably not
> > > according to the spec, but we can't fix that retroactively.
> >
> > For MSI, this is allowed. For MSI-X, this would clearly be a Linux
> > bug, waiting for hardware to dislike this spec violation. However,
> > if this is the current behavior of such a prominent guest, I guess
> > we have to stop optimizing the QEMU MSI-X code that it only updates
> > routings on mask changes. Possibly other OSes get this wrong too...
> > Thanks, for the clarification. Should go into the changelog.
>
> Hmm, if Linux didn't mask MSI-X before updating vectors it'd not only
> be a spec violation, but my testing of the recent changes to fix
> MSI-X vector updates for exactly this would have failed...
>
>	} else if (msix_masked(&orig) && !msix_masked(&entry)) {
>		... update vector ...
>
> So I'm not entirely sure I believe that. Thanks,

What happens is:

A write to /proc/irq/$N/smp_affinity calls into irq_set_affinity(),
which does:

	if (irq_can_move_pcntxt(data)) {
		ret = chip->irq_set_affinity(data, mask, false);
	} else {
		irqd_set_move_pending(data);
		irq_copy_pending(desc, mask);
	}

MSI and MSI-X fall into the !irq_can_move_pcntxt() code path unless the
irq is remapped, which is not the case in a guest. That means that we
merely copy the new mask and set the move pending bit.

MSI/MSI-X use the edge handler, so on the next incoming interrupt we do
irq_desc->chip->irq_ack(), which ends up in ack_apic_edge(), which does:

	static void ack_apic_edge(struct irq_data *data)
	{
		irq_complete_move(data->chip_data);
		irq_move_irq(data);
		ack_APIC_irq();
	}

irq_move_irq() is the interesting function. And that does
irq_desc->chip->irq_mask() before calling the irq_set_affinity()
function which actually changes the masks.

->irq_mask() ends up in mask_msi_irq(). Now that calls
msi_set_mask_bit() and for MSI-X that actually masks the irq. So ignore
my MSI-X comment.

But for MSI this ends up in msi_mask_irq() which does:

	if (!desc->msi_attrib.maskbit)
		return 0;

So in case desc->msi_attrib.maskbit is 0 we do not write anything out
and then the masked/unmasked logic in qemu fails.

Sorry, that I did not decode that down to this level before, but I was
in a hurry and assumed correctly that qemu is doing something wrong.
Not being familiar with that code did not help either :)

So the proper fix is that qemu tells the guest that the mask bit is
supported and catches the mask bit toggling before writing it out to
the hardware for those devices which do not support it.

We'll have another look.

Thanks,

	tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Fri, 25 May 2012, Michael S. Tsirkin wrote:
> On Thu, May 24, 2012 at 06:53:15PM -0300, Jan Kiszka wrote:
> > On 2012-05-24 18:39, Thomas Gleixner wrote:
> > > On Thu, 24 May 2012, Alex Williamson wrote:
> > > > On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:
> > > > > +	if (address == msi_start + PCI_MSI_DATA_32)
> > > > > +		handle_cfg_write_msi(pci_dev, assigned_dev);
> > > >
> > > > Why didn't we just use range_covers_byte(address, len,
> > > > pci_dev->msi_cap + PCI_MSI_DATA_32) to start with? But how does
> > > > this handle the enable bit?
> > >
> > > The problem with the current implementation is that it only
> > > changes the routing if the msi entry goes from masked to unmasked
> > > state.
> > >
> > > Linux does not mask the entries on affinity changes and never
> > > did, neither for MSI nor for MSI-X. I know it's probably not
> > > according to the spec, but we can't fix that retroactively.
> >
> > For MSI, this is allowed. For MSI-X, this would clearly be a Linux
> > bug, waiting for hardware to dislike this spec violation. However,
> > if this is the current behavior of such a prominent guest, I guess
> > we have to stop optimizing the QEMU MSI-X code that it only updates
> > routings on mask changes. Possibly other OSes get this wrong too...
>
> Very strange, a clear spec violation. I'll have to dig in the source
> to verify this.

Stop digging. MSI-X is correct.
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Fri, 25 May 2012, Thomas Gleixner wrote:
On Fri, 25 May 2012, Michael S. Tsirkin wrote:
On Thu, May 24, 2012 at 06:53:15PM -0300, Jan Kiszka wrote:
On 2012-05-24 18:39, Thomas Gleixner wrote:
On Thu, 24 May 2012, Alex Williamson wrote:
On Thu, 2012-05-24 at 18:02 +0100, Richard Weinberger wrote:

	+if (address == msi_start + PCI_MSI_DATA_32)
	+	handle_cfg_write_msi(pci_dev, assigned_dev);

Why didn't we just use range_covers_byte(address, len, pci_dev->msi_cap + PCI_MSI_DATA_32) to start with?

But how does this handle the enable bit? The problem with the current implementation is that it only changes the routing if the msi entry goes from masked to unmasked state. Linux does not mask the entries on affinity changes and never did, neither for MSI nor for MSI-X. I know it's probably not according to the spec, but we can't fix that retroactively.

For MSI, this is allowed. For MSI-X, this would clearly be a Linux bug, waiting for hardware to dislike this spec violation. However, if this is the current behavior of such a prominent guest, I guess we have to stop optimizing the QEMU MSI-X code that it only updates routings on mask changes. Possibly other OSes get this wrong too...

Very strange, a clear spec violation. I'll have to dig in the source to verify this.

Stop digging. MSI-X is correct.

This was based off an older version of qemu-kvm, where the routing for MSI-X was broken for other reasons. But that seems to be fixed now. I use the age excuse :)

Thanks, tglx
Re: [PATCH 2/2] Device assignment: Fix MSI IRQ affinity setting
On Thu, 24 May 2012, Alex Williamson wrote:
On Fri, 2012-05-25 at 01:01 +0200, Thomas Gleixner wrote:

So the proper fix is that qemu tells the guest that mask bit is supported and catches the mask bit toggling before writing it out to the hardware for those devices which do not support it.

We can't necessarily do that, we have to work with the config space we're given. Using the smallest possible MSI capability always works. Adding mask bits may not fit in with the existing capabilities of the physical device. Thanks,

I see what you mean. A random device driver of a random guest OS might rely on that information. Unlikely, but ...

So we need some logic to circumvent the masked/unmasked logic in case that property is not set, right ?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Avi Kivity wrote:
On 05/22/2012 08:26 PM, Thomas Gleixner wrote:
On Tue, 22 May 2012, Avi Kivity wrote:
On 05/22/2012 12:04 AM, Thomas Gleixner wrote:

The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

The APIC page is in fact mapped to the hardware (not the guest, but vmx microcode does access it). Only one register, the TPR, is ever used. It's possible to re-layout the data structure so that the TPR stays in the same place while everything else becomes contiguous, but we'll have to do it again if the hardware starts mapping more registers.

I would avoid that by having a compressed version which reflects the SW state and the mapped one which allows the vmx microcode to fiddle with the TPR. If you need more registers in the HW page then you don't have to worry about the layout and just have a proper accessor for that.

That works, but replaces one problem with another: now we have two sources for the same data, and need to juggle between them depending on register number (either synchronizing in both directions, or special casing); so you're simplifying one thing at the expense of the other. If the microcode starts accessing more registers, then having two layouts becomes even uglier.

Fair enough :)
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Avi Kivity wrote:
On 05/23/2012 05:48 PM, Ingo Molnar wrote:

This is silly. Most of the time the kernel is advanced by incremental patches. Sometimes it is advanced by minor or major refactoring. It is never advanced by personal attacks on contributors.

Thomas wasn't so much doing a personal attack, it was pointing out stupidity and then it was mocking the repeated stupidity.

He very politely explained his point of view (with which I agree),

I guess we disagree on what is polite or not. Mocking, for example, isn't part of it in my book.

I really avoid flaming as far as it goes, but I consider that ignoring a thorough patch review and replying only on the very obvious problem is a massive form of impoliteness.

Replying on a still polite reminder with a sloppy "I just took what's there and implemented the optimization which I was tasked with" is even more of an offense. It clearly shows that there is no interest in making the whole thing better and just aims for quick and dirty "here are my benchmark results" success. This is what actually triggered me to switch into grumpy mode.

I'm really tired of that attitude. It's the root cause for the steadily increasing mess in the kernel. It forces people who actually care to waste an endless amount of time to clean up what has been just slapped into the code base quick and dirty. And you expect those people to stay calm and polite if they get ignored and ridiculed with sloppy replies? You may be a saint, but I'm definitely too old to cope with that.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, H. Peter Anvin wrote:
On 05/23/2012 11:37 AM, Thomas Gleixner wrote:

That works, but replaces one problem with another: now we have two sources for the same data, and need to juggle between them depending on register number (either synchronizing in both directions, or special casing); so you're simplifying one thing at the expense of the other. If the microcode starts accessing more registers, then having two layouts becomes even uglier.

Fair enough :)

Yes, the µcode accessing this data structure directly probably falls under the category of a legitimate need to stick to the hardware format.

Thought more about that. We have a clear distinction between HW accessed data and software accessed data. If I look at TPR then it is special cased already and it does:

	case APIC_TASKPRI:
		report_tpr_access(apic, false|true);
		/* fall thru */

And the fall through is using the general accessor for all not special cased registers. So all you have to do is

	case APIC_TASKPRI:
		report_tpr_access(apic, false|true);
	+	return access_mapped_reg(...);

Instead of the fall through. So there is no synchronizing back and forth problem simply because you already have a special case for that register.

I know you'll argue that the tpr reporting is a special hack for windows guests, at least that's what the changelog tells. But even if we have a few more registers accessed by hardware and if they do not require a special casing, I really doubt that the overhead of special casing those few regs will be worse than not having the obvious optimization in place.

And looking deeper it's a total non issue. The apic mapping is 4k. The register stride is strictly 0x10. That makes a total of 256 possible registers. So now you have two possibilities:

1) Create a 256 bit == 32 byte bitfield to select the one or the other representation. The overhead of checking the bit is not going to be observable.

2) Create a 256 function pointer array == 2k resp. 1k (64 / 32bit). That's not a lot of memory even if you have to maintain two separate variants for read and write, but it allows you to get rid of the already horribly compiled switch case in apic_read/write and you'll get the optional stuff like report_tpr_access() w/o extra conditionals just for free.

An extra goodie is that you can catch any access to a non existing register which you now just silently ignore. And that allows you to react on any future hardware oddities without adding a single runtime conditional.

This is strictly x86 and x86 is way better at dealing with indirect calls than with the mess gcc creates compiling those switch case constructs. So I'd go for that and rather spend the memory and the time in setting up the function pointers on init/ioctl than dealing with the inconsistency of HW/SW representation with magic hacks.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Wed, 23 May 2012, Michael S. Tsirkin wrote:
On Wed, May 23, 2012 at 10:10:27PM +0200, Thomas Gleixner wrote:

Replying on a still polite reminder with a sloppy "I just took what's there and implemented the optimization which I was tasked with" is even more of an offense.

Ow. That 'not my fault' line was a joke.

Which I obviously missed to get after spending quite some time to grok the whole thing. I still don't get it. Must be a cultural or an age thing or both.
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Avi Kivity wrote:
On 05/22/2012 12:04 AM, Thomas Gleixner wrote:

The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

The APIC page is in fact mapped to the hardware (not the guest, but vmx microcode does access it). Only one register, the TPR, is ever used. It's possible to re-layout the data structure so that the TPR stays in the same place while everything else becomes contiguous, but we'll have to do it again if the hardware starts mapping more registers.

I would avoid that by having a compressed version which reflects the SW state and the mapped one which allows the vmx microcode to fiddle with the TPR. If you need more registers in the HW page then you don't have to worry about the layout and just have a proper accessor for that.

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Mon, 21 May 2012, Michael S. Tsirkin wrote:

We perform ISR lookups twice: during interrupt injection and on EOI. Typical workloads only have a single bit set there. So we can avoid ISR scans by

1. counting bits as we set/clear them in ISR
2. if count is 1, caching the vector number
3. if count != 1, invalidating the cache

The real purpose of this is enabling PV EOI which needs to quickly validate the vector. But non PV guests also benefit.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 arch/x86/kvm/lapic.c | 51 -
 arch/x86/kvm/lapic.h |  2 +
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..232950a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -107,6 +107,16 @@ static inline void apic_clear_vector(int vec, void *bitmap)
 	clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }

+static inline int __apic_test_and_set_vector(int vec, void *bitmap)
+{
+	return __test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
+{
+	return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
 static inline int apic_hw_enabled(struct kvm_lapic *apic)
 {
 	return (apic)->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
@@ -210,6 +220,16 @@ static int find_highest_vector(void *bitmap)
 	return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
 }

+static u8 count_vectors(void *bitmap)
+{
+	u32 *word = bitmap;
+	int word_offset;
+	u8 count = 0;
+	for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+		count += hweight32(word[word_offset << 2]);
+	return count;
+}

Why do you need to reimplement bitmap_weight()?

Because your bitmap is not a bitmap, but a set of 32bit registers which have an offset of 4 words each. I really had to twist my brain around this function and the test_and_clr/set ones above just because the name bitmap is so horribly misleading. Add an extra bonus for using void pointers.

Yes, I know, the existing code is full of this, but that's not an excuse to add more of it.

This emulation is just silly. Nothing in an emulation where the access to the emulated hardware is implemented with vmexits is requiring a 1:1 memory layout. It's completely irrelevant whether the hardware has an ISR regs offset of 0x10 or not. Nothing prevents the emulation to use a consecutive bitmap for the vector bits instead of reimplementing a boatload of bitmap operations. The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

And if you look deeper, then you'll notice that _ALL_ APIC registers are on a 16 byte boundary (thanks Peter for pointing that out). So it's even more silly to have a 1:1 representation instead of implementing the default emulated apic_read/write functions to access (offset >> 2).

And of course, that design decision causes lookups to be slow. It's way faster to scan a consecutive bitmap than iterating through eight 32bit words with an offset of 0x10, especially on a 64 bit host.

 static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
 {
 	apic->irr_pending = true;
@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

+	if (likely(apic->isr_count == 1))
+		apic->isr_cache = vec;
+	else
+		apic->isr_cache = -1;
+}
+
+static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
+{
+	if (__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
+		--apic->isr_count;
+	ASSERT(apic->isr_count >= 0);

Ditto.

 	result = find_highest_vector(apic->regs + APIC_ISR);
 	ASSERT(result == -1 || result >= 16);

And obviously none of the folks who added this gem bothered to define DEBUG either.

So please instead of working around horrid design decisions and adding more mess to the existing one, can we please clean up the stuff first and then decide whether it's worth to add the extra magic?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems
Re: [PATCH] kvm: optimize ISR lookups
Michael,

On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :)

Interesting. The other changes you added are not your fault? So you provided a patch which you didn't author completely. You merely added the bogus ASSERTs, right ?

Can you please explain why there is only your SOB on that patch?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :) Seriously, if Avi/Marcelo want to rewrite the ISR emulation

Interesting POV, really. Did you ever notice that the kernel is a collaborative effort and not controlled by Avi/Marcelo?

Did you ever notice that arch/x86/kvm is part of arch/x86? And if you did, did you notice that there are maintainers responsible for that code base?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Tue, May 22, 2012 at 12:14:18AM +0200, Thomas Gleixner wrote:
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:

@@ -242,6 +262,25 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	apic->irr_pending = true;
 }

+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+	if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+		++apic->isr_count;
+	ASSERT(apic->isr_count <= MAX_APIC_VECTOR);

I'm really curious what you observed when defining DEBUG in that file. Clearly you never did.

Sorry :( Yes clearly silly, thanks for pointing this out.

That's all you have a reply for? That's the least of the problems

Others are not my fault :) Seriously, if Avi/Marcelo want to rewrite the ISR emulation I can help. If they want to keep it 1:1 with hardware then what I wrote seems the only way.

Seriously. You submitted a code monkey patch, solving a problem by curing the symptom, but not the root cause. And then you hide behind Avi and Marcelo?

Did you ever think about the real problem of that lapic emulation? Let's assume you did and it occurred to you that the whole thing is wrong and backwards, then you still insist on adding more bullshit to that file instead of optimizing and fixing it in the first place?

Thanks, tglx
Re: [PATCH] kvm: optimize ISR lookups
On Tue, 22 May 2012, Michael S. Tsirkin wrote:
On Mon, May 21, 2012 at 11:04:25PM +0200, Thomas Gleixner wrote:
On Mon, 21 May 2012, Michael S. Tsirkin wrote:

+static u8 count_vectors(void *bitmap)
+{
+	u32 *word = bitmap;
+	int word_offset;
+	u8 count = 0;
+	for (word_offset = 0; word_offset < MAX_APIC_VECTOR >> 5; ++word_offset)
+		count += hweight32(word[word_offset << 2]);
+	return count;
+}

Why do you need to reimplement bitmap_weight()? Because your bitmap is not a bitmap,

It's an existing bitmap. It's not my bitmap :)

Dammit. You added a function:

+static u8 count_vectors(void *bitmap)

So it's YOUR bitmap. It's YOUR fault that you copied mindlessly the existing crap. And you just copied it to push performance without even thinking about the underlying problems. And now you expect me to accept this just because someone else missed to use his brain when implementing the existing nonsense ?

Yes, I know, the existing code is full of this, but that's not an excuse to add more of it.

There's no other way to use existing code. If current code is reworked to use bitmap.h then my patch can use it too.

Then sit down and rework the existing code instead of insisting on something which makes the code worse than it is already.

This emulation is just silly. Nothing in an emulation where the access to the emulated hardware is implemented with vmexits is requiring a 1:1 memory layout. It's completely irrelevant whether the hardware has an ISR regs offset of 0x10 or not. Nothing prevents the emulation to use a consecutive bitmap for the vector bits instead of reimplementing a boatload of bitmap operations. The only justification for having the same layout as the actual hardware is when you are going to map the memory into the guest space, which is not the case here.

I think the reason is __apic_read which now simply copies the registers out to guest, this code will become less straight-forward if it's not 1:1.

Oh yes. You didn't even read the few lines below, which explain what's wrong and why the existing code is less straight forward than optimized code.

And if you look deeper, then you'll notice that _ALL_ APIC registers are on a 16 byte boundary (thanks Peter for pointing that out). So it's even more silly to have a 1:1 representation instead of implementing the default emulated apic_read/write functions to access (offset >> 2). And of course, that design decision causes lookups to be slow.

Yes, it might be one of the reasons why my patch helps so much: it adds a cache in front of this data structure.

Obviously you didn't even try to answer my observations/questions, you are just justifying your hackery.

So what you propose is in fact to rework apic registers at least for ISR,IRR,TMR to use a bitmap.

This is the final confirmation that you never tried to understand my reply. Hint: s/at least.*//

I am fine with this suggestion but would like some feedback from kvm maintainers on where they want to go before I spend time on that.

Are the kvm maintainers controlling your common sense or what ?

Thanks, tglx
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012, Ingo Molnar wrote:

* Avi Kivity a...@redhat.com wrote:

PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit.

I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile.

I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well. If both Xen and KVM makes good use of it then that's a different matter.

Aside of that, it's kinda strange that a dude named Nikunj is referenced in the argument chain, but I can't find him on the CC list.

Thanks, tglx
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On Sun, 1 Apr 2012, Avi Kivity wrote:
On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
On Fri, 30 Mar 2012, H. Peter Anvin wrote:

What is the current status of this patchset? I haven't looked at it too closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach or not. If I look at it from the non virtualized kernel side then this is ass backwards. We know already that we are holding a spinlock which might cause other (v)cpus going into eternal spin. The non virtualized kernel solves this by disabling preemption and therefore getting out of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT kernels are observing where non raw spinlocks are turned into sleeping spinlocks and therefore can cause throughput issues for non RT workloads. Though the virtualized situation is even worse. Any preempted guest section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but not solve it. With massive overcommit there is always a way to trigger worst case scenarios unless you are educating the scheduler to cope with that.

So if we need to fiddle with the scheduler and frankly that's the only way to get a real gain (the numbers, which are achieved by this patches, are not that impressive) then the question arises whether we should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check that in the guest just by looking at some global state. So if a guest exits due to an external event it's easy to inspect the state of that guest and avoid to schedule away when it was interrupted in a spinlock held section. That guest/host shared state needs to be modified to indicate the guest to invoke an exit when the last nested lock has been released.

Interesting idea (I think it has been raised before btw, don't recall by who).

Someone posted a pointer to that old thread.

One thing about it is that it can give many false positives. Consider a fine-grained spinlock that is being accessed by many threads. That is, the lock is taken and released with high frequency, but there is no contention, because each vcpu is accessing a different instance. So the host scheduler will needlessly delay preemption of vcpus that happen to be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there, but per cpu stuff might be worth annotating to avoid this.

A second issue may happen with a lock that is taken and released with high frequency, with a high hold percentage. The host scheduler may always sample the guest in a held state, leading it to conclude that it's exceeding its timeout when in fact the lock is held for a short time only.

Well, no. You can avoid that.

    host                        VCPU
                                spin_lock()
                                 modify_global_state()
    exit
    check_global_state()
    mark_global_state()
    reschedule vcpu
                                spin_unlock()
                                 check_global_state()
                                  trigger_exit()

So when an exit occurs in the locked section, then the host can mark the global state to tell the guest to issue a trap on unlock.

Of course this needs to be time bound, so a rogue guest cannot monopolize the cpu forever, but that's the least to worry about problem simply because a guest which does not get out of a spinlocked region within a certain amount of time is borked and eligible to killing anyway.

Hopefully not killing! Just because a guest doesn't scale well, or even if it's deadlocked, doesn't mean it should be killed. Just preempt it. :)

Thoughts ?

It's certainly interesting. Maybe a combination is worthwhile - prevent lockholder preemption for a short period of time AND put waiters to sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock ordering support to avoid that guests are completely starved.

Thanks, tglx
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On Fri, 30 Mar 2012, H. Peter Anvin wrote:

What is the current status of this patchset? I haven't looked at it too closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach or not. If I look at it from the non virtualized kernel side then this is ass backwards. We know already that we are holding a spinlock which might cause other (v)cpus going into eternal spin. The non virtualized kernel solves this by disabling preemption and therefore getting out of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT kernels are observing where non raw spinlocks are turned into sleeping spinlocks and therefore can cause throughput issues for non RT workloads. Though the virtualized situation is even worse. Any preempted guest section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but not solve it. With massive overcommit there is always a way to trigger worst case scenarios unless you are educating the scheduler to cope with that.

So if we need to fiddle with the scheduler and frankly that's the only way to get a real gain (the numbers, which are achieved by this patches, are not that impressive) then the question arises whether we should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check that in the guest just by looking at some global state. So if a guest exits due to an external event it's easy to inspect the state of that guest and avoid to schedule away when it was interrupted in a spinlock held section. That guest/host shared state needs to be modified to indicate the guest to invoke an exit when the last nested lock has been released.

Of course this needs to be time bound, so a rogue guest cannot monopolize the cpu forever, but that's the least to worry about problem simply because a guest which does not get out of a spinlocked region within a certain amount of time is borked and eligible to killing anyway.

Thoughts ?

Thanks, tglx
Re: x86: kvmclock: abstract save/restore sched_clock_state
On Tue, 7 Feb 2012, Marcelo Tosatti wrote:

Upon resume from hibernation, CPU 0's hvclock area contains the old values for system_time and tsc_timestamp. It is necessary for the hypervisor to update these values with up-to-date ones before the CPU uses them.

Abstract TSC's save/restore sched_clock_state functions and use restore_state to write to KVM_SYSTEM_TIME MSR, forcing an update. Fixes suspend-to-disk with kvmclock.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 15d9915..c91e8b9 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -61,7 +61,7 @@ extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 extern int notsc_setup(char *);

-extern void save_sched_clock_state(void);
-extern void restore_sched_clock_state(void);
+extern void tsc_save_sched_clock_state(void);
+extern void tsc_restore_sched_clock_state(void);

 #endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5d0afac..baaca8d 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -162,6 +162,8 @@ struct x86_cpuinit_ops {
  * @is_untracked_pat_range	exclude from PAT logic
  * @nmi_init			enable NMI on cpus
  * @i8042_detect		pre-detect if i8042 controller exists
+ * @save_sched_clock_state:	save state for sched_clock() on suspend
+ * @restore_sched_clock_state:	restore state for sched_clock() on resume
  */
 struct x86_platform_ops {
 	unsigned long (*calibrate_tsc)(void);
@@ -173,6 +175,8 @@ struct x86_platform_ops {
 	void (*nmi_init)(void);
 	unsigned char (*get_nmi_reason)(void);
 	int (*i8042_detect)(void);
+	void (*save_sched_clock_state)(void);
+	void (*restore_sched_clock_state)(void);
 };

 struct pci_dev;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ca4e735..57e6b78 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -136,6 +136,15 @@ int kvm_register_clock(char *txt)
 	return ret;
 }

+void kvm_save_sched_clock_state(void)

static ?

+{
+}
+
+void kvm_restore_sched_clock_state(void)

Ditto

+{
+	kvm_register_clock("primary cpu clock, resume");
+}
+

Otherwise:

Reviewed-by: Thomas Gleixner t...@linutronix.de
Re: [PATCH] Introduce x86_cpuinit.early_percpu_clock_init hook
On Tue, 7 Feb 2012, Igor Mammedov wrote:

> When a kvm guest uses kvmclock, it may hang on vcpu hot-plug. This is
> caused by an overflow in pvclock_get_nsec_offset,
>
>     u64 delta = tsc - shadow->tsc_timestamp;
>
> which in turn is caused by undefined values from the percpu hv_clock
> that hasn't been initialized yet. The uninitialized clock on the cpu
> being booted is accessed from
>
>    start_secondary
>     -> smp_callin
>       -> smp_store_cpu_info
>         -> identify_secondary_cpu
>           -> mtrr_ap_init
>             -> mtrr_restore
>               -> stop_machine_from_inactive_cpu
>                 -> queue_stop_cpus_work
>                 ...
>                   -> sched_clock
>                     -> kvm_clock_read
>
> which is well before the x86_cpuinit.setup_percpu_clockev call in
> start_secondary, where the percpu clock is initialized.
>
> This patch introduces a hook that allows to setup/initialize the
> per_cpu clock early and avoid an overflow due to reading
>  - undefined values
>  - old values if the cpu was offlined and then onlined again
>
> Another possible early user of this clock source is ftrace, which
> accesses it to get timestamps for ring buffer entries. So if
> mtrr_ap_init is moved from identify_secondary_cpu to past
> x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace may cause
> the same overflow/hang on cpu hot-plug anyway.
>
> More complete description of the problem:
>   https://lkml.org/lkml/2012/2/2/101
>
> Credits to Marcelo Tosatti <mtosa...@redhat.com> for the hook idea.
>
> Signed-off-by: Igor Mammedov <imamm...@redhat.com>
> ---
>  arch/x86/include/asm/x86_init.h |    2 ++
>  arch/x86/kernel/kvmclock.c      |    4 +---
>  arch/x86/kernel/smpboot.c       |    1 +
>  arch/x86/kernel/x86_init.c      |    1 +
>  4 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> index 517d476..5d0afac 100644
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -145,9 +145,11 @@ struct x86_init_ops {
>  /**
>   * struct x86_cpuinit_ops - platform specific cpu hotplug setups
>   * @setup_percpu_clockev:	set up the per cpu clock event device
> + * @early_percpu_clock_init:	early init of the per cpu clock event device

You initialize the per cpu clock, not the per cpu clock event
device. The latter is still initialized via setup_percpu_clockev().

Otherwise

   Acked-by: Thomas Gleixner <t...@linutronix.de>
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 16.12.2010 21:26, Jan Kiszka wrote:
> > Am 16.12.2010 14:13, Thomas Gleixner wrote:
> > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> > > > +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> > > > +		/*
> > > > +		 * Signal the old handler that it has to switch to shareable
> > > > +		 * handling mode. Disable the line to avoid any conflict with
> > > > +		 * a real IRQ.
> > > > +		 */
> > > > +		disable_irq(irq);
> > >
> > > This is weird, really. I thought you wanted to avoid waiting for
> > > the threaded handler to finish if it's on the fly. So this should
> > > be disable_irq_nosync() or did you change your mind ?
> >
> > No, I did not. I wanted to avoid that we set MAKE_SHAREABLE while
> > there might be another IRQ in flight. The handler that is called due
> > to a real IRQ might misinterpret MAKE_SHAREABLE as "there is no real
> > event" and perform the wrong steps (at least the current
> > implementation for KVM would).
>
> Actually, the requirement we have to fulfill here is to avoid that the
> hardirq handler sees !SHARED while the threaded one reads SHARED. To
> achieve this without disabling the line, I'm still searching for a way
> to couple the sharing state of associated hard and threaded handler
> runs - but I think there is no reliable association, is there?

Unfortunately not. So the only way to solve that is disabling the
interrupt which makes sure that all handlers have completed.

OTOH, if we have to disable anyway, then we could simply keep it
disabled across the installation of a new handler. That would make the
notification business go away, wouldn't it ?

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 11:23, Thomas Gleixner wrote:
> > OTOH, if we have to disable anyway, then we could simply keep it
> > disabled across the installation of a new handler. That would make
> > the notification business go away, wouldn't it ?
>
> No, the notification is still necessary in case the registered handler
> keeps the line off after returning from both hard and threaded
> handler.

And how should that happen? If it is in oneshot mode then the line is
reenabled when the thread handler returns.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 11:41, Thomas Gleixner wrote:
> > On Fri, 17 Dec 2010, Jan Kiszka wrote:
> > > Am 17.12.2010 11:23, Thomas Gleixner wrote:
> > > > OTOH, if we have to disable anyway, then we could simply keep it
> > > > disabled across the installation of a new handler. That would
> > > > make the notification business go away, wouldn't it ?
> > >
> > > No, the notification is still necessary in case the registered
> > > handler keeps the line off after returning from both hard and
> > > threaded handler.
> >
> > And how should that happen? If it is in oneshot mode then the line
> > is reenabled when the thread handler returns.
>
> disable_irq_nosync is called by the handler before returning. And it's
> the handler's job to revert this, properly synchronizing it
> internally.

disable_irq_nosync() is really the worst thing to do. That's simply
not going to work without a lot of fuglyness. What about the
following:

primary_handler()
{
	if (!shared)
		return IRQ_WAKE_THREAD;

	spin_lock(&dev->irq_lock);
	if (from_my_device() || dev->irq_thread_waiting) {
		mask_dev();
		dev->masked = true;
		ret = IRQ_WAKE_THREAD;
	} else
		ret = IRQ_NONE;
	spin_unlock(&dev->irq_lock);
	return ret;
}

check_timeout()
{
	if (dev->irq_active && wait_longer())
		return IRQ_WAKE_THREAD;
	return IRQ_HANDLED;
}

unmask_dev_if_necessary()
{
	if (dev->masked && dev->irq_active)
		unmask_dev();
}

threaded_handler()
{
	if (!dev->irq_thread_waiting) {
		spin_lock_irq(&dev->irq_lock);
		wake_user = do_magic_stuff_with_the_dev();
		dev->irq_thread_waiting = wake_user;
		spin_unlock(&dev->irq_lock);
		if (wake_user)
			wake_up(user);
	}

	if (!dev->irq_thread_waiting) {
		spin_lock_irq(&dev->irq_lock);
		unmask_dev_if_necessary();
		spin_unlock(&dev->irq_lock);
		return IRQ_HANDLED;
	}

	/*
	 * Wait for user space to complete. Timeout is to
	 * avoid starvation of the irq line when
	 * something goes wrong
	 */
	wait_for_completion_timeout(&dev->compl, SENSIBLE_TIMEOUT);

	spin_lock_irq(&dev->irq_lock);
	if (timedout) {
		mask_dev();
		dev->masked = true;
		/*
		 * Leave dev->irq_thread_waiting untouched and let
		 * the core code reschedule us when check_timeout
		 * decides it's worth to wait. In any case we leave
		 * the device masked at the device level, so we don't
		 * cause an interrupt storm.
		 */
		ret = check_timeout();
	} else {
		unmask_dev_if_necessary();
		dev->irq_thread_waiting = false;
		ret = IRQ_HANDLED;
	}
	spin_unlock(&dev->irq_lock);
	return ret;
}

userspace_complete()
{
	complete(&dev->irq_compl);
}

Your approach with disable_irq_nosync() is completely flawed, simply
because you try to pretend that your interrupt handler is done, while
it is not done at all. The threaded interrupt handler is done when
user space completes. Everything else is just hacking around the
problem and creates all those nasty transitional problems. The above
code does not have them at all.

The threaded handler does not care at all about the dev_id shared
state encoding and the state transitions. It merely cares about the
device internal status dev->masked. Everything else is handled by the
genirq code and the little status check in the primary handler.

Neither does the user space completion care about it. It just
completes and is completely oblivious of the irq line state/mode. And
really, the user space part should not care at all. It can set some
status before calling complete(), but that's driver specific stuff and
has nothing to do with the irq handling magic.

It requires a few not too intrusive modifications to the genirq code:

 - Rescheduling the thread handler on IRQ_WAKE_THREAD
 - Changing the oneshot finalizing a bit
 - Adding the status manipulations for request/free_irq

Now you might argue that the timeout is ugly, but I don't think it's
ugly at all. You need it anyway in case user space failed completely.
And coming back after 100ms to let the genirq code handle a
disable_irq() or synchronize_irq() is a reasonable request, it's the
error/corner case after all. If there is code which installs/removes
an interrupt handler on the same line every 5ms, then this code
becomes rightfully blocked out for 100ms or such.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> Am 17.12.2010 16:25, Thomas Gleixner wrote:
> > Your approach with disable_irq_nosync() is completely flawed, simply
> > because you try to pretend that your interrupt handler is done,
> > while it is not done at all. The threaded interrupt handler is done
> > when user space completes. Everything else is just hacking around
> > the problem and creates all those nasty transitional problems.
>
> disable_irq_nosync is the pattern currently used in KVM, it's nothing
> new in fact.

That does not make it less flawed :)

> The approach looks interesting but requires separate code for
> non-PCI-2.3 devices, i.e. when we have no means to mask at device
> level.

Why? You can have the same code, you just can't request COND_ONESHOT
handlers for it, so it needs unshared ONESHOT or it won't work at all,
no matter what approach you chose. No device level mask, no sharing,
it's that simple.

> Further drawbacks - unless I missed something on first glance:
>  - prevents any future optimizations that would work without IRQ
>    thread ping-pong (ie. once we allow guest IRQ injection from
>    hardirq context for selected but typical setups)
>  - two additional, though light-weight, context switches on each
>    interrupt completion

The drawback of these two points is way less than the horror which you
need to introduce to do the whole async handler disable, userspace
enable dance. Robust and simple solutions are really preferred over
complex and fragile horror which has a questionable runtime benefit.

>  - continuous polling if user space decides to leave the interrupt
>    unhandled (e.g. because the virtual IRQ line is masked)

That should be a solvable problem.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> +		/*
> +		 * Signal the old handler that it has to switch to shareable
> +		 * handling mode. Disable the line to avoid any conflict with
> +		 * a real IRQ.
> +		 */
> +		disable_irq(irq);

This is weird, really. I thought you wanted to avoid waiting for the
threaded handler to finish if it's on the fly. So this should be
disable_irq_nosync() or did you change your mind ?

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 09:05, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 22:46, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > From: Jan Kiszka <jan.kis...@siemens.com>
> > > > >
> > > > >  	chip_bus_lock(desc);
> > > > >  	retval = __setup_irq(irq, desc, action);
> > > > >  	chip_bus_sync_unlock(desc);
> > > > >
> > > > > -	if (retval)
> > > > > +	if (retval) {
> > > > > +		if (desc->action && !desc->action->next)
> > > > > +			desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > >
> > > > This is redundant. IRQS_SHARED gets set in a code path where all
> > > > checks are done already.
> > >
> > > Nope, it's also set before entry of __setup_irq in case we call an
> > > IRQF_ADAPTIVE handler. We need to set it that early as we may race
> > > with IRQ events for the already registered handler happening
> > > between the sharing notification and the actual registration of
> > > the second handler.
> >
> > Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do
> > the notification.
>
> For notification, yes. But we need SHARED once we reenable the line
> after the notification.

Darn. Will think more about it.
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > >  	/* Make sure it's not being used on another CPU: */
> > >  	synchronize_irq(irq);
> > >
> > > +	if (single_handler)
> > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > +
> >
> > What's the reason to clear this flag outside of the desc->lock held
> > region.
>
> We need to synchronize the irq first before clearing the flag. The
> problematic scenario behind this: An IRQ started in shared mode, thus
> the line was unmasked after the hardirq. Now we clear IRQS_SHARED
> before calling into the threaded handler. And that handler may now
> think that the line is still masked as IRQS_SHARED is set.

That should read "not set" I guess. Hmm, needs more thoughts :(

> > I need this status for other purposes as well, where I definitely
> > need serialization.
>
> Well, two options: wrap all bit manipulations with desc->lock
> acquisition/release or turn drv_status into an atomic. I don't know
> what your plans with drv_status are, so...

Some bits for irq migration and other stuff, which allows us to avoid
fiddling with irqdesc in the drivers.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > > > >  	/* Make sure it's not being used on another CPU: */
> > > > >  	synchronize_irq(irq);
> > > > >
> > > > > +	if (single_handler)
> > > > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > > > +
> > > >
> > > > What's the reason to clear this flag outside of the desc->lock
> > > > held region.
> > >
> > > We need to synchronize the irq first before clearing the flag. The
> > > problematic scenario behind this: An IRQ started in shared mode,
> > > thus the line was unmasked after the hardirq. Now we clear
> > > IRQS_SHARED before calling into the threaded handler. And that
> > > handler may now think that the line is still masked as IRQS_SHARED
> > > is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.

Right.

As a side note, the current implementation requires that you look up
irq_data.drv_status for every invocation of the handler or have a
reference to irq_data.drv_status somewhere locally, which I don't like
either. I have an evil and nasty idea how to avoid that, need to look
how ugly it gets. Worst case we need to go back to that notification
thing which I really wanted to avoid in the first place. Though I like
the register_mutex idea which came out of this a lot, as it allows us
to reduce desc->lock held and interrupt disabled regions quite nicely.

/me goes back to stare at the code

> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.

Bah, my brain became pain resistant when I started hacking that code.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> > > Am 14.12.2010 21:54, Thomas Gleixner wrote:
> > > > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> > > > > @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> > > > >  	/* Make sure it's not being used on another CPU: */
> > > > >  	synchronize_irq(irq);
> > > > >
> > > > > +	if (single_handler)
> > > > > +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> > > > > +
> > > >
> > > > What's the reason to clear this flag outside of the desc->lock
> > > > held region.
> > >
> > > We need to synchronize the irq first before clearing the flag. The
> > > problematic scenario behind this: An IRQ started in shared mode,
> > > thus the line was unmasked after the hardirq. Now we clear
> > > IRQS_SHARED before calling into the threaded handler. And that
> > > handler may now think that the line is still masked as IRQS_SHARED
> > > is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.
>
> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.

Talking about headache. Your solution above does not prevent that
scenario:

    CPU 0                       CPU 1

    synchronize_irq();
                                hard irq comes in
                                sees shared and unmasks
    clear IRQS_SHARED
                                thread handler runs
                                and sees !SHARED

Same scenario, just moved by a few lines :)

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> Am 15.12.2010 16:41, Thomas Gleixner wrote:
> > Talking about headache. Your solution above does not prevent that
> > scenario:
> >
> >     CPU 0                       CPU 1
> >
> >     synchronize_irq();
> >                                 hard irq comes in
> >                                 sees shared and unmasks
>
> Nope, IRQ_ONESHOT is already cleared at that point.

Errm ? It's set. Could you please stop to increase my confusion ? :)
Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +/**
> + * get_irq_status - read interrupt line status word
> + * @irq:	Interrupt line of the status word
> + *
> + * This returns the current content of the status word associated with
> + * the given interrupt line. See IRQS_* flags for details.
> + */
> +unsigned long get_irq_status(unsigned int irq)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +
> +	return desc ? desc->irq_data.drv_status : 0;
> +}
> +EXPORT_SYMBOL_GPL(get_irq_status);

We should document that this is a snapshot and in no way serialized
against modifications of drv_status. I'll fix up the kernel doc.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>  	/* Make sure it's not being used on another CPU: */
>  	synchronize_irq(irq);
>
> +	if (single_handler)
> +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> +

What's the reason to clear this flag outside of the desc->lock held
region. I need this status for other purposes as well, where I
definitely need serialization.

> +	mutex_lock(&register_lock);
> +
> +	old_action = desc->action;
> +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> +		/*
> +		 * Signal the old handler that it has to switch to shareable
> +		 * handling mode. Disable the line to avoid any conflict with
> +		 * a real IRQ.
> +		 */
> +		disable_irq(irq);
> +		local_irq_disable();
> +
> +		desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;

Unserialized access as well. Will think about it.

> +		old_action->handler(irq, old_action->dev_id);
> +		desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
>  	chip_bus_lock(desc);
>  	retval = __setup_irq(irq, desc, action);
>  	chip_bus_sync_unlock(desc);
>
> -	if (retval)
> +	if (retval) {
> +		if (desc->action && !desc->action->next)
> +			desc->irq_data.drv_status &= ~IRQS_SHARED;

This is redundant. IRQS_SHARED gets set in a code path where all
checks are done already. To make that more obvious we can set it right
before

	raw_spin_unlock_irqrestore(&desc->lock, flags);

conditionally on (shared). That way we can also move the kfree out of
the mutex locked section.

Thanks,

	tglx
Re: [PATCH v3 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> This addresses the review comments of the previous round:
>  - renamed irq_data::status to drv_status
>  - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
>  - fixed signature of get_irq_status (irq is now unsigned int)
>  - converted register_lock into a global one
>  - fixed critical white space breakage (that I just left in to check
>    if anyone is actually reading the code, of course...)

Just for the record, you either missed or introduced some new white
space noise :)
Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> Am 12.12.2010 18:29, Thomas Gleixner wrote:
> > Also we should name it different than status, drv_status perhaps, to
> > avoid confusion with the irq_desc status.
>
> OK, will address both in a succeeding round (just waiting for
> potential further comments).

No further comments from my side ATM.

Thanks,

	tglx
Re: [PATCH v2 2/4] genirq: Inform handler about line sharing state
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> This enables interrupt handlers to retrieve the current line sharing
> state via the new interrupt status word so that they can adapt to it.
>
> The switch from shared to exclusive is generally uncritical and can
> thus be performed on demand. However, preparing a line for shared mode
> may require preparational steps of the currently registered handler.
> It can therefore request an ahead-of-time notification via
> IRQF_ADAPTIVE. The notification consists of an exceptional handler
> invocation with IRQS_MAKE_SHAREABLE set in the status word.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  include/linux/interrupt.h |   10 ++
>  include/linux/irqdesc.h   |    2 ++
>  kernel/irq/irqdesc.c      |    2 ++
>  kernel/irq/manage.c       |   44 +---
>  4 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 16cdbbf..c6323a2 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -55,6 +55,7 @@
>   *		Used by threaded interrupts which need to keep the
>   *		irq line disabled until the threaded handler has been run.
>   * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
> + * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
>   *
>   */
>  #define IRQF_DISABLED		0x0020
> @@ -67,6 +68,7 @@
>  #define IRQF_IRQPOLL		0x1000
>  #define IRQF_ONESHOT		0x2000
>  #define IRQF_NO_SUSPEND		0x4000
> +#define IRQF_ADAPTIVE		0x8000
>
>  #define IRQF_TIMER	(__IRQF_TIMER | IRQF_NO_SUSPEND)
>
> @@ -126,6 +128,14 @@ struct irqaction {
>
>  extern irqreturn_t no_action(int cpl, void *dev_id);
>
> +/*
> + * Driver-readable IRQ line status flags:
> + * IRQS_SHARED - line is shared between multiple handlers
> + * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable
> + */
> +#define IRQS_SHARED		0x0001
> +#define IRQS_MAKE_SHAREABLE	0x0002
> +
>  extern unsigned long get_irq_status(unsigned long irq);
>
>  #ifdef CONFIG_GENERIC_HARDIRQS
> diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
> index 979c68c..c490e83 100644
> --- a/include/linux/irqdesc.h
> +++ b/include/linux/irqdesc.h
> @@ -29,6 +29,7 @@ struct timer_rand_state;
>   * @wait_for_threads:	wait queue for sync_irq to wait for threaded handlers
>   * @dir:		/proc/irq/ procfs entry
>   * @name:		flow handler name for /proc/interrupts output
> + * @register_lock:	protects registration & release, for unshared->shared

I think we can make that a global mutex. request/free_irq are not
hotpath operations which require a mutex per irq descriptor.

Thanks,

	tglx
Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> This associates a status word with every IRQ descriptor. Drivers can
> obtain its content via get_irq_status(irq). First use case will be
> propagating the interrupt sharing state.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  include/linux/interrupt.h |    2 ++
>  include/linux/irq.h       |    2 ++
>  kernel/irq/manage.c       |   15 +++
>  3 files changed, 19 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 79d0c4f..16cdbbf 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -126,6 +126,8 @@ struct irqaction {
>
>  extern irqreturn_t no_action(int cpl, void *dev_id);
>
> +extern unsigned long get_irq_status(unsigned long irq);
> +
>  #ifdef CONFIG_GENERIC_HARDIRQS
>
>  extern int __must_check
>  request_threaded_irq(unsigned int irq, irq_handler_t handler,
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index abde252..5554203 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -96,6 +96,7 @@ struct msi_desc;
>   *			methods, to allow shared chip implementations
>   * @msi_desc:		MSI descriptor
>   * @affinity:		IRQ affinity on SMP
> + * @status:		driver-readable status flags (IRQS_*)
>   *
>   * The fields here need to overlay the ones in irq_desc until we
>   * cleaned up the direct references and switched everything over to
> @@ -108,6 +109,7 @@ struct irq_data {
>  	void			*handler_data;
>  	void			*chip_data;
>  	struct msi_desc		*msi_desc;
> +	unsigned long		status;

That breaks the current code which has irq_data embedded and shadowed
in irq_desc for migratory reasons until all users are fixed up and the
GENERIC_HARDIRQS_NO_DEPRECATED sections are cleaned up. I know it's
ugly, but that was the only way not to rewrite the world in one go. :)

Just move it below affinity. Also we should name it different than
status, drv_status perhaps, to avoid confusion with the irq_desc
status.

>  #ifdef CONFIG_SMP
>  	cpumask_var_t		affinity;
>  #endif

Thanks,

	tglx
Re: [PATCH 0/5] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On Sat, 4 Dec 2010, Jan Kiszka wrote:
> Besides 3 cleanup patches, this series consists of two major changes.
> The first introduces an interrupt sharing notifier to the genirq
> subsystem. It fires when an interrupt line is about to be used by more
> than one driver or the last but one user called free_irq. The second
> major change makes use of this interface in KVM's PCI pass-through
> subsystem. KVM has to keep the interrupt source disabled while calling
> into the guest to handle the event. This can be done at device or line
> level. The former is required to share the interrupt line, the latter
> is an order of magnitude faster (see patch 3 for details).
>
> Beside pass-through support of KVM, further users of the IRQ notifier
> could become VFIO (not yet mainline) and uio_pci_generic which have to
> resolve the same conflict.

Hmm. You basically want to have the following functionality:

If the interrupt is shared, then you want to keep the current
behaviour:

    disable at line level (IRQF_ONESHOT)
    run handler thread (PCI level masking)
    reenable at line level in irq_finalize_oneshot()
    reenable at PCI level when guest is done

If the interrupt is not shared:

    disable at line level (IRQF_ONESHOT)
    run handler thread (no PCI level masking)
    no reenable at line level
    reenable at line level when guest is done

I think the whole notifier approach including the extra irq handlers
plus the requirement to call disable_irq_nosync() from the non shared
handler is overkill. We can be more clever.

The genirq code knows whether you have one or more handlers
registered. So we can add IRQF_ONESHOT_UNMASK_SHARED and add a status
field to irq_data (which I was going to do anyway for other reasons).
In that status field you get a bit which says IRQ_MASK_DEVICE.

So with IRQF_ONESHOT_UNMASK_SHARED == 0 we keep the current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 1 we keep
the current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 0 then
irq_finalize_oneshot() simply marks the interrupt disabled (equivalent
to disable_irq_nosync()) and returns.

Now in your primary irq handler you simply check the IRQ_MASK_DEVICE
status flag and decide whether you need to mask at PCI level or not.
Your threaded handler gets the same information via IRQ_MASK_DEVICE so
it can issue the appropriate user space notification depending on that
flag.

This works fully transparent across adding and removing handlers. On
request_irq/free_irq we update the IRQ_MASK_DEVICE flag with the
following logic:

    nr_actions   IRQF_ONESHOT_UNMASK_SHARED   IRQ_MASK_DEVICE
     1            0                            1
     1            1                            0
    >1            don't care                   1

If interrupts are in flight across request/free then this change takes
effect when the next interrupt comes in.

No notifiers, no disable_irq_nosync() magic, less and simpler code.

Thoughts ?

	tglx
Re: [PATCH 0/5] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
Jan, On Sat, 4 Dec 2010, Jan Kiszka wrote: Am 04.12.2010 11:37, Thomas Gleixner wrote: On Sat, 4 Dec 2010, Jan Kiszka wrote: If interrupt is shared, then you want to keep the current behaviour: disable at line level (IRQF_ONESHOT) run handler thread (PCI level masking) reenable at line level in irq_finalize_oneshot() reenable at PCI level when guest is done If the interrupt is shared, we must mask at PCI level inside the primary handler as we also have to support non-threaded users of the same line. So we always have a transition line-level - device-level masking in a primary handler. Sorry that left out the hard irq part. Of course it needs to do the PCI level masking in the primary one. reduce the latency. So both threaded and non-threaded cases should be addressable by any approach. The oneshot magic should work on non threaded cases as well. Needs some modifications, but nothing complex. If interrupts are in flight accross request/free then this change takes effect when the next interrupt comes in. For registration, that might be too late. We need to synchronize on in-flight interrupts for that line and then ensure that it gets enabled independently of the registered user. That user may have applied outdated information, thus would block the line for too long if user space decides to do something else. No, that's really just a corner case when going from one to two handlers and I don't think it matters much. If you setup a new driver it's not really important whether that first thing comes in a few ms later. Also there is a pretty simple solution for this: The core code knows, that there is an ONESHOT interrupt in flight, so it simply can call the primary handler of that device with the appropriate flag set (maybe an additional one to indicate the transition) and let that deal with it. Needs some thought vs. locking and races, but that shouldn't be hard. 
Pulling the effect of disable_irq_nosync into genirq also for threaded handling would remove the need for re-registering handlers. That's good. But we need to think about the hand-over scenarios again. Maybe solvable, but non-trivial.

See above.

Thanks,

	tglx
Re: [PATCH 0/5] KVM&genirq: Enable adaptive IRQ sharing for passed-through devices
On Sat, 4 Dec 2010, Jan Kiszka wrote: Am 04.12.2010 15:41, Thomas Gleixner wrote:

Also there is a pretty simple solution for this: The core code knows, that there is an ONESHOT interrupt in flight, so it simply can call

It doesn't synchronize the tail part against the masking in the handler(s), that's driver business.

Right, but the core knows from the irq state that the line is masked and due to the ONESHOT or whatever feature it knows that it needs to call the handler. The other way round, shared -> exclusive, does not matter at all.

the primary handler of that device with the appropriate flag set (maybe an additional one to indicate the transition) and let that deal with it. Needs some thought vs. locking and races, but that shouldn't be hard.

Yes, I thought about this kind of transition (re-invoking the existing handler) already. We do need notification of the switch (at least for exclusive -> shared) as only the driver can migrate the masking for in-flight interrupts.

Right. It needs to set the device level mask in that case. As the interrupt handler already has the code to do that it's the most obvious function to call.

Thanks,

	tglx
Re: [patch] x86: kvm: Convert i8254/i8259 locks to raw_spinlocks
On Tue, 23 Feb 2010, Jan Kiszka wrote: Thomas Gleixner wrote:

The i8254/i8259 locks need to be real spinlocks on preempt-rt. Convert them to raw_spinlock. No change for !RT kernels.

Doesn't fly for -rt anymore: pic_irq_update runs under this raw lock and calls kvm_vcpu_kick which tries to wake_up some thread -> scheduling while atomic.

Hmm, a wakeup itself is fine. Is that code waking a wait queue ?

This used to work up to 956f97cf. -rt for 2.6.31 is not yet affected, but 2.6.33 should be broken (haven't checked, using kvm-kmod over 2.6.31 ATM). I can provide a patch that restores the deferred kicking if it's acceptable for upstream.

Well, at least it would be nice to have one for -rt.

Thanks,

	tglx
[patch] x86: kvm: Convert i8254/i8259 locks to raw_spinlocks
The i8254/i8259 locks need to be real spinlocks on preempt-rt. Convert them to raw_spinlock. No change for !RT kernels.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/kvm/i8254.c | 10 +-
 arch/x86/kvm/i8254.h |  2 +-
 arch/x86/kvm/i8259.c | 31 ---
 arch/x86/kvm/irq.h   |  2 +-
 arch/x86/kvm/x86.c   |  8
 5 files changed, 27 insertions(+), 26 deletions(-)

Index: linux-2.6-tip/arch/x86/kvm/i8254.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8254.c
+++ linux-2.6-tip/arch/x86/kvm/i8254.c
@@ -242,11 +242,11 @@ static void kvm_pit_ack_irq(struct kvm_i
 {
 	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
 						 irq_ack_notifier);
-	spin_lock(&ps->inject_lock);
+	raw_spin_lock(&ps->inject_lock);
 	if (atomic_dec_return(&ps->pit_timer.pending) < 0)
 		atomic_inc(&ps->pit_timer.pending);
 	ps->irq_ack = 1;
-	spin_unlock(&ps->inject_lock);
+	raw_spin_unlock(&ps->inject_lock);
 }

 void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
@@ -624,7 +624,7 @@ struct kvm_pit *kvm_create_pit(struct kv
 	mutex_init(&pit->pit_state.lock);
 	mutex_lock(&pit->pit_state.lock);
-	spin_lock_init(&pit->pit_state.inject_lock);
+	raw_spin_lock_init(&pit->pit_state.inject_lock);

 	kvm->arch.vpit = pit;
 	pit->kvm = kvm;
@@ -723,12 +723,12 @@ void kvm_inject_pit_timer_irqs(struct kv
 		/* Try to inject pending interrupts when
 		 * last one has been acked.
 		 */
-		spin_lock(&ps->inject_lock);
+		raw_spin_lock(&ps->inject_lock);
 		if (atomic_read(&ps->pit_timer.pending) && ps->irq_ack) {
 			ps->irq_ack = 0;
 			inject = 1;
 		}
-		spin_unlock(&ps->inject_lock);
+		raw_spin_unlock(&ps->inject_lock);
 		if (inject)
 			__inject_pit_timer_intr(kvm);
 	}

Index: linux-2.6-tip/arch/x86/kvm/i8254.h
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8254.h
+++ linux-2.6-tip/arch/x86/kvm/i8254.h
@@ -27,7 +27,7 @@ struct kvm_kpit_state {
 	u32    speaker_data_on;
 	struct mutex lock;
 	struct kvm_pit *pit;
-	spinlock_t inject_lock;
+	raw_spinlock_t inject_lock;
 	unsigned long irq_ack;
 	struct kvm_irq_ack_notifier irq_ack_notifier;
 };

Index: linux-2.6-tip/arch/x86/kvm/i8259.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kvm/i8259.c
+++ linux-2.6-tip/arch/x86/kvm/i8259.c
@@ -44,18 +44,19 @@ static void pic_clear_isr(struct kvm_kpi
 	 * Other interrupt may be delivered to PIC while lock is dropped but
 	 * it should be safe since PIC state is already updated at this stage.
 	 */
-	spin_unlock(&s->pics_state->lock);
+	raw_spin_unlock(&s->pics_state->lock);
 	kvm_notify_acked_irq(s->pics_state->kvm, SELECT_PIC(irq), irq);
-	spin_lock(&s->pics_state->lock);
+	raw_spin_lock(&s->pics_state->lock);
 }

 void kvm_pic_clear_isr_ack(struct kvm *kvm)
 {
 	struct kvm_pic *s = pic_irqchip(kvm);
-	spin_lock(&s->lock);
+
+	raw_spin_lock(&s->lock);
 	s->pics[0].isr_ack = 0xff;
 	s->pics[1].isr_ack = 0xff;
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);
 }

 /*
@@ -156,9 +157,9 @@ static void pic_update_irq(struct kvm_pi

 void kvm_pic_update_irq(struct kvm_pic *s)
 {
-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	pic_update_irq(s);
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);
 }

 int kvm_pic_set_irq(void *opaque, int irq, int level)
@@ -166,14 +167,14 @@ int kvm_pic_set_irq(void *opaque, int ir
 	struct kvm_pic *s = opaque;
 	int ret = -1;

-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	if (irq >= 0 && irq < PIC_NUM_PINS) {
 		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
 		pic_update_irq(s);
 		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
 				      s->pics[irq >> 3].imr, ret == 0);
 	}
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);

 	return ret;
 }
@@ -203,7 +204,7 @@ int kvm_pic_read_irq(struct kvm *kvm)
 	int irq, irq2, intno;
 	struct kvm_pic *s = pic_irqchip(kvm);

-	spin_lock(&s->lock);
+	raw_spin_lock(&s->lock);
 	irq = pic_get_irq(&s->pics[0]);
 	if (irq >= 0) {
 		pic_intack(&s->pics[0], irq);
@@ -228,7 +229,7 @@ int kvm_pic_read_irq(struct kvm *kvm)
 		intno = s->pics[0].irq_base + irq;
 	}
 	pic_update_irq(s);
-	spin_unlock(&s->lock);
+	raw_spin_unlock(&s->lock);

 	return intno;
 }
@@ -442,7 +443,7 @@ static int picdev_write
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Mon, 30 Nov 2009, Tejun Heo wrote:

Hello,

On 11/28/2009 09:12 PM, Avi Kivity wrote: Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.

It's not a surprise that this breaks the existing code which does the smp function call.

Yes, kvm expects preempt notifiers to be run with irqs enabled. Copying patch author.

Hmmm... then, it's broken both ways. The previous code may get preempted after scheduling but before the notifier is run (which breaks the semantics of the callback horribly), the current code

No, it _CANNOT_ be preempted at that point:

	schedule()
	{
		preempt_disable();
		switch_to();
		preempt_enable();
	}

doesn't satisfy kvm's requirement. Another thing is that in the previous implementation the context is different between the 'in' and 'out' callbacks, which is subtle and nasty. Can kvm be converted to not do smp calls directly?

For the time being, maybe it's best to back out the fix given that the only architecture which may be affected by the original bug is ia64 which is the only one with both kvm and the unlocked context switch.

Do you have a pointer to the original bug report ?

Thanks,

	tglx
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Fri, 27 Nov 2009, Peter Zijlstra wrote: On Fri, 2009-11-27 at 16:03 +0100, Jiri Slaby wrote: On 11/25/2009 01:47 AM, a...@linux-foundation.org wrote: The mm-of-the-moment snapshot 2009-11-24-16-47 has been uploaded to

Hi,

when executing qemu-kvm I often get the following warning and a hard lockup.

WARNING: at kernel/smp.c:292 smp_call_function_single+0xbd/0x140()
Hardware name: To Be Filled By O.E.M.
Modules linked in: kvm_intel kvm fuse ath5k ath
Pid: 3265, comm: qemu-kvm Not tainted 2.6.32-rc8-mm1_64 #912
Call Trace:
 [81039678] warn_slowpath_common+0x78/0xb0
 [a007fd50] ? __vcpu_clear+0x0/0xd0 [kvm_intel]
 [810396bf] warn_slowpath_null+0xf/0x20
 [8106410d] smp_call_function_single+0xbd/0x140
 [a0080af6] vmx_vcpu_load+0x46/0x170 [kvm_intel]
 [a004dd94] kvm_arch_vcpu_load+0x24/0x60 [kvm]
 [a0047a8d] kvm_sched_in+0xd/0x10 [kvm]
 [8102de37] finish_task_switch+0x67/0xc0
 [814699f8] schedule+0x2f8/0x9c0

It is a regression against 2009-11-13-19-59. Any ideas?

Looks like kvm is trying to send an IPI from the preempt notifiers, which are called with IRQs disabled, not a sane thing to do. If they really want that, they'll have to use a pre-allocated struct call_single_data and use __smp_call_function_single.

Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.

It's not a surprise that this breaks the existing code which does the smp function call.
Thanks,

	tglx
Re: WARNING: kernel/smp.c:292 smp_call_function_single [Was: mmotm 2009-11-24-16-47 uploaded]
On Fri, 27 Nov 2009, Thomas Gleixner wrote: On Fri, 27 Nov 2009, Peter Zijlstra wrote: On Fri, 2009-11-27 at 16:03 +0100, Jiri Slaby wrote: On 11/25/2009 01:47 AM, a...@linux-foundation.org wrote: The mm-of-the-moment snapshot 2009-11-24-16-47 has been uploaded to

Hi,

when executing qemu-kvm I often get the following warning and a hard lockup.

WARNING: at kernel/smp.c:292 smp_call_function_single+0xbd/0x140()
Hardware name: To Be Filled By O.E.M.
Modules linked in: kvm_intel kvm fuse ath5k ath
Pid: 3265, comm: qemu-kvm Not tainted 2.6.32-rc8-mm1_64 #912
Call Trace:
 [81039678] warn_slowpath_common+0x78/0xb0
 [a007fd50] ? __vcpu_clear+0x0/0xd0 [kvm_intel]
 [810396bf] warn_slowpath_null+0xf/0x20
 [8106410d] smp_call_function_single+0xbd/0x140
 [a0080af6] vmx_vcpu_load+0x46/0x170 [kvm_intel]
 [a004dd94] kvm_arch_vcpu_load+0x24/0x60 [kvm]
 [a0047a8d] kvm_sched_in+0xd/0x10 [kvm]
 [8102de37] finish_task_switch+0x67/0xc0
 [814699f8] schedule+0x2f8/0x9c0

It is a regression against 2009-11-13-19-59. Any ideas?

Looks like kvm is trying to send an IPI from the preempt notifiers, which are called with IRQs disabled, not a sane thing to do. If they really want that, they'll have to use a pre-allocated struct call_single_data and use __smp_call_function_single.

Hmm, commit 498657a moved the fire_sched_in_preempt_notifiers() call into the irqs disabled section recently.

sched, kvm: Fix race condition involving sched_in_preempt_notifers

In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers.

This is patently wrong btw.

	schedule()
	{
	need_resched:
		preempt_disable();
		task_switch();
		preempt_enable_no_resched();
		if (need_resched())
			goto need_resched;
	}

So move it before finish_arch_switch(). This also makes the in-notifiers symmetric to out-notifiers in terms of locking - now both are called under rq lock.
It's not a surprise that this breaks the existing code which does the smp function call.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Wed, 7 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:

What about getting rid of the retry loop, instead? So something like:

	- run hrtimer callbacks (once)
	- while (tick_program_event(expires))
		expires = ktime_add_ns(expires, dev->min_delta_ns)

This way there's no static tuning involved.

And what does that buy us ? We get a timer interrupt right away, so it's not that much different from the retry loop. See below.

Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out the timer which is going to expire next to reprogram the hardware. Now when we program that expiry time we find out that the timer is already expired. So instead of programming the hardware to fire an interrupt in the very near future - which you would do with your loop above - we stay in the interrupt handler and expire the timer and any other by now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do this by spreading the timer interrupts out so that the system can do something else than expiring timers.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote:

Yesterday I was lucky enough to actually watch what's going on when the delay actually happens. I run a desktop environment on a kvm virtual machine here. The server is on a diskless terminal, and the rest, incl. the window manager etc, is started from a VM. And yesterday, during normal system load (nothing extra, and not idle either, and all the other guests were running under normal load too), I had a stall of everything on this X session for about 2..3, maybe 5 seconds. It felt like a completely stuck machine. Nothing was moving on the screen, no reaction to the keyboard etc. And after several seconds it returned to normal. With the familiar message in dmesg -- increasing hrtimer etc, to the next 50%. (Without a patch from Marcelo at this time it should increase min_delta to a large number).

To summarize: there's something, well, more interesting going on here. In addition to the scheduling issues that cause timers to be calculated on the wrong CPU etc as

Care to elaborate ?

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote: Thomas Gleixner wrote: On Thu, 8 Oct 2009, Michael Tokarev wrote:

Yesterday I was lucky enough to actually watch what's going on when the delay actually happens. I run a desktop environment on a kvm virtual machine here. The server is on a diskless terminal, and the rest, incl. the window manager etc, is started from a VM. And yesterday, during normal system load (nothing extra, and not idle either, and all the other guests were running under normal load too), I had a stall of everything on this X session for about 2..3, maybe 5 seconds. It felt like a completely stuck machine. Nothing was moving on the screen, no reaction to the keyboard etc. And after several seconds it returned to normal. With the familiar message in dmesg -- increasing hrtimer etc, to the next 50%. (Without a patch from Marcelo at this time it should increase min_delta to a large number).

To summarize: there's something, well, more interesting going on here. In addition to the scheduling issues that cause timers to be calculated on the wrong CPU etc as

Care to elaborate ?

Such huge delays (in terms of seconds, not ms or ns) - I don't understand how such delays can be explained by scheduling to the different cpu etc. That's what I mean. I know very little about all this low-level stuff so I may be completely out of context, but such an explanation does not look right to me, simple as that. By scheduling mistakes we can get mistakes in the range of millisecs, but not secs.

I'm really missing the big picture here. What means "causes timers to be calculated on the wrong CPU etc" ? And what do you consider a scheduling mistake ?

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Michael Tokarev wrote: Thomas Gleixner wrote:

I'm really missing the big picture here. What means "causes timers to be calculated on the wrong CPU etc" ? And what do you consider a scheduling mistake ?

From the initial diagnostics by Marcelo:

It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially to virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor). So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between.

From this I conclude that the huge min_delta is due to some other task(s) on the host being run while this guest is in the hrtimer callback. But I fail to see why that process on the host takes SO MUCH time, to warrant a resulting min_delta of 0.5s, or to cause delays of 3..5 seconds in the guest. It's ok to have delays in the range of several extra milliseconds, but *seconds* is too much.

Note again that neither host nor guest is under high load when this jump happens. Also note that there are no high-priority processes running on the host, all are of the same priority level, including all the guests. Note also that so far I only see it on SMP guests, never on UP guests. And only on guests with kvm_clock, not with the acpi_pm clocksource.

What I'm trying to say is that it looks like there's something else wrong here in the guest code. Huge stalls, huge delays while in the hrtimer callback (I think it happens always when such a delay is happening, it's just noticed by the hrtimer code) -- that's the root cause of all this; the (probably) wrong logic in hrtimer calibration just shows the results of something that's wrong elsewhere.

Ah, ok. That makes sense. The hrtimer interrupt hang check detects that the CPU was stolen for whatever reasons.
I'm wondering why this happens several times in a row - it takes at least 4 iterations until it decides to make the interval larger.

Thanks,

	tglx
Re: kvm guest: hrtimer: interrupt too slow
On Thu, 8 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote: On Wed, 7 Oct 2009, Marcelo Tosatti wrote: On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:

What about getting rid of the retry loop, instead? So something like:

	- run hrtimer callbacks (once)
	- while (tick_program_event(expires))
		expires = ktime_add_ns(expires, dev->min_delta_ns)

This way there's no static tuning involved.

And what does that buy us ? We get a timer interrupt right away, so it's not that much different from the retry loop. See below.

Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out the timer which is going to expire next to reprogram the hardware. Now when we program that expiry time we find out that the timer is already expired. So instead of programming the hardware to fire an interrupt in the very near future which you would do with your loop above we stay in the interrupt handler and expire the timer and any other by now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do this by spreading the timer interrupts out so that the system can do something else than expiring timers.

OK, makes sense. So why not program only the next tick using the heuristic, without touching min_delta_ns?

That makes a certain amount of sense, albeit I really hate that heuristics crap, especially when we know that we run as a guest. We better add a function pointer to the clock event device struct which defaults to force it to be slow for real hardware and can be overridden by paravirt guests.

Also it's not clear to me why the problem only happens with kvm_clock and not with acpi_pm timer emulation (according to the reporter) and is restricted to SMP guests.
 retry:
 	/* 5 retries is enough to notice a hang */
-	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+	if (!(++nr_retries % 5)) {
+		ktime_t try_time = ktime_sub(ktime_get(), now);
+
+		do {
+			for (i = 0; i < 3; i++)
+				expires_next = ktime_add(expires_next, try_time);
+		} while (tick_program_event(expires_next, 0));

This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc, ...) where we can find out how often this happens.

Thanks,

	tglx
Re: [patch 0/3] Patches for KVM RT
On Thu, 9 Apr 2009, Carsten Emde wrote:

Jan, Here are some patches that are necessary to get KVM running with the -rt4 patchset. Thanks a lot. Unfortunately, there is still a last one at kernel/smp.c:288

	/* Can deadlock when called with interrupts disabled */
	WARN_ON_ONCE(irqs_disabled() && !oops_in_progress);

Do we get another fix?

I think I have seen that before. Just remembered that I fixed that with Avi last year. Patch got dropped in the 26->29 move.

Thanks,

	tglx

From: Thomas Gleixner <t...@linutronix.de>
Date: Mon, 14 Jan 2008 14:02:44 +0200
Subject: CFS: enable irqs in fire_sched_in_preempt_notifier

KVM expects the notifier call with irqs enabled. It's necessary due to a possible IPI call. Make the preempt-rt version behave the same way as mainline.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 kernel/sched.c | 9 +
 1 file changed, 9 insertions(+)

Index: linux-2.6.24.7-rt27/kernel/sched.c
===================================================================
--- linux-2.6.24.7-rt27.orig/kernel/sched.c	2009-02-08 00:01:16.0 -0500
+++ linux-2.6.24.7-rt27/kernel/sched.c	2009-02-08 00:01:22.0 -0500
@@ -1821,8 +1821,17 @@ static void fire_sched_in_preempt_notifi
 	struct preempt_notifier *notifier;
 	struct hlist_node *node;

+	if (hlist_empty(&curr->preempt_notifiers))
+		return;
+
+	/*
+	 * The KVM sched in notifier expects to be called with
+	 * interrupts enabled.
+	 */
+	local_irq_enable();
 	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
 		notifier->ops->sched_in(notifier, raw_smp_processor_id());
+	local_irq_disable();
 }

 static void
Re: [patch 01/12] expose ACPI pmtimer to userspace (/dev/pmtimer)
On Wed, 4 Jun 2008, Avi Kivity wrote: Anthony Liguori wrote: Thomas Gleixner wrote:

Can we please keep that code inside of drivers/clocksource/acpi_pm.c without creating a new disconnected file in drivers/char ? Btw, depending on the use case we might as well have a sysfs entry for that.

I think sysfs would actually make a lot of sense for this.

It's read many thousands of times per second. You don't want a read()/sprintf()/atoi() sequence every time.

Eek, according to Andrea it's only used for migration purposes.

Thanks,

	tglx
Re: BUG: using smp_processor_id() in preemptible [00000000] code: pm-suspend/17334
On Tue, 3 Jun 2008, Jiri Kosina wrote:

Ahh yes - you are right, I had completely forgotten about that old post - I thought that my posts usually get fixed sooner :) So yes - this is actually the same bug which is still not fixed in the latest kernel - the machine is running a qemu guest (which also seems somehow slower to me now)

OK, so it looks like KVM could be wrongly enabling IRQs/preemption on the resume path. The original bug report is at http://lkml.org/lkml/2008/4/7/130

sysdev_resume() is supposed to run with interrupts disabled, at least it was that way when the timekeeping_resume code was written.

Thanks,

	tglx
Re: [patch 01/12] expose ACPI pmtimer to userspace (/dev/pmtimer)
On Sun, 1 Jun 2008, Marcelo Tosatti wrote: On Sun, Jun 01, 2008 at 06:34:27PM +0200, Thomas Gleixner wrote:

A sysfs entry sounds fine and much simpler.

Should probably be a generic clocksource interface (so userspace can read any available clocksource) rather than acpi_pm specific.

Agreed.

	return clocksource_acpi_pm.read == acpi_pm_read;

So we don't need reliable_pmtimer at all.

For KVM's use case, we'd rather not allow direct pmtimer access if the host has an unreliable (buggy) chipset.

Well,

	return clocksource_acpi_pm.read == acpi_pm_read;

is supposed to do that, just without an additional variable reliable_pmtimer :)

But then, I doubt any of those older affected chipsets have HW virtualization support, so it shouldn't be an issue.

It's exactly one old crappy chipset, which definitely has no HW virt support and therefore we just can use read_pmtmr() w/o checking for reliable or not.

Thanks,

	tglx