PMU in KVM
Hi Gleb, I noticed that arch/x86/kvm/pmu.c is under your maintainership and I have some questions about the PMU in KVM. Thanks in advance if you can spare the time to answer them. 1. How does the PMU interact with Intel VT? For example, I can only find flags in the IA32_PERFEVTSELx MSRs to count in OS and USER mode (ring 0 and the other rings). What happens when I execute VMXON with the PMU enabled? Can I distinguish counts in root mode from counts in non-root mode? I cannot find the related descriptions in the Intel manual. 2. What is the current status of vPMU in KVM? Is it enabled automatically? And how can I use (or enable/disable) it? Thanks, Arthur -- Arthur Chunqi Li Department of Computer Science School of EECS Peking University Beijing, China
Re: PMU in KVM
On Tue, Nov 26, 2013 at 04:02:36PM +0800, Arthur Chunqi Li wrote: Hi Gleb, I noticed that arch/x86/kvm/pmu.c is under your maintainership and I have some questions about the PMU in KVM. Thanks in advance if you can spare the time to answer them. 1. How does the PMU interact with Intel VT? For example, I can only find flags in the IA32_PERFEVTSELx MSRs to count in OS and USER mode (ring 0 and the other rings). What happens when I execute VMXON with the PMU enabled? Can I distinguish counts in root mode from counts in non-root mode? I cannot find the related descriptions in the Intel manual. No, you cannot. You can disable/enable PMU counters before/after vmexit/vmentry and in this way know in what state the event was counted. That is what the PMU emulation does. 2. What is the current status of vPMU in KVM? Is it enabled automatically? And how can I use (or enable/disable) it? If you run with -cpu host on an Intel machine it will be enabled. -- Gleb.
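The hardware itself has no root/non-root filter bit in IA32_PERFEVTSELx; the separation Gleb describes is done in software by switching counters around vmentry/vmexit. On the host side, Linux perf exposes this through the exclude_host/exclude_guest bits of perf_event_attr. The following is a rough sketch, not from this thread and with error handling omitted, of counting cycles on one CPU only while it is in guest (non-root) mode:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                    int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    /* Returns a counter fd that only counts while a guest is running. */
    static int count_guest_cycles_on_cpu(int cpu)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.exclude_host = 1;   /* do not count in VMX root mode */
            attr.exclude_guest = 0;  /* count in non-root (guest) mode */

            return sys_perf_event_open(&attr, -1, cpu, -1, 0);
    }

The counter value can then be read with read(2) on the returned fd; KVM's vPMU emulation relies on the same host perf machinery when it programs counters on behalf of the guest.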
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: Also, there is no guarantee of termination (as long as sptes are deleted with the correct timing). BTW, I can't see any guarantee of termination for rculist nulls either (a writer can race with a lockless reader indefinitely, restarting the lockless walk every time). Hmm, that can be avoided by checking the dirty-bitmap before the rewalk; that means, if the dirty-bitmap has been set during lockless write-protection, it's unnecessary to write-protect its sptes. Your idea? This idea is based on the fact that the number of rmap entries is limited by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap, we can break the rewalk at once; in the case of deleting, we can only rewalk RMAP_RECYCLE_THRESHOLD times. Please explain in more detail. Okay. My proposal is like this:

    pte_list_walk_lockless()
    {
    restart:
    +       if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
    +               return;

            code-doing-lockless-walking;
            ......
    }

Before doing the lockless walking, we check the dirty-bitmap first; if it is set we can simply skip write-protection for the gfn. That is the case where a new spte is being added into the rmap while we access the rmap locklessly. For the case of deleting an spte from the rmap, the number of entries is limited by RMAP_RECYCLE_THRESHOLD, so it is not endless. The point is that the rmap entry that you are inspecting can be constantly deleted and added to the beginning of some other list, so the code that traverses the list will never reach the end. -- Gleb.
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote: [snip complicated stuff about parent_pte] I'm not really following, but note that parent_pte predates EPT (and the use of rcu in kvm), so all the complexity that is the result of trying to pack as many list entries into a cache line can be dropped. Most setups now would have exactly one list entry, which is handled specially anyway. Alternatively, the trick of storing multiple entries in one list entry can be moved to generic code; it may be useful to others. Yes, can the lockless list walking code be transformed into generic singly-linked list walking? Then the correctness can be verified independently, and KVM becomes a simple user of that interface. I'm afraid the single-entry list is not as good as we expected. In my experience, there are too many entries on the rmap, more than 300 sometimes (consider the case of a lib shared by all processes). This is without EPT though, and non-EPT hardware is not a performance king anyway. Nested EPT uses shadow paging too, but VMs hardly share any pages. With KSM they may, though. -- Gleb.
[RFC] create a single workqueue for each vm to update vm irq routing table
Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. Any better ideas? Thanks, Zhang Haoyu
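A minimal sketch of the idea being proposed (this is not Zhang's actual patch; the per-VM workqueue field and all names are illustrative assumptions):

    struct irq_routing_free_work {
            struct work_struct work;
            struct kvm_irq_routing_table *old;
    };

    static void free_old_routing(struct work_struct *work)
    {
            struct irq_routing_free_work *w =
                    container_of(work, struct irq_routing_free_work, work);

            synchronize_rcu();      /* wait for readers of the old table... */
            kfree(w->old);          /* ...then free it, off the vcpu path */
            kfree(w);
    }

    static void defer_routing_free(struct kvm *kvm,
                                   struct kvm_irq_routing_table *old)
    {
            struct irq_routing_free_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

            if (!w) {               /* fall back to waiting synchronously */
                    synchronize_rcu();
                    kfree(old);
                    return;
            }
            w->old = old;
            INIT_WORK(&w->work, free_old_routing);
            queue_work(kvm->irq_routing_wq, &w->work); /* hypothetical per-VM workqueue */
    }

With this, the vcpu thread only publishes the new table (the rcu_assign_pointer inside kvm_irq_routing_update) and returns to the guest; the grace-period wait moves to the workqueue. The rate-limiting concern raised later in the thread applies to this variant as well.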
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote: Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. Why does the vcpu thread ask the hypervisor to update the irq routing table on pcpu migration? In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. Any better ideas? Thanks, Zhang Haoyu -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 02:48:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote: Hi all, When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. Why does the vcpu thread ask the hypervisor to update the irq routing table on pcpu migration? Ah, I misread. The guest sets irq smp_affinity, not the host. -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. Is this about MSI interrupt affinity? IIRC changing INT interrupt affinity should not trigger a kvm_set_irq_routing update. If this is about MSI only, then what about changing userspace to use KVM_SIGNAL_MSI for MSI injection? -- Gleb.
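For reference, KVM_SIGNAL_MSI injects an MSI described entirely by the message address/data supplied by userspace, so the delivery does not go through the irq routing table at all. A rough userspace sketch (error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Inject one MSI into the VM identified by vm_fd. */
    static int inject_msi(int vm_fd, uint64_t msi_addr, uint32_t msi_data)
    {
            struct kvm_msi msi = {
                    .address_lo = (uint32_t)msi_addr,
                    .address_hi = (uint32_t)(msi_addr >> 32),
                    .data       = msi_data,
            };

            /* >0 delivered, 0 blocked by the guest, <0 error */
            return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
    }

The address/data pair can be taken straight from the device's MSI/MSI-X entry, so no routing-table update (and hence no synchronize_rcu) is needed when the guest retargets such an interrupt.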
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 13:56, Gleb Natapov wrote: I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. True, though if I understand Zhanghaoyu's proposal, a workqueue would be even worse. Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 14:18, Avi Kivity wrote: I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? Can this cause an interrupt to be delivered to the wrong (old) vcpu? No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. There is still the problem that Gleb pointed out, though. Paolo The way Linux sets interrupt affinity, it cannot, since changing the affinity is (IIRC) done in the interrupt handler, so the next interrupt cannot be in flight and thus pick up the old interrupt routing table. However it may be vulnerable in other ways.
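A sketch of what the call_rcu() variant being suggested might look like; it assumes struct kvm_irq_routing_table grows a struct rcu_head member (an assumption made for this illustration, not an actual patch):

    static void free_irq_routing_table(struct rcu_head *head)
    {
            /* 'rcu' is the assumed rcu_head member added to the table */
            struct kvm_irq_routing_table *rt =
                    container_of(head, struct kvm_irq_routing_table, rcu);

            kfree(rt);
    }

    int kvm_set_irq_routing_deferred(struct kvm *kvm,
                                     struct kvm_irq_routing_table *new)
    {
            struct kvm_irq_routing_table *old;

            mutex_lock(&kvm->irq_lock);
            old = kvm->irq_routing;
            kvm_irq_routing_update(kvm, new);  /* rcu_assign_pointer inside */
            mutex_unlock(&kvm->irq_lock);

            call_rcu(&old->rcu, free_irq_routing_table);  /* no blocking here */
            return 0;
    }

The update path no longer blocks; the trade-off, discussed in the rest of the thread, is that nothing waits for in-flight readers before the vcpu re-enters the guest, and the deferred free becomes guest-triggerable at a high rate.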
Qmp signal event
Hi, We're connecting to an instance using the QMP socket and listening on the events in order to determine whether the shutdown came from inside the VM or was issued by the host, for example by piping 'system_powerdown'. However, we're unable to distinguish between a host shutdown issued through QMP and a TERM signal issued by kill, for example. Is there a way to determine if KVM received a TERM signal? Is it possible to send an event, just like the ones used for 'SHUTDOWN' and 'POWERDOWN', through QMP? Cheers, Jose -- Jose Antonio Lopes Ganeti Engineering Google Germany GmbH Dienerstr. 12, 80331, München Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg Geschäftsführer: Graham Law, Christine Elizabeth Flores Steuernummer: 48/725/00206 Umsatzsteueridentifikationsnummer: DE813741370
Re: KVM call agenda for 2013-11-26
Juan Quintela quint...@redhat.com wrote: Hi, please send any topic that you are interested in covering. As there are no topics for the agenda, the call is cancelled. Later, Juan. Thanks, Juan. Call details: 10:00 AM to 11:00 AM EDT, every two weeks. If you need phone number details, contact me privately.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). It's another question whether the hardware provides the same guarantee. If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:03 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: On 26/11/2013 15:36, Avi Kivity wrote: No, this would be exactly the same code that is running now:

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_rcu();
    kfree(old);
    return 0;

Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Consider this guest code:

    write msi entry, directing the interrupt away from this vcpu
    nop
    memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may. Now it may be that the guest does not expect this to work (PCI writes are posted; and interrupts can be delayed indefinitely by the pci fabric), but we don't know if there's a path that guarantees the guest something that we're taking away with this change.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:20 PM, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:25, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:28 PM, Paolo Bonzini wrote: On 26/11/2013 16:25, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why?
Re: Elvis upstreaming plan
On Sun, Nov 24, 2013 at 11:22:17AM +0200, Razya Ladelsky wrote: 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch should probably do something portable instead of relying on x86-only rdtscll(). Stefan
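For illustration only (this is not the Elvis code being referenced): the polling-budget style check that typically motivates rdtscll() can be expressed with a portable monotonic kernel clock such as local_clock(), for example:

    #include <linux/sched.h>

    /*
     * Hypothetical helper: returns true once more than budget_ns of polling
     * time has elapsed since start_ns. local_clock() returns monotonic
     * per-CPU nanoseconds and works on all architectures, unlike rdtscll().
     */
    static bool poll_budget_exhausted(u64 start_ns, u64 budget_ns)
    {
            return local_clock() - start_ns > budget_ns;
    }

    /* usage sketch:
     *      u64 start = local_clock();
     *      while (!poll_budget_exhausted(start, budget_ns))
     *              ... keep servicing this virtqueue ...
     */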
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. (BTW, PCI memory writes are posted, but configuration writes are not.) Paolo
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Is this about MSI interrupt affinity? IIRC changing INT interrupt affinity should not trigger a kvm_set_irq_routing update. If this is about MSI only, then what about changing userspace to use KVM_SIGNAL_MSI for MSI injection? -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because the ftrace buffer overflows :) With a bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting, maybe this should be documented ... Ok, thanks. -- error compiling committee.c: too many arguments to function -- Gleb.
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:14:27PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Documentation/RCU/checklist.txt has: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. I just asked Paul what this means. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

-- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. Paolo
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:58:53PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:35, Avi Kivity wrote: If we want to ensure this, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free, but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. (BTW, PCI memory writes are posted, but configuration writes are not.) Paolo You can also do a PCI read and flush out the writes.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:28 PM, Paolo Bonzini wrote: On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. Those are guest operations, which may not be there at all.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 05:28:23PM +0100, Paolo Bonzini wrote: On 26/11/2013 17:24, Gleb Natapov wrote:

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Adding disable/enable IRQs looks like a relatively big change. But perhaps it's not, for some reason I'm missing. You will receive a stale irq even without disable/enable IRQs, of course. I put them there so that the guest would have a chance to do stupid things like zeroing the idt before receiving the interrupt, but on real HW the timing is different from what we emulate, so the same race may happen even without disable/enable IRQs. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:27:47PM +0200, Avi Kivity wrote: On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: On 26/11/2013 16:03, Gleb Natapov wrote: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update.) With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guarantee for other vcpus that do not call synchronize_rcu(). They may still use the outdated routing table while a vcpu or iothread that performed the table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have

    VCPU writes to routing table
                                        e = entry from IRQ routing table
    kvm_irq_routing_update(kvm, new);
    VCPU resumes execution
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now:

    disable_irq()
    VCPU writes to routing table
                                        e = entry from IRQ routing table
                                        kvm_set_msi_irq(e, irq);
                                        kvm_irq_delivery_to_apic_fast();
    kvm_irq_routing_update(kvm, new);
    synchronize_rcu()
    VCPU resumes execution
    enable_irq()
    receive stale irq

Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC). There may still be a race between the pci read and an MSI registered in the IRR. I do not believe such a read can undo IRR changes. -- Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 26/11/2013 17:21, Avi Kivity wrote: It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor comment below: (BTW, PCI memory writes are posted, but configuration writes are not.) MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, a PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is. It's srcu (a separate SRCU instance specific to the irq routing table), which I managed to misspell twice. Actually, it turns out that qrcu actually exists (http://lwn.net/Articles/223752/) and has extremely fast grace periods, but read_lock/read_unlock are also more expensive. So it was probably some kind of Freudian slip. Paolo
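For illustration, a dedicated SRCU domain for the routing table would let the updater wait only for routing-table readers instead of a global RCU grace period. The sketch below is an outline under stated assumptions (the kvm->irq_srcu field and the function names are made up for this example):

    #include <linux/srcu.h>

    /* reader side: the irq injection path */
    static int deliver_under_srcu(struct kvm *kvm, unsigned int gsi)
    {
            struct kvm_irq_routing_table *irq_rt;
            int idx, ret = -1;

            idx = srcu_read_lock(&kvm->irq_srcu);
            irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
            if (gsi < irq_rt->nr_rt_entries)
                    ret = 0;   /* ... walk irq_rt->map[gsi] and deliver ... */
            srcu_read_unlock(&kvm->irq_srcu, idx);
            return ret;
    }

    /* updater side */
    static void routing_table_replace(struct kvm *kvm,
                                      struct kvm_irq_routing_table *new)
    {
            struct kvm_irq_routing_table *old;

            mutex_lock(&kvm->irq_lock);
            old = kvm->irq_routing;
            rcu_assign_pointer(kvm->irq_routing, new);
            mutex_unlock(&kvm->irq_lock);

            /* waits only for readers of this SRCU domain, not for global RCU */
            synchronize_srcu(&kvm->irq_srcu);
            kfree(old);
    }

init_srcu_struct(&kvm->irq_srcu) would be done at VM creation and cleanup_srcu_struct() at teardown; call_srcu() could replace synchronize_srcu() if the updater must not block at all.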
Re: [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote: On 26/11/2013 13:40, Zhanghaoyu (A) wrote: When the guest sets irq smp_affinity, a VMEXIT occurs, then the vcpu thread does an ioctl return to QEMU from the hypervisor, and then the vcpu thread asks the hypervisor to update the irq routing table. In kvm_set_irq_routing, synchronize_rcu is called, and the current vcpu thread is blocked for a long time waiting for the RCU grace period. During this period, this vcpu cannot provide service to the VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced either. That is unacceptable in some real-time scenarios, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously perform the RCU synchronization for the irq routing table, and let the vcpu thread return and VMENTRY to service the VM immediately, with no need to block waiting for the RCU grace period. And I have implemented a raw patch and took a test in our telecom environment; the above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free the old table after releasing kvm->irq_lock. What do you think? It should be rate limited somehow. Since it is guest triggerable, a guest may cause the host to allocate a lot of memory this way. The checks in __call_rcu() should handle this, I think. These keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. Documentation/RCU/checklist.txt has: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. -- Gleb.
Re: [PATCH 0/9] kvm-unit-tests/arm: initial drop
Sorry, just noticed this - you dropped me and the kvmarm list from your reply. On Wed, Nov 20, 2013 at 11:06:11PM +, María Soler Heredia wrote: Andrew Jones drjones at redhat.com writes: This series introduces arm to kvm-unit-tests. To use this you need an arm platform or simulator capable of running kvmarm and a qemu with the mach-virt patches[2], as well as the previously mentioned virtio-testdev. Hello, I have been playing with your tests for a while and I cannot seem to get them to work all right. When I run them disabling kvm on the arm-run script, they do work, but when I run them with kvm enabled they fail. This is my output: ./arm-run arm/boot.flat -smp 1 -m 256 -append 'info 0x1000 0x1000' qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt -cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -m 256 -append info 0x1000 0x1000 kvm [1252]: load/store instruction decoding not implemented error: kvm run failed Function not implemented The above errors come from the kernel and qemu. It's easy to see under what condition you would hit them, but it's not clear to me why that condition is present for you. ./arm-run: line 16: 1251 Aborted $command $@ Return value from qemu: 134 FAIL boot_info ./arm-run arm/boot.flat -smp 1 -append 'vectors' qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt -cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -append vectors kvm [1257]: load/store instruction decoding not implemented error: kvm run failed Function not implemented ./arm-run: line 16: 1256 Aborted $command $@ Return value from qemu: 134 FAIL boot_vectors I am using FastModels Model Debugger version 8.2.028, with a model of this characteristics: Model: -- Model name: ARM_Cortex-A15 Instance: cluster.cpu0 Using CADI 2.0 interface revision 0. Version: 8.2.72 Generated by Core Generator: No Needs SimGen License: No So far I've only tested on real hardware. So this could be the difference. running the latest stable linux release and qemu-devel's latest qemu with the patches indicated here https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02428.html plus [1] http://lists.nongnu.org/archive/html/qemu-devel/2013-10/msg01815.html I tested the instalation by running a linux with the same setup using this call: qemu-system-arm \ -display none \ -enable-kvm \ -kernel zImage\ -m 128 -M virt -cpu cortex-a15 \ -drive if=none,file=linux.img,id=fs \ -device virtio-blk-device,drive=fs As I said, the tests pass if the kvm is not enabled and fail otherwise. I have added a few printfs for debugging and I can tell that the code in boot.c runs ok, but then when virtio_testdev is called (from virtio_testdev_exit) the execution throws an exception (more specifically the line *tdp++ = cpu_to_le32(va_arg(va, unsigned)); inside the first while. Hmm, even more confusing, as this isn't the first mmio access. I am not used to sending emails to this kind of list, so I don't know if I am being too specific, too little or maybe not even giving the right information. Please tell me what else you need and if you can help me solve this problem. Your details are good, but instead of just stating 'latest' for your kernel and qemu versions, please give the exact version numbers. I've been busy with other things lately, but I'm due to post a v2 of this series. I should be able to finish that off this week. When I do, I'll see if I can test it over FastModel as well this time. Thanks for starting to poke at this! 
drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
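[Editorial note] On the "load/store instruction decoding not implemented" error in the message above: one plausible cause (an assumption, not confirmed in this thread) is an MMIO store whose instruction carries no decode syndrome, for example a store with register writeback that a compiler may emit for an expression like "*tdp++ = ...", which KVM/ARM refuses to emulate. A hedged sketch of the usual defence, funnelling MMIO through an accessor that forces a decodable plain store, is below; names are invented for illustration.

    #include <stdint.h>

    /* Force a plain "str Rt, [Rn]" (no writeback), which KVM/ARM can decode. */
    static inline void mmio_write32(volatile void *addr, uint32_t val)
    {
            asm volatile("str %0, [%1]" : : "r" (val), "r" (addr) : "memory");
    }

    static void testdev_send(volatile uint32_t *testdev_base,
                             const uint32_t *args, int nargs)
    {
            int i;

            /* index explicitly instead of post-incrementing the MMIO pointer */
            for (i = 0; i < nargs; i++)
                    mmio_write32((volatile uint32_t *)testdev_base + i, args[i]);
    }

Whether the model versus real hardware takes a different path here is still an open question in the thread; this only illustrates the class of instruction the kernel message complains about.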
Re: Elvis upstreaming plan
Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. We've seen very positive results from adding threads. We should also look at scheduling. Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1. Shared vhost thread between multiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices. We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will serve multiple virtio devices using a single vhost thread only if the devices belong to the same VM. This series of patches will not allow two different VMs to share the same vhost thread. So, I don't see why this will be throwing away isolation and why this could be an exceptionally bad idea. By the way, I remember that during the KVM forum a similar approach of having a single data plane thread for many devices was discussed. We've seen very positive results from adding threads. We should also look at scheduling. ...and we have also seen exceptionally negative results from adding threads, both for vhost and data-plane. If you have a lot of idle time/cores then it makes sense to run multiple threads. But IMHO in many scenarios you don't have a lot of idle time/cores... and if you have them you would probably prefer to run more VMs/VCPUs. Hosting a single SMP VM when you have enough physical cores to run all the VCPU threads and the I/O threads is not a realistic scenario. That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. 
A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/ 26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. :) Regards, Abel. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/ f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote: On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because ftrace buffer overflows :) With bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting maybe this should be documented ... It would be more accurate to say that it takes some measures to limit the damage -- you can overwhelm these measures if you try hard enough. And I guess I could say something to that effect. ;-) Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
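[Editorial note] The stress test Gleb and Michael describe above (hammering the APIC ID register so the host keeps rebuilding and kfree_rcu()-ing its apic map) could look roughly like the kvm-unit-tests sketch below. The helper names and headers (apic_write()/apic_read(), APIC_ID, libcflat.h) are assumed to match the kvm-unit-tests x86 library; treat this as an illustration, not the test Michael actually ran.

    #include "libcflat.h"
    #include "apic.h"

    int main(void)
    {
            unsigned long i;
            u32 orig = apic_read(APIC_ID);

            /* Each write should force the host to rebuild its apic map. */
            for (i = 0; i < (1ul << 24); i++)
                    apic_write(APIC_ID, orig ^ ((u32)(i & 1) << 24));

            apic_write(APIC_ID, orig);
            printf("done, %lu APIC ID writes\n", i);
            return 0;
    }

While this spins in the guest, host memory usage can be watched to confirm that __call_rcu()'s batching keeps the queued 2K objects bounded.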
Re: Elvis upstreaming plan
On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote: Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/ 3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will serve multiple virtio devices using a single vhost thread only if the devices belong to the same VM. This series of patches will not allow two different VMs to share the same vhost thread. So, I don't see why this will be throwing away isolation and why this could be a exceptionally bad idea. By the way, I remember that during the KVM forum a similar approach of having a single data plane thread for many devices was discussed We've seen very positive results from adding threads. We should also look at scheduling. ...and we have also seen exceptionally negative results from adding threads, both for vhost and data-plane. If you have lot of idle time/cores then it makes sense to run multiple threads. But IMHO in many scenarios you don't have lot of idle time/cores.. and if you have them you would probably prefer to run more VMs/VCPUshosting a single SMP VM when you have enough physical cores to run all the VCPU threads and the I/O threads is not a realistic scenario. That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. I guess a question then becomes why have multiple devices? 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. 
A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/ 26616133fafb7855cc80fac070b0572fd1aaf5d0 Ack on this. :) Regards, Abel. Regards, Anthony Liguori 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/
Re: Elvis upstreaming plan
Razya Ladelsky ra...@il.ibm.com writes: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) Does the sysfs interface aim to let the _user_ control the maximum number of devices per vhost thread or/and let the user create and destroy worker threads at will ? Setting the limit on the number of devices makes sense but I am not sure if there is any reason to actually expose an interface to create or destroy workers. Also, it might be worthwhile to think if it's better to just let the worker thread stay around (hoping it might be used again in the future) rather then destroying it.. I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? I am actually inclined more towards a static limit. I think that in a typical setup, the user will set this for his/her environment just once at load time and forget about it. Bandan 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. 
Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
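[Editorial note] The "simple static parameter" option Bandan leans towards above could be as small as the sketch below: a read-only module parameter that caps how many virtio devices one vhost worker may serve, consulted when a device is attached. The names are invented for illustration and this is not the actual Elvis patch.

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static unsigned int devs_per_worker = 1;    /* 1 == today's thread-per-device */
    module_param(devs_per_worker, uint, 0444);
    MODULE_PARM_DESC(devs_per_worker,
                     "Maximum virtio devices served by one vhost worker (same VM only)");

    struct demo_vhost_worker {
            struct task_struct *task;
            unsigned int ndevs;
            /* ... work list, lock ... */
    };

    /* Pick an existing worker of this VM with a free slot, else ask for a new one. */
    static struct demo_vhost_worker *
    demo_pick_worker(struct demo_vhost_worker **workers, int nworkers)
    {
            int i;

            for (i = 0; i < nworkers; i++)
                    if (workers[i]->ndevs < devs_per_worker)
                            return workers[i];
            return NULL;        /* caller creates a new worker thread */
    }

A sysfs knob could later replace or complement the parameter without changing this attach-time decision.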
Re: [PATCHv2] KVM: optimize apic interrupt delivery
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote: On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote: On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote: On 09/12/2012 03:34 PM, Gleb Natapov wrote: On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote: On 09/12/2012 04:03 AM, Paul E. McKenney wrote: Paul, I'd like to check something with you here: this function can be triggered by userspace, any number of times; we allocate a 2K chunk of memory that is later freed by kfree_rcu. Is there a risk of DOS if RCU is delayed while lots of memory is queued up in this way? If yes, is this a generic problem with kfree_rcu that should be addressed in the core kernel? There is indeed a risk. In our case it's a 2K object. Is it a practical risk? How many kfree_rcu()s per second can a given user cause to happen? Not much more than a few hundred thousand per second per process (normal operation is zero). I managed to do 21466 per second. Strange, why so slow? Because ftrace buffer overflows :) With bigger buffer I get 169940. Ah, good, should not be a problem. In contrast, if you ran kfree_rcu() in a tight loop, you could probably do in excess of 100M per CPU per second. Now -that- might be a problem. Well, it -might- be a problem if you somehow figured out how to allocate memory that quickly in a steady-state manner. ;-) Good idea. Michael, it should be easy to modify kvm-unit-tests to write to the APIC ID register in a loop. I did. Memory consumption does not grow on an otherwise idle host. Very good -- the checks in __call_rcu(), which is common code invoked by kfree_rcu(), seem to be doing their job, then. These do keep a per-CPU counter, which can be adjusted via rcutree.blimit, which defaults to taking evasive action if more than 10K callbacks are waiting on a given CPU. My concern was that you might be overrunning that limit in way less than a grace period (as in about a hundred microseconds). My concern was of course unfounded -- it takes several grace periods to push 10K callbacks through. Thanx, Paul Gleb noted that Documentation/RCU/checklist.txt has this text: An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods are delayed for whatever reason, then the synchronize_rcu() primitive will correspondingly delay updates. In contrast, code using call_rcu() should explicitly limit update rate in cases where grace periods are delayed, as failing to do so can result in excessive realtime latencies or even OOM conditions. If call_rcu is self-limiting maybe this should be documented ... The documentation should be fixed, rather, to not mention that call_rcu() must be rate-limited by the user. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote: On 11/25/2013 10:23 PM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote: On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote: snip complicated stuff about parent_pte I'm not really following, but note that parent_pte predates EPT (and the use of rcu in kvm), so all the complexity that is the result of trying to pack as many list entries into a cache line can be dropped. Most setups now would have exactly one list entry, which is handled specially antyway. Alternatively, the trick of storing multiple entries in one list entry can be moved to generic code, it may be useful to others. Yes, can the lockless list walking code be transformed into generic single-linked list walking? So the correctness can be verified independently, and KVM becomes a simple user of that interface. I'am afraid the signle-entry list is not so good as we expected. In my experience, there're too many entries on rmap, more than 300 sometimes. (consider a case that a lib shared by all processes). single linked list was about moving singly-linked lockless walking to generic code. http://www.spinics.net/lists/linux-usb/msg39643.html http://marc.info/?l=linux-kernelm=103305635013575w=3 The simpler version is to maintain lockless walk on depth-1 rmap entries (and grab the lock once depth-2 entry is found). I still think rmap-lockless is more graceful: soft mmu can get benefit from it also it is promising to be used in some mmu-notify functions. :) OK. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
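[Editorial note] For readers unfamiliar with the "generic single-linked lockless walking" being discussed, the sketch below shows the standard hlist_nulls pattern from Documentation/RCU/rculist_nulls.txt that the rmap code is modelled on: every bucket ends in a distinct "nulls" marker, so a lockless reader that got carried onto another list by a concurrent delete-and-reinsert sees the wrong marker and restarts. Names with demo_ are invented; only the pattern is real.

    #include <linux/rculist_nulls.h>
    #include <linux/rcupdate.h>

    struct demo_entry {
            struct hlist_nulls_node node;
            unsigned long key;
    };

    static struct demo_entry *demo_lookup(struct hlist_nulls_head *head,
                                          unsigned long key,
                                          unsigned long bucket_id)
    {
            struct demo_entry *e;
            struct hlist_nulls_node *n;

    restart:
            rcu_read_lock();
            hlist_nulls_for_each_entry_rcu(e, n, head, node) {
                    if (e->key == key) {
                            rcu_read_unlock();
                            return e;       /* caller revalidates under the lock */
                    }
            }
            /*
             * Wrong terminating nulls value: a deleted entry moved us onto a
             * different bucket, so restart.  Note the thread's point: there is
             * no termination guarantee if a writer keeps forcing restarts.
             */
            if (get_nulls_value(n) != bucket_id) {
                    rcu_read_unlock();
                    goto restart;
            }
            rcu_read_unlock();
            return NULL;
    }

Making this walk generic, as Marcelo suggests, would let its correctness be argued once instead of inside the KVM rmap code.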
Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote: On 11/26/2013 02:12 AM, Marcelo Tosatti wrote: On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote: Also, there is no guarantee of termination (as long as sptes are deleted with the correct timing). BTW, can't see any guarantee of termination for rculist nulls either (a writer can race with a lockless reader indefinitely, restarting the lockless walk every time). Hmm, that can be avoided by checking the dirty-bitmap before rewalk, that means, if the dirty-bitmap has been set during lockless write-protection, it's unnecessary to write-protect its sptes. Your idea? This idea is based on the fact that the number of rmap entries is limited by RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap, we can break the rewalk at once; in the case of deleting, we can only rewalk RMAP_RECYCLE_THRESHOLD times. Please explain in more detail. Okay. My proposal is like this: pte_list_walk_lockless() { restart: + if (__test_bit(slot->arch.dirty_bitmap, gfn-index)) + return; code-doing-lockless-walking; .. } Before doing the lockless walking, we check the dirty-bitmap first; if it is set we can simply skip write-protection for the gfn, that is the case that a new spte is being added into the rmap while we access the rmap locklessly. The dirty bit could be set after the check. For the case of deleting an spte from the rmap, the number of entries is limited by RMAP_RECYCLE_THRESHOLD, so that is not endless. It can shrink and grow while the lockless walk is performed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
On 11/24/2013 05:22 PM, Razya Ladelsky wrote: Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. 2. Sysfs mechanism to add and remove vhost threads This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented) Any chance we can re-use the cwmq instead of inventing another mechanism? Looks like there're lots of function duplication here. Bandan has an RFC to do this. I'd like to ask for advice here about the more preferable way to go: Although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think? 3.Add virtqueue polling mode to vhost Have the vhost thread poll the virtqueues with high I/O rate for new buffers , and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0 Maybe we can make poll_stop_idle adaptive which may help the light load case. Consider guest is often slow than vhost, if we just have one or two vms, polling too much may waste cpu in this case. 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. 5. Add heuristics to improve I/O scheduling This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time , and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission. 
Any other feedback on this plan will be appreciated, Thank you, Razya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Elvis upstreaming plan
On Wed, Nov 27, 2013 at 10:49:20AM +0800, Jason Wang wrote: 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. Definitely. kvm_stats moved to ftrace a long time ago. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
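[Editorial note] The tracepoint alternative Jason and Gleb suggest above could be declared roughly as below, in the usual trace header pattern; the event and field names (vhost_vq_poll and friends) are invented for illustration, not an existing vhost tracepoint. Instead of debugfs counters, the worker emits one event per polling pass and ftrace/perf aggregate the data.

    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM vhost

    #if !defined(_TRACE_VHOST_DEMO_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_VHOST_DEMO_H

    #include <linux/tracepoint.h>

    TRACE_EVENT(vhost_vq_poll,
            TP_PROTO(unsigned int vq_id, unsigned int reqs_handled, bool kicked),
            TP_ARGS(vq_id, reqs_handled, kicked),
            TP_STRUCT__entry(
                    __field(unsigned int, vq_id)
                    __field(unsigned int, reqs_handled)
                    __field(bool, kicked)
            ),
            TP_fast_assign(
                    __entry->vq_id = vq_id;
                    __entry->reqs_handled = reqs_handled;
                    __entry->kicked = kicked;
            ),
            TP_printk("vq=%u reqs=%u kicked=%d",
                      __entry->vq_id, __entry->reqs_handled, __entry->kicked)
    );

    #endif /* _TRACE_VHOST_DEMO_H */

    /* This part must be outside protection */
    #include <trace/define_trace.h>

The worker would then call trace_vhost_vq_poll(vq_id, n, kicked) in its loop, and the data shows up under /sys/kernel/debug/tracing without any custom vhost_stat script.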
Re: Elvis upstreaming plan
Hi, Razya is out for a few days, so I will try to answer the questions as well as I can: Michael S. Tsirkin m...@redhat.com wrote on 26/11/2013 11:11:57 PM: From: Michael S. Tsirkin m...@redhat.com To: Abel Gordon/Haifa/IBM@IBMIL, Cc: Anthony Liguori anth...@codemonkey.ws, abel.gor...@gmail.com, as...@redhat.com, digitale...@google.com, Eran Raichstein/Haifa/IBM@IBMIL, g...@redhat.com, jasow...@redhat.com, Joel Nider/Haifa/IBM@IBMIL, kvm@vger.kernel.org, pbonz...@redhat.com, Razya Ladelsky/Haifa/IBM@IBMIL Date: 27/11/2013 01:08 AM Subject: Re: Elvis upstreaming plan On Tue, Nov 26, 2013 at 08:53:47PM +0200, Abel Gordon wrote: Anthony Liguori anth...@codemonkey.ws wrote on 26/11/2013 08:05:00 PM: Razya Ladelsky ra...@il.ibm.com writes: [...] That's why we are proposing to implement a mechanism that will enable the management stack to configure 1 thread per I/O device (as it is today) or 1 thread for many I/O devices (belonging to the same VM). Once you are scheduling multiple guests in a single vhost device, you now create a whole new class of DoS attacks in the best case scenario. Again, we are NOT proposing to schedule multiple guests in a single vhost thread. We are proposing to schedule multiple devices belonging to the same guest in a single (or multiple) vhost thread/s. I guess a question then becomes why have multiple devices? If you mean why serve multiple devices from a single thread, the answer is that we cannot rely on the Linux scheduler, which has no knowledge of I/O queues, to do a decent job of scheduling I/O. The idea is to take over the I/O scheduling responsibilities from the kernel's thread scheduler with a more efficient I/O scheduler inside each vhost thread. By combining all of the I/O devices from the same guest (disks, network cards, etc.) in a single I/O thread, we can provide better scheduling by giving us more knowledge of the nature of the work. So now, instead of relying on the Linux scheduler to perform context switches between multiple vhost threads, we have a single thread context in which we can do the I/O scheduling more efficiently. We can closely monitor the performance needs of each queue of each device inside the vhost thread, which gives us much more information than relying on the kernel's thread scheduler. This does not expose any additional opportunities for attacks (DoS or other) than are already available, since all of the I/O traffic belongs to a single guest. You can make the argument that with low I/O loads this mechanism may not make much difference. However, when you try to maximize the utilization of your hardware (such as in a commercial scenario) this technique can gain you a large benefit. Regards, Joel Nider Virtualization Research IBM Research and Development Haifa Research Lab Phone: 972-4-829-6326 | Mobile: 972-54-3155635 E-mail: jo...@il.ibm.com Hi all, I am Razya Ladelsky, I work at IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. 
Our plan for the first patches is the following: 1.Shared vhost thread between mutiple devices This patch creates a worker thread and worker queue shared across multiple virtio devices We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/ 3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM as Paolo suggested to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next. I think this is an exceptionally bad idea. We shouldn't throw away isolation without exhausting every other possibility. Seems you have missed the important details here. Anthony, we are aware you are concerned about isolation and you believe we should not share a single vhost thread across multiple VMs. That's why Razya proposed to change the patch so we will
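[Editorial note] The single-worker I/O scheduling Joel describes above boils down to a loop like the hedged sketch below: one kernel thread round-robins over every virtqueue of one VM, giving each a bounded slice and backing off when everything is idle. All names (demo_*) are invented for illustration; this is not the Elvis code.

    #include <linux/kthread.h>
    #include <linux/sched.h>

    struct demo_vq {
            bool polled;            /* high I/O rate: poll instead of waiting for kicks */
            /* ... ring state ... */
    };

    struct demo_worker {
            struct demo_vq **vqs;   /* all queues of all devices of one VM */
            int nvqs;
    };

    /* Process up to a budget of requests from this queue; returns 0 if idle. */
    static int demo_service_vq(struct demo_vq *vq)
    {
            return 0;               /* placeholder for the per-queue work */
    }

    static int demo_worker_fn(void *data)
    {
            struct demo_worker *w = data;
            int i, handled;

            while (!kthread_should_stop()) {
                    handled = 0;
                    /* round-robin: give each queue a bounded slice, then move on */
                    for (i = 0; i < w->nvqs; i++)
                            handled += demo_service_vq(w->vqs[i]);
                    if (!handled)
                            schedule_timeout_interruptible(1);  /* nothing to do, back off */
            }
            return 0;
    }

The heuristics in patch 5 of the plan would replace the fixed round-robin order and the per-queue budget with smarter decisions about when to leave a queue.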
Re: Elvis upstreaming plan
Gleb Natapov g...@redhat.com wrote on 27/11/2013 09:35:01 AM: From: Gleb Natapov g...@redhat.com To: Jason Wang jasow...@redhat.com, Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, anth...@codemonkey.ws, Michael S. Tsirkin m...@redhat.com, pbonz...@redhat.com, as...@redhat.com, digitale...@google.com, abel.gor...@gmail.com, Abel Gordon/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, b...@redhat.com Date: 27/11/2013 11:35 AM Subject: Re: Elvis upstreaming plan On Wed, Nov 27, 2013 at 10:49:20AM +0800, Jason Wang wrote: 4. vhost statistics This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats) https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0 How about using trace points instead? Besides statistics, it can also help more in debugging. Definitely. kvm_stats moved to ftrace a long time ago. -- Gleb. Ok - we will look at this newer mechanism. Joel Nider Virtualization Research IBM Research and Development Haifa Research Lab Phone: 972-4-829-6326 | Mobile: 972-54-3155635 E-mail: jo...@il.ibm.com
[Bug 65941] New: KVM Guest Solaris 10/11 - a few time in an hour time jumps for a while to 1.jan 1970
https://bugzilla.kernel.org/show_bug.cgi?id=65941 Bug ID: 65941 Summary: KVM Guest Solaris 10/11 - a few time in an hour time jumps for a while to 1.jan 1970 Product: Virtualization Version: unspecified Kernel Version: 2.6.32-431.el6.x86_64 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: high Priority: P1 Component: kvm Assignee: virtualization_...@kernel-bugs.osdl.org Reporter: s...@kosecky.eu Regression: No KVM Host: HW: HP Proliant DL360 - 2x Intel(R) Xeon(R) CPU X5570 @ 2.93GHz OS: RedHat ELS 6.5 Virtualization: # /usr/libexec/qemu-kvm --version QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2), Copyright (c) 2003-2008 Fabrice Bellard # KVM Guest: OS: Solaris 10, Solaris 11 64bit Startup command: /usr/libexec/qemu-kvm -name p1gdev -S -M rhel6.4.0 -enable-kvm -m 16384 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid fb28b784-0b1b-692b-92e8-d8b469bbb4e7 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/p1gdev.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/kvm_image/p1gdev.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=23,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:34:0e:f2,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 How to reproduce problem: - reboot Host HW and boot solaris guest - approximately after 24 hours guest will notice a few times in hour that time jumps to 1.jan 1970 for a few seconds Host and guests were updated to latest patchlevel but nothing changed. -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html