Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
 On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
  On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
   Hi:
   
   Thanks for the work and it looks very reasonable, some questions
   below.
 
 Yes I am happy to see the per-cpu work resurrected.
 Some comments below.
Glad to see you have time to review this.

   On 03/23/2012 07:48 AM, Shirley Ma wrote:
Sorry for being late to submit this patch. I have spent lots of time
trying to find the best approach. This effort is still going on...
   
This patch is built against the net-next tree.

This is an experimental RFC patch. Its purpose is to address KVM
networking scalability and the NUMA scheduling issue.
   
   This also needs testing on a non-NUMA machine; I see that you just
   choose the cpu that initiates the work on non-NUMA machines, which
   seems suboptimal.
  
  Good suggestions. I don't have any non-NUMA systems, but KK ran some
  tests on a non-NUMA system and saw around a 20% performance gain for
  a single VM, local host to guest. I hope we can run a full test on a
  non-NUMA system.
  
  On a non-NUMA system, the same per-cpu vhost thread will always be
  picked for a particular vq, since all cores are on the same cpu
  socket. So there will be two per-cpu vhost threads handling TX and RX
  simultaneously.
  
The existing implementation of vhost creates one vhost thread per
device (virtio_net). The RX and TX work of a VM's device is handled by
the same vhost thread.
   
One limitation of this implementation is that as the number of VMs or
the number of virtio-net interfaces increases, more vhost threads are
created. They consume more kernel resources and introduce more thread
context-switch/scheduling overhead. We noticed that KVM network
performance doesn't scale with an increasing number of VMs.
   
The other limitation is that with a single vhost thread processing both
RX and TX, the work can block. So we created this per-cpu vhost thread
implementation. The number of vhost cpu threads is limited to the
number of cpus on the host.
   
To address these limitations, we are proposing a per-cpu vhost thread
model where the number of vhost threads is limited to, and equal to,
the number of online cpus on the host.
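
To make this concrete, here is a minimal sketch of the "one worker per
online cpu" idea (illustrative only, not the code from this patch; the
names are made up and error handling/cpu hotplug are omitted):

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/topology.h>
#include <linux/sched.h>
#include <linux/err.h>

static DEFINE_PER_CPU(struct task_struct *, vhost_worker);

static int vhost_percpu_workers_init(int (*worker_fn)(void *))
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct task_struct *t;

                /* Create the worker on the cpu's NUMA node... */
                t = kthread_create_on_node(worker_fn, NULL,
                                           cpu_to_node(cpu),
                                           "vhost-%d", cpu);
                if (IS_ERR(t))
                        return PTR_ERR(t);
                /* ...and pin it to that cpu. */
                kthread_bind(t, cpu);
                per_cpu(vhost_worker, cpu) = t;
                wake_up_process(t);
        }
        return 0;
}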
   
   The number of vhost threads needs more consideration. Consider a
   1024-core host with a card that has 16 tx/rx queues: do we really
   need 1024 vhost threads?
  
  In this case, we could add a module parameter to limit the number of
  cores/sockets to be used.
 
 Hmm. And then which cores would we run on?
 Also, is the parameter different between guests?
 Another idea is to scale the # of threads on demand.

If we are able to pass the number of guests/vcpus to vhost, we can
scale the vhost threads. Is there any API to get this info?
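
For the module parameter idea mentioned above, a minimal sketch could
look like the following (the parameter name is made up for
illustration; it is not in the patch):

#include <linux/module.h>
#include <linux/cpumask.h>

/* 0 means "one worker per online cpu" (the default behaviour). */
static unsigned int vhost_max_workers;
module_param(vhost_max_workers, uint, 0444);
MODULE_PARM_DESC(vhost_max_workers,
                 "Cap on the number of per-cpu vhost workers (0 = no cap)");

static unsigned int vhost_nr_workers(void)
{
        unsigned int n = num_online_cpus();

        if (vhost_max_workers && vhost_max_workers < n)
                n = vhost_max_workers;
        return n;
}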


 Sharing the same thread between guests is also an
 interesting approach; if we did this then per-cpu
 wouldn't be so expensive, but making this work well
 with cgroups would be a challenge.

Yes, I am now comparing the approach of a vhost thread pool shared
among guests with the per-cpu vhost approach.

Working with cgroups is a challenge either way.

 
   
Based on our testing experience, the vcpus can be scheduled across cpu
sockets even when the number of vcpus is smaller than the number of
cores per cpu socket and there is no other activity besides the KVM
networking workload. We found that if the vhost thread is scheduled on
the same socket where the work is received, the performance is better.
   
So in this per-cpu vhost thread implementation, a vhost thread is
selected dynamically based on where the TX/RX work is initiated. A
vhost thread on the same cpu socket is selected, but not on the same
cpu as the vcpu/interrupt thread that initiated the TX/RX work.
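
A minimal sketch of that selection policy, using the NUMA node as a
proxy for the cpu socket (illustrative only, not the code from the
patch):

#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/cpumask.h>

static int vhost_pick_worker_cpu(void)
{
        int self = raw_smp_processor_id();   /* cpu that kicked the work */
        int node = cpu_to_node(self);
        int cpu;

        /* Prefer another online cpu on the same node, so the vhost
         * worker does not compete with the vcpu/interrupt thread. */
        for_each_cpu(cpu, cpumask_of_node(node)) {
                if (cpu != self && cpu_online(cpu))
                        return cpu;
        }

        /* Single-cpu node: fall back to the kicking cpu itself. */
        return self;
}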
   
While testing this RFC patch, the other interesting thing we found is
that the performance results also seem related to NIC flow steering. We
are spending time evaluating different NICs' flow director
implementations now. We will enhance this patch based on our findings
later.
   
We have tried different vhost scheduling policies: per-device based,
per-vq based, and per work type (tx_kick, rx_kick, tx_net, rx_net)
based. We found that, so far, the per-vq based scheduling is good
enough.
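
As an illustration of what "per vq based" dispatch means here (the
struct and function names are made up, not taken from the patch): each
virtqueue's work item is queued, at kick time, to the per-cpu worker
chosen by the selection sketch above, so TX and RX of the same device
can run on different workers in parallel.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/sched.h>

struct vhost_percpu_work {
        struct list_head node;
        void (*fn)(struct vhost_percpu_work *work);
};

struct vhost_percpu_worker {
        spinlock_t lock;
        struct list_head work_list;
        struct task_struct *task;
};

static DEFINE_PER_CPU(struct vhost_percpu_worker, vhost_workers);

/* Called from the vq kick/notify path: one work item per vq. */
static void vhost_vq_queue_work(struct vhost_percpu_work *work)
{
        int cpu = vhost_pick_worker_cpu();  /* same-socket cpu, see above */
        struct vhost_percpu_worker *w = &per_cpu(vhost_workers, cpu);

        spin_lock(&w->lock);
        list_add_tail(&work->node, &w->work_list);
        spin_unlock(&w->lock);
        wake_up_process(w->task);
}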
   
   Could you please explain more about those scheduling strategies?
   Does per-device based mean a dedicated vhost thread handles all work
   from that vhost device? As you mentioned, maybe an improvement would
   be for the scheduling to take the flow steering info (queue mapping,
   rxhash, etc.) of the skb in the host into account.
  
  Yes, per-device scheduling means one per-cpu vhost thread handles all
  work from one particular vhost device.
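
For contrast, a per-device policy keys the worker choice on the device
rather than on the kicking cpu/vq, so one worker ends up serializing
all of that device's TX and RX. A made-up sketch:

#include <linux/hash.h>
#include <linux/cpumask.h>

static int vhost_pick_worker_cpu_per_dev(const void *vhost_dev)
{
        unsigned int idx = hash_ptr(vhost_dev, 32) % num_online_cpus();
        int cpu;

        /* Walk to the idx-th online cpu (cpu ids may be sparse). */
        for_each_online_cpu(cpu)
                if (idx-- == 0)
                        break;
        return cpu;
}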
  
  Yes, we think 

Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 08:22 -0700, Shirley Ma wrote:

 Haven't had time to focus on single stream result yet. 

I forgot to mention that if I switch the vhost scheduling from per-vq
based to per-device based, this minor single-stream test regression
goes away. However, the improvements in tcp_rr, udp_rr and the other
stream cases will also be gone.

Shirley
