Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 08:22 -0700, Shirley Ma wrote:

> Haven't had time to focus on single stream result yet. 

I forgot to mention that if I switch the vhost scheduling from per-vq
based to per-device based, this minor single-stream test regression goes
away. However, the improvements in TCP_RR, UDP_RR and the other stream
cases are also gone.
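
Roughly, the difference between the two policies is just which worker a
vq's work gets queued on. A minimal sketch of the idea (the structure
and function names below are made up for illustration, not the ones in
the patch):

/* Hypothetical stand-ins for the real vhost structures. */
struct worker;

struct dev_sketch {
	struct worker *worker;		/* per-device: shared by all vqs */
};

struct vq_sketch {
	struct dev_sketch *dev;
	struct worker *pcpu_worker;	/* per-vq: picked where the kick lands */
};

/* Per-device scheduling: TX and RX of a device serialize on one thread. */
static struct worker *pick_per_device(struct vq_sketch *vq)
{
	return vq->dev->worker;
}

/* Per-vq scheduling: TX and RX can land on two different per-cpu threads. */
static struct worker *pick_per_vq(struct vq_sketch *vq)
{
	return vq->pcpu_worker;
}

The extra parallelism in the per-vq case is what appears to help TCP_RR,
UDP_RR and the multi-stream cases, while a single stream sees the small
regression mentioned above.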

Shirley



Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

2012-04-05 Thread Shirley Ma
On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> > On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > Thanks for the work and it looks very reasonable, some questions
> > > below.
> 
> Yes I am happy to see the per-cpu work resurrected.
> Some comments below.
Glad to see you have time to review this.

> > > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > > Sorry for being late to submit this patch. I have spent lots of
> > > > time trying to find the best approach. This effort is still going
> > > > on...
> > > >
> > > > This patch is built against the net-next tree.
> > > >
> > > > This is an experimental RFC patch. The purpose of this patch is to
> > > > address KVM networking scalability and NUMA scheduling issues.
> > > 
> > > This also needs testing on a non-NUMA machine. I see that you just
> > > choose the cpu that initiates the work on a non-NUMA machine, which
> > > seems sub-optimal.
> > 
> > Good suggestion. I don't have any non-NUMA systems, but KK ran some
> > tests on a non-NUMA system. He saw around a 20% performance gain for
> > a single VM, local host to guest. I hope we can run a full test on a
> > non-NUMA system.
> > 
> > On a non-NUMA system, the same per-cpu vhost thread will always be
> > picked for a particular vq, since all cores are on the same cpu
> > socket. So there will be two per-cpu vhost threads handling TX/RX
> > simultaneously.
> > 
> > > > The existing implementation of vhost creates one vhost thread per
> > > > device (virtio_net). The RX and TX work of each VM's device is
> > > > handled by the same vhost thread.
> > > >
> > > > One limitation of this implementation is that, as the number of
> > > > VMs or the number of virtio-net interfaces increases, more vhost
> > > > threads are created. This consumes more kernel resources and adds
> > > > more thread context-switch/scheduling overhead. We noticed that
> > > > KVM network performance doesn't scale with an increasing number of
> > > > VMs.
> > > >
> > > > The other limitation is that with a single vhost thread processing
> > > > both RX and TX, the work can be blocked. So we created this per-cpu
> > > > vhost thread implementation. The number of vhost cpu threads is
> > > > limited to the number of cpus on the host.
> > > >
> > > > To address these limitations, we are proposing a per-cpu vhost
> > > > thread model where the number of vhost threads is limited to, and
> > > > equal to, the number of online cpus on the host.
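
For reference, the per-cpu thread setup amounts to roughly the following
at init time; this is only a sketch, and vhost_percpu_thread() and the
workers[] array are simplified illustrative names, not the exact ones in
the patch:

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *workers[NR_CPUS];

static int vhost_percpu_thread(void *data);	/* worker loop, elided */

static int vhost_create_percpu_workers(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct task_struct *t;

		/* create the thread on the cpu's own NUMA node... */
		t = kthread_create_on_node(vhost_percpu_thread, NULL,
					   cpu_to_node(cpu), "vhost/%d", cpu);
		if (IS_ERR(t))
			return PTR_ERR(t);
		/* ...and pin it to that cpu before it starts running */
		kthread_bind(t, cpu);
		workers[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

Error unwinding of already-created threads is omitted for brevity.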
> > > 
> > > The number of vhost threads needs more consideration. Consider that
> > > we have a 1024-core host with a card that has 16 tx/rx queues; do we
> > > really need 1024 vhost threads?
> > 
> > In this case, we could add a module parameter to limit the number of
> > cores/sockets to be used.
> 
> Hmm. And then which cores would we run on?
> Also, is the parameter different between guests?
> Another idea is to scale the # of threads on demand.

If we are able to pass the number of guests/vcpus to vhost, we can
scale the vhost threads. Is there any API to get this info?
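
As for capping the thread count, the simplest knob would be a module
parameter; a minimal sketch, with a made-up parameter name:

#include <linux/module.h>
#include <linux/moduleparam.h>

/*
 * Hypothetical knob, not part of the posted patch: 0 means one worker
 * per online cpu, anything else caps the number of per-cpu workers.
 */
static unsigned int vhost_max_workers;
module_param(vhost_max_workers, uint, 0444);
MODULE_PARM_DESC(vhost_max_workers,
		 "Limit on per-cpu vhost worker threads (0 = one per online cpu)");

Scaling on demand would instead grow the worker set as guests/vqs are
added, which is why knowing the number of guests/vcpus would help.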


> Sharing the same thread between guests is also an
> interesting approach, if we did this then per-cpu
> won't be so expensive but making this work well
> with cgroups would be a challenge.

Yes, I am now comparing a vhost thread pool shared among guests with the
per-cpu vhost approach.

It's a challenge to work with cgroups either way.

> 
> > > >
> > > > Based on our testing experience, the vcpus can be scheduled across
> > > > cpu sockets even when the number of vcpus is smaller than the
> > > > number of cores per cpu socket and there is no other activity
> > > > besides the KVM networking workload. We found that if the vhost
> > > > thread is scheduled on the same socket where the work is received,
> > > > the performance is better.
> > > >
> > > > So in this per-cpu vhost thread implementation, a vhost thread is
> > > > selected dynamically based on where the TX/RX work is initiated. A
> > > > vhost thread on the same cpu socket is selected, but not on the
> > > > same cpu as the vcpu/interrupt thread that initiated the TX/RX
> > > > work.
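
The selection rule described here (same socket as the initiator, but a
different cpu) can be expressed roughly as below; pick_vhost_cpu() is an
illustrative name, not the function used in the patch:

#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/cpumask.h>

/*
 * Pick the cpu whose per-cpu vhost thread should handle work that was
 * just initiated on the current cpu: prefer another online cpu on the
 * same NUMA node, and fall back to the initiating cpu itself.
 */
static int pick_vhost_cpu(void)
{
	int me = raw_smp_processor_id();
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(cpu_to_node(me))) {
		if (cpu != me && cpu_online(cpu))
			return cpu;
	}
	return me;
}

On a single-socket (non-NUMA) machine all cores are on the same node,
which matches the behaviour described earlier in the thread.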
> > > >
> > > > While testing this RFC patch, the other interesting thing we found
> > > > is that the performance results also seem related to NIC flow
> > > > steering. We are spending time evaluating different NICs' flow
> > > > director implementations now. We will enhance this patch based on
> > > > our findings later.
> > > >
> > > > We have tried different scheduling: per-device based, per-vq based
> > > > and per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> > > > scheduling. We found that, so far, the per-vq based scheduling is
> > > > good enough for now.
> > > 
> > > Could you please explain more about those scheduling strategies?
> > > Does per-device based mean letting a dedicated vhost