On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
Hi:
Thanks for the work; it looks very reasonable. Some questions below.
Yes, I am happy to see the per-cpu work resurrected. Some comments below.
Glad to see you have time to review this.
On 03/23/2012 07:48 AM, Shirley Ma wrote:
Sorry for being late to submit this patch. I have spent lots of time trying to find the best approach. This effort is still going on...

This patch is built against the net-next tree.

This is an experimental RFC patch. The purpose of this patch is to address the KVM networking scalability and NUMA scheduling issues.
This also needs testing on a non-NUMA machine. I see that you just choose the cpu that initiates the work on non-NUMA machines, which seems suboptimal.
Good suggestions. I don't have any non-NUMA systems, but KK ran some tests on a non-NUMA system and saw around a 20% performance gain for a single VM's local host-to-guest traffic. I hope we can run a full test on a non-NUMA system.
On a non-NUMA system, the same per-cpu vhost thread will always be picked consistently for a particular vq, since all cores are on the same cpu socket. So there will be two per-cpu vhost threads handling TX and RX simultaneously.
The existing implementation of vhost creates one vhost thread per device (virtio_net). RX and TX work for a VM's device is handled by the same vhost thread (roughly sketched below).
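Roughly, that model looks like this today (a simplified paraphrase of drivers/vhost/vhost.c; error handling trimmed):

	#include <linux/err.h>
	#include <linux/kthread.h>
	#include <linux/sched.h>
	#include "vhost.h"	/* struct vhost_dev, vhost_worker() */

	/* Simplified: VHOST_SET_OWNER creates one worker kthread per
	 * vhost device, and both the TX and RX vqs of that device
	 * queue all their work to this single thread. */
	static long vhost_dev_set_owner(struct vhost_dev *dev)
	{
		dev->worker = kthread_create(vhost_worker, dev,
					     "vhost-%d", current->pid);
		if (IS_ERR(dev->worker))
			return PTR_ERR(dev->worker);
		wake_up_process(dev->worker);
		return 0;
	}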
One limitation of this implementation is that as the number of VMs or virtio-net interfaces grows, more vhost threads are created; they consume more kernel resources and induce more thread context-switch/scheduling overhead. We noticed that KVM network performance doesn't scale with an increasing number of VMs.
The other limitation is that a single vhost thread processes both RX and TX, so one kind of work can block the other. Hence this per-cpu vhost thread implementation, in which the number of vhost threads is limited to the number of cpus on the host.
To address these limitations, we are proposing a per-cpu vhost thread model where the number of vhost threads is limited and equal to the number of online cpus on the host.
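A minimal sketch of the thread setup (naming is mine, not the patch itself): create and pin one worker per online cpu at module init, instead of one per device at VHOST_SET_OWNER time.

	#include <linux/cpu.h>
	#include <linux/err.h>
	#include <linux/kthread.h>
	#include <linux/sched.h>

	static struct task_struct *vhost_workers[NR_CPUS];

	static int vhost_percpu_worker(void *unused)
	{
		/* Hypothetical work loop: the real patch would pull
		 * queued vq work items here and run them. */
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
		}
		__set_current_state(TASK_RUNNING);
		return 0;
	}

	static int __init vhost_percpu_init(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct task_struct *t;

			t = kthread_create(vhost_percpu_worker, NULL,
					   "vhost-%d", cpu);
			if (IS_ERR(t))
				return PTR_ERR(t);
			kthread_bind(t, cpu);	/* pin worker to its cpu */
			vhost_workers[cpu] = t;
			wake_up_process(t);
		}
		return 0;
	}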
The number of vhost threads needs more consideration. Consider a 1024-core host with a card that has 16 tx/rx queues: do we really need 1024 vhost threads?
In this case, we could add a module parameter to limit the number of cores/sockets to be used.
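Something like this hypothetical knob (naming is mine):

	#include <linux/moduleparam.h>

	/* Hypothetical: cap the number of per-cpu vhost workers.
	 * 0 means one worker per online cpu. */
	static int max_vhost_threads;
	module_param(max_vhost_threads, int, 0444);
	MODULE_PARM_DESC(max_vhost_threads,
			 "Upper bound on per-cpu vhost threads (0 = nr online cpus)");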
Hmm. And then which cores would we run on?
Also, is the parameter different between guests?
Another idea is to scale the # of threads on demand.
If we are able to pass the number of guests/vcpus to vhost, we can scale the vhost threads. Any API to get this info?
Sharing the same thread between guests is also an interesting approach; if we did this then per-cpu wouldn't be so expensive, but making this work well with cgroups would be a challenge.
Yes, I am now comparing the shared-thread-pool-among-guests approach with the per-cpu vhost approach. It's a challenge to work with cgroups either way.
Based on our testing experience, vcpus can be scheduled across cpu sockets even when the number of vcpus is smaller than the number of cores per socket and there is no other activity besides the KVM networking workload. We found that if the vhost thread is scheduled on the same socket where the work is received, performance is better.
So in this per-cpu vhost thread implementation, a vhost thread is selected dynamically based on where the TX/RX work is initiated: a vhost thread on the same cpu socket is selected, but not on the same cpu as the vcpu/interrupt thread that initiated the TX/RX work.
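A rough sketch of that selection policy using the standard topology helpers (the function name and fallback are mine, not the patch's):

	#include <linux/cpumask.h>
	#include <linux/smp.h>
	#include <linux/topology.h>

	/* Called from the vq kick path with preemption disabled:
	 * pick a cpu that shares the socket with the initiating
	 * vcpu/interrupt thread but is not the initiating cpu. */
	static int vhost_pick_cpu(void)
	{
		int me = smp_processor_id();
		int cpu;

		for_each_cpu(cpu, topology_core_cpumask(me))
			if (cpu_online(cpu) && cpu != me)
				return cpu;	/* same socket, different cpu */

		return me;	/* no other cpu on this socket: fall back */
	}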
While testing this RFC patch, the other interesting thing we found is that the performance results also seem related to NIC flow steering. We are spending time evaluating different NICs' flow director implementations now; we will enhance this patch based on our findings later.
We have tried different scheduling granularities: per-device based, per-vq based, and per-work-type (tx_kick, rx_kick, tx_net, rx_net) based vhost scheduling (sketched below). We found that, so far, the per-vq based scheduling is good enough.
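For what it's worth, here is how I picture the per-vq variant; all names below are mine, not the patch's. Per-device scheduling would key the same choice on the device instead of the vq, and per-work-type on the work kind:

	#include <linux/sched.h>
	#include "vhost.h"	/* struct vhost_virtqueue */

	/* Hypothetical enqueue helper: put the vq's pending work on
	 * the chosen worker's list. */
	static void vhost_queue_vq_work(int cpu, struct vhost_virtqueue *vq);

	/* Each kick re-selects a worker based on where the kick ran.
	 * Since a given vq is usually kicked by the same vcpu, the
	 * same worker tends to be picked for that vq every time,
	 * while the TX and RX vqs of one device land on two different
	 * workers and proceed in parallel. */
	static void vhost_vq_kick_work(struct vhost_virtqueue *vq)
	{
		int cpu = vhost_pick_cpu();	/* earlier sketch */

		vhost_queue_vq_work(cpu, vq);
		wake_up_process(vhost_workers[cpu]);	/* earlier sketch's array */
	}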
Could you please explain more about those scheduling strategies? Does per-device based mean letting a dedicated vhost thread handle all the work from that vhost device? As you mentioned, maybe the scheduling could be improved by taking the flow-steering info (queue mapping, rxhash, etc.) of the skb on the host into account.
Yes, per-device scheduling means one per-cpu vhost thread handles all work from one particular vhost device.

Yes, we think