Hi,
  I'm profiling memcpy and seeing behavior that is strange (to me at least) and
wanted to see if someone has an idea of what may be happening.

My set up is as follows:

I have an Ubuntu 12.04 Linux host running a 3.2.0-23 kernel. It has four
10-core CPUs with two-way hyper-threading and 128 GB of RAM. I have
instantiated a VM running the same OS as the host, with KVM enabled. The VM
is given 16 vCPUs and 6 GB of guest memory. The QEMU process (qemu-2.3.0-rc3)
is bound to host CPUs 1-16 (inclusive) using taskset. (The idea is that each
LWP QEMU spins up per vCPU runs on a separate hyper-thread.)

taskset -pc 1-16 $qemupid


An application on this VM mallocs two 1 GB chunks to be used as source and
destination (virtual addresses aligned to a 256-byte boundary). This
application spins off *n* threads, each performing memcpy in parallel. Each
thread is again bound to a vCPU using *pthread_setaffinity_np()* inside the
guest application, in round-robin fashion. Each memcpy is 32 MB. I
experimented with 4 to 32 threads in increments of 4. Each thread works on a
different slice, where the slice size is 32 MB:

src = bufaligned1 + (slice_sz * j);
dst = bufaligned2 + (slice_sz * j);
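
In case it helps, here is a stripped-down sketch of what the guest
application does. This is not the actual code: the buffer names, the
posix_memalign call I use for the 256-byte alignment, and the clock_gettime
timing are just stand-ins to show the structure.

/* build: gcc -O2 -pthread memcpy_bench.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NVCPUS   16
#define SLICE_SZ (32UL << 20)            /* 32 MB per thread */

static char *bufaligned1, *bufaligned2;  /* 1 GB src/dst, 256-byte aligned */

static void *copy_slice(void *arg)
{
    long j = (long)arg;
    struct timespec t0, t1;

    /* round-robin pinning: thread j runs on vCPU j % 16 */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j % NVCPUS, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    char *src = bufaligned1 + (SLICE_SZ * j);
    char *dst = bufaligned2 + (SLICE_SZ * j);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, SLICE_SZ);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ms = (t1.tv_sec - t0.tv_sec) * 1000 +
              (t1.tv_nsec - t0.tv_nsec) / 1000000;
    printf("thread %ld: %ld ms\n", j, ms);
    return NULL;
}

int main(int argc, char **argv)
{
    long n = argc > 1 ? atol(argv[1]) : 4;   /* 4..32 threads */
    if (n < 1 || n > 32)
        n = 32;
    pthread_t tid[32];

    /* two 1 GB chunks, aligned to a 256-byte boundary */
    if (posix_memalign((void **)&bufaligned1, 256, 1UL << 30) ||
        posix_memalign((void **)&bufaligned2, 256, 1UL << 30))
        return 1;

    /* touch all pages up front (sketch only) so demand faults
     * do not skew the copy timing */
    memset(bufaligned1, 1, 1UL << 30);
    memset(bufaligned2, 0, 1UL << 30);

    for (long j = 0; j < n; j++)
        pthread_create(&tid[j], NULL, copy_slice, (void *)j);
    for (long j = 0; j < n; j++)
        pthread_join(tid[j], NULL);
    return 0;
}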


I noticed that as the number of threads increases, the time taken to perform
the 32 MB memcpy increases too. Times, measured in milliseconds, are given
below (threads - transfer time):

04 - 52
08 - 79
12 - 148
16 - 180
20 - 223
24 - 270
28 - 302
32 - 354
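
To look at the same numbers another way, I also converted the table into
aggregate copy bandwidth (threads * 32 MB / time). The snippet below is just
my scratch calculation; the thread counts and times are the table above, and
I'm assuming the times are per-thread wall-clock for a single 32 MB copy with
all threads running concurrently.

#include <stdio.h>

int main(void)
{
    /* (threads, milliseconds) pairs from the table above */
    int threads[] = { 4, 8, 12, 16, 20, 24, 28, 32 };
    int ms[]      = { 52, 79, 148, 180, 223, 270, 302, 354 };

    for (int i = 0; i < 8; i++) {
        /* each thread copies one 32 MB slice */
        double gb = threads[i] * 32.0 / 1024.0;
        printf("%2d threads: %.2f GB/s aggregate\n",
               threads[i], gb / (ms[i] / 1000.0));
    }
    return 0;
}

If I read that right, the aggregate stays roughly flat at around
2.4-3.2 GB/s regardless of thread count.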

I was expecting the transfer time to stay approximately the same up to 16
threads, since I have 16 vCPUs bound to 16 hyper-threads. Beyond 16 threads
the transfer time should increase, because multiple threads are being
scheduled on the same vCPU and there is resource contention. But why is the
transfer time increasing between 4 and 16 threads too? It looks like there is
some contention at the host level. Any ideas what that could be, so that I
can focus on and profile that component?

Thanks
Shesha
