Re: [OMPI users] MPI processes swapping out

George Bosilca Fri, 23 Mar 2007 19:15:40 -0400

So far, the behavior described seems normal and expected. Because Open MPI never goes into blocking mode, the processes will always spin between active and sleep mode. More processes on the same node lead to more time in system mode (because of the empty polls). There is a trick in the trunk version of Open MPI that triggers blocking mode if and only if TCP is the only device in use. Please try adding "--mca btl tcp,self" to your mpirun command line, and check the output of vmstat.


  Thanks,
    george.
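
One way to compare vmstat output between runs, with and without the "--mca btl tcp,self" setting suggested above, is to average its CPU columns. The following is a minimal sketch and not something posted in this thread: vmstat_cpu_avg is a hypothetical helper name, and it assumes the vmstat header labels the user and system CPU columns "us" and "sy", which should be checked on the platform in use.

```shell
# Hypothetical helper: average the user ("us") and system ("sy") CPU columns
# of vmstat output read from stdin. Columns are found by header name because
# their position differs between platforms.
vmstat_cpu_avg() {
  awk '
    {
      # A header line contains the literal labels "us" and "sy".
      # If "sy" appears twice (Solaris also uses it for syscalls),
      # the last match wins, which is the CPU column.
      hdr = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "us") { u = i; hdr = 1 }
        if ($i == "sy") { s = i; hdr = 1 }
      }
      if (hdr) next
    }
    # Data line: both CPU fields must be plain integers.
    u && s && $u ~ /^[0-9]+$/ && $s ~ /^[0-9]+$/ {
      us += $u; sy += $s; n++
    }
    END {
      if (n) printf "samples=%d avg_us=%.1f avg_sy=%.1f\n", n, us / n, sy / n
    }
  '
}

# Usage while the MPI job is running (assumes vmstat is installed):
#   vmstat 1 30 | vmstat_cpu_avg
```

The first vmstat sample typically reports averages since boot, so a longer sampling window gives a more representative figure.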


On Mar 23, 2007, at 3:32 PM, Heywood, Todd wrote:

Rolf,
Is it possible that everything is working just as it should?
That's what I'm afraid of :-). But I did not expect to see such
communication overhead due to blocking from mpiBLAST, which is very
coarse-grained. I then tried HPL, which is computation-heavy, and found the
same thing. Also, the system time seemed to correspond to the MPI processes
cycling between run and sleep (as seen via top), and I thought that setting
the mpi_yield_when_idle parameter to 0 would keep the processes from
entering sleep state when blocking. But it doesn't.

Todd



On 3/23/07 2:06 PM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:
Todd:

I assume the system time is being consumed by
the calls to send and receive data over the TCP sockets.
As the number of processes in the job increases, more
time is spent waiting for data from one of the other processes.

I did a little experiment on a single node to see the difference
in system time consumed when running over TCP vs when
running over shared memory.   When running on a single
node and using the sm btl, I see almost 100% user time.
I assume this is because the sm btl handles sending and
receiving its data within a shared memory segment.
However, when I switch over to TCP, I see my system time
go up.  Note that this is on Solaris.

RUNNING OVER SELF,SM
mpirun -np 8 -mca btl self,sm hpcc.amd64

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  3505 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
  3503 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
  3499 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
  3497 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
  3501 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
  3507 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
  3509 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
  3495 rolfv     97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1

RUNNING OVER SELF,TCP
mpirun -np 8 -mca btl self,tcp hpcc.amd64

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  4316 rolfv     93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
  4328 rolfv     91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
  4324 rolfv     98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
  4320 rolfv     88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
  4322 rolfv     94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
  4318 rolfv     92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
  4326 rolfv     93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
  4314 rolfv     91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1

I also ran HPL over a larger cluster of 6 nodes, and noticed even higher
system times.

And lastly, I ran a simple MPI test over a cluster of 64 nodes, 2 procs
per node using Sun HPC ClusterTools 6, and saw about a 50/50 split between
user and system time.

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 11525 rolfv     55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 maxtrunc_ct6/1
 11526 rolfv     54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 maxtrunc_ct6/1

Is it possible that everything is working just as it should?

Rolf

Heywood, Todd wrote On 03/22/07 13:30,:
Ralph,
Well, according to the FAQ, aggressive mode can be "forced", so I did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
processor/memory affinity on. Effects were minor. The MPI tasks still cycle
between run and sleep states, driving up system time well over user time.
Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
(depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
sure, I also tried running directly with a hostfile with slots=4 or slots=2.
The same behavior occurs.
This behavior is a function of the size of the job. I.e., as I scale from 200
to 800 tasks the run/sleep cycling increases, so that system time grows from
maybe half the user time to maybe 5 times user time.

This is for TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:
Just for clarification: ompi_info only shows the *default* value of the MCA
parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
that value is reset internally if the system sees an "oversubscribed"
condition.
The issue here isn't how many cores are on the node, but rather how many
were specifically allocated to this job. If the allocation wasn't at least 2
(in your example), then we would automatically reset mpi_yield_when_idle to
be non-aggressive, regardless of how many cores are actually on the node.
Ralph
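
The rule described in this message can be sketched as a small shell function. This only illustrates the logic as stated here and is not Open MPI's actual code; the function name and its two arguments are made up for the example. It prints 0 for aggressive mode and 1 for yield-when-idle, matching the parameter values discussed in the thread.

```shell
# Hypothetical helper (not an Open MPI command): prints the effective
# mpi_yield_when_idle value for one node under the rule described above.
#   $1 = slots allocated to this job on the node
#   $2 = processes the job launches on the node
effective_yield_when_idle() {
  slots_allocated=$1
  procs_on_node=$2
  if [ "$procs_on_node" -gt "$slots_allocated" ]; then
    echo 1   # oversubscribed: reset to non-aggressive (yield when idle)
  else
    echo 0   # not oversubscribed: the aggressive default is kept
  fi
}

effective_yield_when_idle 4 2   # 2 procs, 4 slots allocated
effective_yield_when_idle 1 2   # 2 procs, only 1 slot allocated
```

The second call shows the case raised in the message: two tasks on a node where fewer than two slots were allocated are treated as oversubscribed, however many cores the node has.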


On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
4-core node, the 2 tasks are still cycling between run and sleep, with
higher system time than user time.
Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
so that suggests the tasks aren't swapping out on blocking calls.

Still puzzled.

Thanks,
Todd


On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
Are you using a scheduler on your system?
More specifically, does Open MPI know that you have four process slots
on each node?  If you are using a hostfile and didn't specify
"slots=4" for each host, Open MPI will think that it's
oversubscribing and will therefore call sched_yield() in the depths
of its progress engine.
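
For reference, a hostfile that declares four slots per node might look like the sketch below. The host names node01 and node02 and the application name ./my_mpi_app are placeholders, not taken from this thread. The mpirun command is only printed, so the snippet runs without Open MPI installed.

```shell
# Write a two-node hostfile with four slots per node (placeholder host names).
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node01 slots=4
node02 slots=4
EOF

# Sum the slots declared in the file.
total_slots=$(awk -F'slots=' 'NF > 1 { sum += $2 } END { print sum + 0 }' "$hostfile")
echo "total slots: $total_slots"

# Print the command one would then run (./my_mpi_app is a placeholder).
echo "mpirun -np $total_slots --hostfile $hostfile ./my_mpi_app"
```

Launching no more processes per node than the declared slots keeps Open MPI from treating the job as oversubscribed.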


On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
P.S. I should have said that this is a pretty coarse-grained application,
and netstat doesn't show much communication going on (except in stages).


On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
I noticed that my Open MPI processes are using larger amounts of system time
than user time (via vmstat, top). I'm running on dual-core, dual-CPU
Opterons, with 4 slots per node, where the program has the nodes to
themselves. A closer look showed that they are constantly switching between
run and sleep states with 4-8 page faults per second.

Why would this be? It doesn't happen with 4 sequential jobs running on a
node, where I get 99% user time, maybe 1% system time.
The processes have plenty of memory. This behavior occurs whether I use
processor/memory affinity or not (there is no oversubscription).
Thanks,

Todd

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
