On Thu, 01 Jun 2006 18:07:07 -0600, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> This *sounds* like the classic oversubscription problem: Open MPI's
> aggressive vs. degraded operating modes:
>
> http://www.open-mpi.org/faq/?category=running#oversubscribing

Good link; bookmarked for (internal) documentation...

Specifically, "slots" is *not* meant to be the number of processes to
run.  It's meant to be how many processors are available to run.  Hence,
if you lie and tell OMPI that you have more slots than CPUs, OMPI will
think that it can run in aggressive mode.  But you'll have less
processors than processes, and all of them will be running in aggressive
mode -- hence, massive slowdown.

However, you say that you've got 2 dual core opterons in a single box,
so there should be 4 processors.  Hence "slots=4" should not be a lie.

It's good to hear that my concept of slots wasn't off (although my message may not have given that impression). It certainly seems to me that with two dual-core Opterons I should use slots=4.
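(For reference, the machinefile in question looks something like the sketch below; "n01" is a placeholder hostname, and slots reflects the number of physical cores, not the number of processes to launch:)
******
# machinefile: one entry per node; slots = physical cores on that node
n01 slots=4
******
which would then be launched with something like "mpirun -np 4 -machinefile machinefile ./xhpl".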

> I can't think of why this would happen.

> Can you confirm that your Linux installation thinks that it has 4
> processors and will schedule 4 processes simultaneously?

Fun story: at first, *I* thought it was a simple case of two single-core processors (slots=2, and I used two nodes to get 4 CPUs). I believed it had only two processors because `cat /proc/cpuinfo` would list two processors: CPU0 and CPU1. (I.e., the Linux installation doesn't see four processors; it sees two dual-core processors.)

Then somebody pointed out to me that they were dual-core, and that cpuinfo showed it:
******
processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : unknown
stepping        : 2
cpu MHz         : 2613.419
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2     <----- Two cores -------
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm
bogomips        : 5227.16
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
******
To verify that it acted like it had four cores, I tried the following (using two nodes in the machinefile, each with slots=2):
1.) Start a 4-CPU linpack job. (Supposedly using half of the CPU power in each machine.)
    * With just 4 processes in total, the problem size took approximately 0.08 s to finish (repeatably; the HPL.dat is set to run several of the same problem size).
    * 'top' listed *two* CPUs, both pegged at 100%. Each hpl process was taking 100% of a CPU.
2.) Start a second 4-CPU linpack job (using the other half of the CPU power).
    * When I started the second job (8 total processes, 4 in each job), the same problem size started to take 0.19 s to complete (on both jobs).
    * 'top' listed *two* CPUs, both pegged at 100%. Each hpl process was taking 50% of a CPU.
************
Then I tried the same 4-process linpack job on a single node (one node in the machinefile, slots=2). The results were essentially identical to #2 above (where each node was likewise running 4 processes).

So it seems that although the system has dual-core CPUs, only one core is being used per CPU; four simultaneous processes are not being scheduled.
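As a sanity check independent of /proc/cpuinfo, a trivial program can ask the kernel how many processors it actually has online. A minimal sketch (using glibc's sysconf(); "nprocs.c" is just my name for it):
******
/* nprocs.c: report how many processors the kernel sees.
   Build: gcc -o nprocs nprocs.c */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);  /* online/schedulable */
    long conf   = sysconf(_SC_NPROCESSORS_CONF);  /* physically present */
    printf("online: %ld, configured: %ld\n", online, conf);
    return 0;
}
******
If this prints "online: 2" on a box with two dual-core Opterons, the missing cores are a kernel/BIOS issue, not an Open MPI one.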

So the oversubscription hypothesis appears to be 100% correct; slots=4 is oversubscribing the job.
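(Side note for the archives: per the FAQ above, degraded mode can also be forced explicitly while the core count is sorted out, e.g.:)
******
mpirun --mca mpi_yield_when_idle 1 -np 4 -machinefile machinefile ./xhpl
******
so that oversubscribed processes at least yield the CPU instead of spinning.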

Now I get to go find out *why* the job is oversubscribed, since there are 4 cores able to handle the processes... I'll have to see if the system behaves similarly with non-MPI processes (i.e., whether it fails to use all of the available cores). It may very well be a problem with the hardware or OS; it's the pre-release distro I wrote about in another posting yesterday...
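A throwaway non-MPI test along those lines: fork four pure spin loops and watch 'top'. (A minimal sketch; if only two CPUs peg at 100% while it runs, the problem is below MPI entirely.)
******
/* burn4.c: fork four CPU burners; with four usable cores, 'top'
   should show four CPUs at ~100%.  Build: gcc -o burn4 burn4.c
   Kill with Ctrl-C when done. */
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int i;
    for (i = 0; i < 4; i++) {
        if (fork() == 0) {              /* child: spin forever */
            volatile unsigned long x = 0;
            for (;;) x++;
        }
    }
    while (wait(NULL) > 0)              /* parent: never returns */
        ;
    return 0;
}
******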

I'm wondering if there is something happening behind the scenes... I'll have to check...
--
Troy Telford
