Hello,

I was running the pi demo on multicore machines (a Pentium Dual and a
Core i7) to see how it scales, but the time measurements sometimes come
back quite different from run to run. The FAQ suggests processor
affinity as one possible reason for that. For instance, the demo takes
about 3 s on one core of the Pentium Dual:
$ mpirun -h
mpirun (Open MPI) 1.4.2
...
$ mpirun -c 1 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs    Time (s)        Error
 1       2.963499        6.332712E-13

and approximately half that time on both cores, but only sometimes:
$ mpirun -c 2 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs    Time (s)        Error
 2       1.535394        2.291500E-13
$ mpirun -c 2 pi
...
 2       2.497539        2.291500E-13

Say 6 times out of 10 it takes 2.5 s, and 4 out of 10 it takes 1.5 s.
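
In case it helps, this is roughly what the demo does (a minimal sketch
from memory, not the exact source I'm timing): each rank integrates its
share of the 1e+08 midpoint-rule iterations, and the partial sums meet
in a single send/recv at the end, so there is essentially no
communication.

/* Minimal sketch of the pi demo (from memory, not the exact source):
 * each rank integrates its slice of [0,1] with the midpoint rule and
 * the partial sums meet in a single send/recv at the end. */
#include <math.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long n = 100000000;            /* 1e+08 iterations */
    int rank, size;
    long i;
    double h, sum = 0.0, pi, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    h = 1.0 / (double)n;
    for (i = rank; i < n; i += size) {   /* cyclic split of the iterations */
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= h;

    if (rank != 0) {                     /* the single send-recv at the end */
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        pi = sum;
        for (i = 1; i < size; i++) {
            MPI_Recv(&sum, 1, MPI_DOUBLE, (int)i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            pi += sum;
        }
        t1 = MPI_Wtime();
        printf("# Estimation of pi is %f after %g iterations\n",
               pi, (double)n);
        printf("# PCs\tTime (s)\tError\n");
        printf(" %d\t%f\t%E\n", size, t1 - t0, fabs(pi - M_PI));
    }
    MPI_Finalize();
    return 0;
}

The wall-clock time should therefore be dominated by the loop, which is
why affinity is my main suspect.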

So I followed the FAQ remarks about paffinity:
$ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
...
 2       1.496346        2.291500E-13
$ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
...
 2       2.527654        2.291500E-13

Say 2 out of 10 times it takes 2.5 s. I'm not sure that's the way it's
expected to work, getting the not-bound-to-core time once in every five
tries.

I guess I'm doing something wrong. I tried to check whether the ranks
are actually being bound, and it seems they are:
$ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
[nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],0] to 
cpus 0001
[nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],1] to 
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.536590        2.291500E-13
$ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
[nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],0] to 
cpus 0001
[nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],1] to 
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       2.556353        2.291500E-13
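
The fork-time messages look right. To be extra sure, I was also thinking
of checking from inside the processes; something along these lines (a
Linux-specific sketch using sched_getaffinity(), not something I've
wired into the demo yet) should print the mask each rank actually got:

/* Sketch: have each MPI rank print the CPU affinity mask the kernel
 * actually gave it, to cross-check what -report-bindings claims.
 * Linux-specific (sched_getaffinity). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, cpu;
    cpu_set_t mask;
    char buf[256];
    size_t len = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf[0] = '\0';
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        /* List every CPU the kernel allows this process to run on. */
        for (cpu = 0; cpu < CPU_SETSIZE && len + 8 < sizeof(buf); cpu++)
            if (CPU_ISSET(cpu, &mask))
                len += snprintf(buf + len, sizeof(buf) - len, "%d ", cpu);
    } else {
        snprintf(buf, sizeof(buf), "(sched_getaffinity failed)");
    }
    printf("rank %d is allowed on cpus: %s\n", rank, buf);

    MPI_Finalize();
    return 0;
}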

Then I thought it might be related to Open MPI believing the node was
being oversubscribed and thus entering Degraded mode instead of
Aggressive mode (although this pi demo has only one send-recv at the end
of the code, so it's hard to believe a 1 s difference is due to that).
The -nooversubscribe switch is handy for quickly checking that:
$ mpirun -nooversubscribe -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:

If I understand correctly, slots can only be specified in a hostfile,
not via mpirun switches. Is that true?
$ cat hf
n0 slots=2 max-slots=2
$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 -report-bindings -c 3 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
...
cpd@nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi
[nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],0] to 
cpus 0001
[nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],1] to 
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.502448        2.291500E-13
cpd@nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi
[nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],0] to 
cpus 0001
[nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],1] to 
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       2.540400        2.291500E-13

Sometimes it takes 10-15 tries to get 2.5 s. That's much more reliable,
but I'm not sure I'm getting it right. Is this the way it's expected to
work? Is it normal to get the not-bound-to-core time once every 10-15
tries? Am I missing something else?

I see there are also rankfiles, so I tried this
$ cat rf
rank 0=n0 slot=0:0
rank 1=n0 slot=0:1
cpd@nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
[nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],0] to 
slot_list 0:0
[nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],1] to 
slot_list 0:1
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       2.503292        2.291500E-13
cpd@nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
[nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],0] to 
slot_list 0:0
[nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],1] to 
slot_list 0:1
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.546936        2.291500E-13

I got 2.5 s 4 out of 10 times, and roughly the same (well, 5 out of 10)
if I also include -hostfile hf, so it seems I am still missing
something. Probably that uncontrolled "something" is what the 2/10 vs
4/10 statistics depend on.

And editing the files to point to n1, one of the Core i7 nodes...
cpd@nodo0:~/ompi$ grep cores /proc/cpuinfo
cpu cores       : 2
cpu cores       : 2
cpd@nodo0:~/ompi$ ssh n1 grep cores /proc/cpuinfo
cpu cores       : 4
cpu cores       : 4
cpu cores       : 4
cpu cores       : 4
$ cat hf
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ cat rf
rank 0=n1 slot=0:0
rank 1=n1 slot=0:1
rank 2=n1 slot=0:2
rank 3=n1 slot=0:3

The 3 s sequential run becomes 1.66 s on this node:
$ mpirun -H n1 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 1       1.662411        6.332712E-13
cpd@nodo0:~/ompi$ mpirun -H n1 -mca mpi_paffinity_alone 1 -report-bindings -c 1 
pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 1       1.663015        6.332712E-13

Hmf, it didn't report the bindings since there are none, as I forgot to
mention the slots...

cpd@nodo0:~/ompi$ mpirun -H n1 -display-devel-map -mca mpi_paffinity_alone 1 
-report-bindings -c 1 pi

 Map generated by mapping policy: 0402
        Npernode: 0     Oversubscribe allowed: TRUE     CPU Lists: FALSE
...
 Data for node: Name: n1                Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 1    Num cores/socket: 2
...
        Num slots: 1    Slots in use: 1
        Num slots allocated: 1  Max slots: 0
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 1       1.663060        6.332712E-13

...I forgot to use the hostfile to specify the slots, so Open MPI thinks
I'm oversubscribing a 1-slot node. It also thinks this is a dual-core
node, like the one I'm typing mpirun on. I can fix the core count from
the command line, but not the oversubscription (and consequently the
Degraded mode). The 1.66 s become 1.84 s with 2 ranks. Hmm, now that I
think of it, that's a 1 s difference from the expected 0.84 s...

$ mpirun -H n1 -num-cores 4 -display-devel-map -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi

 Map generated by mapping policy: 0402
        Npernode: 0     Oversubscribe allowed: TRUE     CPU Lists: FALSE
...
 Data for node: Name: n1                Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 1    Num cores/socket: 4
...
        Num slots: 1    Slots in use: 2
        Num slots allocated: 1  Max slots: 0
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.843984        2.291500E-13

So it is indeed considered oversubscribed:

$ mpirun -H n1 -nooversubscribe -num-cores 4 -display-devel-map -mca 
mpi_paffinity_alone 1 -report-bindings -c 2 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:


So here goes the hostfile
cpd@nodo0:~/ompi$ cat hf
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ mpirun -hostfile hf -nooversubscribe -num-cores 4 
-display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi

 Map generated by mapping policy: 0402
        Npernode: 0     Oversubscribe allowed: FALSE    CPU Lists: FALSE
...
 Data for node: Name: n1                Launch id: -1   Arch: 0 State: 0
        Num boards: 1   Num sockets/board: 1    Num cores/socket: 4
...
        Num slots: 4    Slots in use: 2
        Num slots allocated: 4  Max slots: 4
...
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.845798        2.291500E-13

Ouch! Where are my bindings? This did work for node n0!
cpd@nodo0:~/ompi$ cat hf1
n1 slots=4 max-slots=4
cpd@nodo0:~/ompi$ cat hf0
n0 slots=2 max-slots=2
cpd@nodo0:~/ompi$ mpirun -hostfile hf0 -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi
[nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],0] to 
cpus 0001
[nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],1] to 
cpus 0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.529327        2.291500E-13
cpd@nodo0:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.842951        2.291500E-13

Ok, I'm moving to node n1
cpd@nodo0:~/ompi$ ssh n1
cpd@nodo1:~$ cd ompi
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1 
-report-bindings -c 2 pi
[nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],0] to cpus 
0001
[nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],1] to cpus 
0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.841935        2.291500E-13
cpd@nodo1:~/ompi$ cat hf1
n1 slots=4 max-slots=4
cpd@nodo1:~/ompi$ grep cores /proc/cpuinfo
cpu cores       : 4
cpu cores       : 4
cpu cores       : 4
cpu cores       : 4
cpd@nodo1:~/ompi$ mpirun pi
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 1       1.661314        6.332712E-13

Hmf, with 2 cores I get 1.84 s 10 out of 10 times. The results below are
also 10/10, perfectly reproducible without a glitch. -bind-to-core is
required to get -report-bindings to report something. Is it different
from -mca mpi_paffinity_alone 1? I find it easier to use (it's shorter
to type).
   -bind-to-core|--bind-to-core
                         Whether to bind processes to specific cores (the
                         default)

cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 2 pi
[nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],0] to cpus 
0001
[nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],1] to cpus 
0002
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 2       1.841275        2.291500E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 3 pi
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],0] to cpus 
0001
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],1] to cpus 
0002
[nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],2] to cpus 
0004
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 3       1.571870        3.606004E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],0] to cpus 
0001
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],1] to cpus 
0002
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],2] to cpus 
0004
[nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],3] to cpus 
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 4       1.436920        4.236611E-13
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 5 pi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 5 slots
that were requested by the application:

Hmm, wait, I got this one on a second try (so it's really 1 out of 20):
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],0] to cpus 
0001
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],1] to cpus 
0002
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],2] to cpus 
0004
[nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],3] to cpus 
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 4       2.437651        4.236611E-13
Hmm, yup, I see it: again a 1 s difference from the expected 1.43 s. No clue.
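
To try to pin down where that extra second goes, my next idea is to time
each rank's loop separately and report the spread; a sketch of what I
mean (the busy loop is just a stand-in for the demo's integration loop):

/* Sketch: measure each rank's compute time separately and report the
 * spread, to see whether one rank is lagging in the slow runs.
 * The busy loop below is a stand-in for the demo's integration loop. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    long i;
    volatile double sink = 0.0;
    double t0, t_local, t_min, t_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < 50000000L; i++)      /* stand-in for the real work */
        sink += 1.0 / (1.0 + (double)i);
    t_local = MPI_Wtime() - t0;

    /* Gather the fastest and slowest per-rank loop times on rank 0. */
    MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("# loop time per rank: min %.3f s, max %.3f s\n",
               t_min, t_max);

    MPI_Finalize();
    return 0;
}

If max comes out at roughly twice min in the slow runs, that would
suggest two ranks are sharing a core despite the binding messages.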

The full devel-map for the -c 4 run (double-checking with
-nooversubscribe) is:
cpd@nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -nooversubscribe 
-report-bindings -display-devel-map -c 4 pi

 Map generated by mapping policy: 0402
        Npernode: 0     Oversubscribe allowed: FALSE    CPU Lists: FALSE
        Num new daemons: 0      New daemon starting vpid INVALID
        Num nodes: 1

 Data for node: Name: nodo1             Launch id: -1   Arch: ffc91200  State: 2
        Num boards: 1   Num sockets/board: 1    Num cores/socket: 4
        Daemon: [[9168,0],0]    Daemon launched: True
        Num slots: 4    Slots in use: 4
        Num slots allocated: 4  Max slots: 4
        Username on node: NULL
        Num procs: 4    Next node_rank: 4
        Data for proc: [[9168,1],0]
                Pid: 0  Local rank: 0   Node rank: 0
                State: 0        App_context: 0  Slot list: NULL
        Data for proc: [[9168,1],1]
                Pid: 0  Local rank: 1   Node rank: 1
                State: 0        App_context: 0  Slot list: NULL
        Data for proc: [[9168,1],2]
                Pid: 0  Local rank: 2   Node rank: 2
                State: 0        App_context: 0  Slot list: NULL
        Data for proc: [[9168,1],3]
                Pid: 0  Local rank: 3   Node rank: 3
                State: 0        App_context: 0  Slot list: NULL
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],0] to cpus 
0001
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],1] to cpus 
0002
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],2] to cpus 
0004
[nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],3] to cpus 
0008
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 4       1.452407        4.236611E-13

And this is using the rankfile
cpd@nodo1:~/ompi$ cat rf
rank 0=n1 slot=0:0
rank 1=n1 slot=0:1
rank 2=n1 slot=0:2
rank 3=n1 slot=0:3
cpd@nodo1:~/ompi$ mpirun -rf rf -report-bindings -c 4 pi
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],0] to 
slot_list 0:0
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],1] to 
slot_list 0:1
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],2] to 
slot_list 0:2
[nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],3] to 
slot_list 0:3
# Estimation of pi is 3.141593 after 1e+08 iterations
# PCs   Time (s)        Error
 4       1.435335        4.236611E-13

So I thought I'd better ask for help and, following mailing-list
experience, have yet another coffee in the meantime :-)

Are these results normal? Why can't I get results on the Core i7 similar
to those on the Pentium Dual? I would have expected to see something
like 0.84 s with -c 2 every now and then, and very frequently when using
-bind-to-core. I realize I don't have a clear picture of when/whether a
hostfile with slots is required, or whether -num-cores together with
-bind-to-core should have worked fine. Is there any mpirun switch or MCA
parameter I'm missing that would give 10/10 consistent results on the
Pentium Dual, and/or let the pi application scale well on the Core i7?

Thanks in advance for any suggestions.
