Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
Interesting point.

--- On Thu, 12/8/10, Ashley Pittman  wrote:

From: Ashley Pittman 
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" 
Received: Thursday, 12 August, 2010, 12:22 AM


On 11 Aug 2010, at 05:10, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.

Going off-topic here but I think it's worth saying:

If you have a dataset that requires collective communication then use the 
function call that best matches what you are trying to do.  Far too many people 
try to re-implement the collectives in their own code, and it nearly always 
goes badly; as someone who's spent many years implementing collectives I've 
lost count of the number of times I've made someone's code go faster by 
replacing 500+ lines of code with a single call to MPI_Gather().
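
For example, the sort of hand-rolled exchange that usually gets replaced
collapses to a single collective call; a minimal sketch (the payload here is
invented purely for illustration):

/* gather_example.c -- minimal MPI_Gather sketch; data is made up */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int mine = rank * rank;                 /* each rank's local value */
    int *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * sizeof(int));

    /* one call replaces a hand-rolled loop of point-to-point sends/recvs */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("rank %d contributed %d\n", i, all[i]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}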

In the rare case that you find that some collectives are slower than they 
should be for your specific network and message size then the best thing to do 
is to work with the Open-MPI developers to tweak the thresholds so a better 
algorithm gets picked by the library.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
I (a single user) am running N separate MPI applications doing 1-to-N 
broadcasts over PVM; each MPI application is started on each machine 
simultaneously by PVM - the reasons are back in the post history.

The problem is that they somehow collide - yes, I know this should not happen; 
the question is why.

--- On Wed, 11/8/10, Richard Treumann  wrote:

From: Richard Treumann 
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" 
Received: Wednesday, 11 August, 2010, 11:34 PM



Randolph,

I am confused about using multiple, concurrent mpirun operations.  If there
are M uses of mpirun and each starts N tasks (carried out under pvm or any
other way) I would expect you to have M completely independent MPI jobs with
N tasks (processes) each.  You could have some root in each of the M MPI jobs
do an MPI_Bcast to the other (N-1) in that job, but there is no way in MPI
(without using accept/connect) to get tasks of job 0 to give data to tasks of
jobs 1-(M-1).

With M uses of mpirun, you have M worlds that are forever isolated from the
other M-1 worlds (again, unless you do accept/connect).

In what sense are you treating this as a single MxN application?  (I use
M & N to keep them distinct. I assume if M == N, we have your case.)

Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363



Re: [OMPI users] Hyper-thread architecture effect on MPI jobs

2010-08-11 Thread Eugene Loh




The way MPI processes are being assigned to hardware threads is perhaps
neither controlled nor optimal.  On the HT nodes, two processes may end
up sharing the same core, with poorer performance.

Try submitting your job like this

% cat myrankfile1
rank  0=os223 slot=0
rank  1=os221 slot=0
rank  2=os222 slot=0
rank  3=os224 slot=0
rank  4=os228 slot=0
rank  5=os229 slot=0
rank  6=os223 slot=1
rank  7=os221 slot=1
rank  8=os222 slot=1
rank  9=os224 slot=1
rank 10=os228 slot=1
rank 11=os229 slot=1
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=2
rank 17=os229 slot=2
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile myrankfile1 ./a.out

You can also try

% cat myrankfile2
rank  0=os223 slot=0
rank  1=os221 slot=0
rank  2=os222 slot=0
rank  3=os224 slot=0
rank  4=os228 slot=0
rank  5=os229 slot=0
rank  6=os223 slot=1
rank  7=os221 slot=1
rank  8=os222 slot=1
rank  9=os224 slot=1
rank 10=os228 slot=2
rank 11=os229 slot=2
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=4
rank 17=os229 slot=4
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile myrankfile2 ./a.out

Which one reproduces your problem and which one avoids it depends on
how the BIOS numbers your HTs.  Once you have confirmed that you understand
the problem, you (with the help of this list) can devise a solution
approach for your situation.


Saygin Arkan wrote:
Hello,
  
I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines have
the following properties, os221, os222, os223, os224:
  
  vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
stepping    : 7
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips    : 4999.40
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
  
and the problematic, hyper-threaded 2 machines are as follows, os228
and os229:
  
  vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping    : 5
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
ida
bogomips    : 5396.88
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
  
  
The problem is: those 2 machines appear to have 8 cores (virtually; the
actual core count is 4).
When I submitted an MPI job, I calculated the comparison times across the
cluster and got strange results.
  
I'm running the job on 6 nodes, 3 cores per node. Sometimes (in about 1/3
of the tests) os228 or os229 returns strange results: 2 cores are slow
(slower than the cores on the first 4 machines) but the 3rd core is
extremely fast.
  
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing
Times...
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)   
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)   
:34 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)   
:39 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)   
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)   
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228
RANK(10)    :48 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229
RANK(11)    :35 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223
RANK(12)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221
RANK(13)    :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222
RANK(14)    :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224
RANK(15)    :38 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228
RANK(16)    :43 sec

Re: [OMPI users] Hyper-thread architecture effect on MPI jobs

2010-08-11 Thread Gus Correa

Hi Saygin

You could:

1) turn off hyperthreading (on BIOS), or

2) use the mpirun options (you didn't send your mpirun command)
to distribute the processes across the nodes, cores, etc.
"man mpirun" is a good resource; see the explanations of
the -byslot, -bynode, and -loadbalance options.

3) In addition, you can use the mca parameters to set processor affinity
in the mpirun command line "mpirun -mca mpi_paffinity_alone 1 ..."
I don't know how this will play in a hyperthreaded machine,
but it works fine in our dual processor quad-core computers
(not hyperthreaded).
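
For example, combining 2) and 3) -- the hostnames are the ones from your
message, and option spellings may vary with your Open MPI version, so
double-check "man mpirun":

% mpirun -bynode -np 18 -mca mpi_paffinity_alone 1 \
      -host os221,os222,os223,os224,os228,os229 ./your_app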

Depending on your code, hyperthreading may not help performance anyway.

I hope this helps,
Gus Correa

Saygin Arkan wrote:

Hello,

I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines have 
the following properties, os221, os222, os223, os224:


vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
stepping: 7
cache size  : 3072 KB
physical id : 0
siblings: 4
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm

bogomips: 4999.40
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual

and the problematic, hyper-threaded 2 machines are as follows, os228 and 
os229:


vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping: 5
cache size  : 8192 KB
physical id : 0
siblings: 8
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni 
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida

bogomips: 5396.88
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual


The problem is: those 2 machines appear to have 8 cores (virtually; the 
actual core count is 4).
When I submitted an MPI job, I calculated the comparison times across the 
cluster and got strange results.


I'm running the job on 6 nodes, 3 cores per node. Sometimes (in about 1/3 of 
the tests) os228 or os229 returns strange results: 2 cores are slow (slower 
than the cores on the first 4 machines) but the 3rd core is extremely fast.


2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing 
Times...
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)
:34 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)
:39 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)
:37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(10)
:*48 sec*
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(11)
:35 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(12)
:38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(13)
:37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222 RANK(14)
:37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224 RANK(15)
:38 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228 RANK(16)
:*43 sec*
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os229 RANK(17)
:35 sec

TOTAL CORRELATION TIME: 48 sec


or another test:

2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - RANK(0) Printing 
Times...
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(1)
:170 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os222 RANK(2)
:161 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os224 RANK(3)
:158 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os228 RANK(4)
:142 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os229 RANK(5)
:*256 sec*
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os223 RANK(6)
:156 sec
2010-08-09 15:28:10,947 272904 DEBUG 

Re: [OMPI users] Hyper-thread architecture effect on MPI jobs

2010-08-11 Thread pooja varshneya
Saygin,

You can use the mpstat tool to see the load on each core at runtime.
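For example (mpstat comes with the sysstat package; run it on os228/os229
while the job is active):

% mpstat -P ALL 2

You can then cross-reference the busy CPU numbers against the core ids in
/proc/cpuinfo to see whether two of your ranks landed on sibling hardware
threads of the same physical core.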

Do you know exactly which particular calls are taking a longer time?
You can run just those two computations (one at a time) on a different
machine and check if the other machines have similar or lesser
computation time.

- Pooja

On Wed, Aug 11, 2010 at 10:55 AM, Saygin Arkan  wrote:
> Hello,
>
> I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines have the
> following properties, os221, os222, os223, os224:
>
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 23
> model name  : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
> stepping    : 7
> cache size  : 3072 KB
> physical id : 0
> siblings    : 4
> core id : 3
> cpu cores   : 4
> fpu : yes
> fpu_exception   : yes
> cpuid level : 10
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
> constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est
> tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
> bogomips    : 4999.40
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
>
> and the problematic, hyper-threaded 2 machines are as follows, os228 and
> os229:
>
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 26
> model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
> stepping    : 5
> cache size  : 8192 KB
> physical id : 0
> siblings    : 8
> core id : 3
> cpu cores   : 4
> fpu : yes
> fpu_exception   : yes
> cpuid level : 11
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
> est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida
> bogomips    : 5396.88
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
>
>
> The problem is: those 2 machines appear to have 8 cores (virtually; the
> actual core count is 4).
> When I submitted an MPI job, I calculated the comparison times across the
> cluster and got strange results.
>
> I'm running the job on 6 nodes, 3 cores per node. Sometimes (in about 1/3 of
> the tests) os228 or os229 returns strange results: 2 cores are slow (slower
> than the cores on the first 4 machines) but the 3rd core is extremely fast.
>
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing
> Times...
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)    :37
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)    :34
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)    :39
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)    :37
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(10)    :48
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(11)    :35
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(12)    :38
> sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(13)    :37
> sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222 RANK(14)    :37
> sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224 RANK(15)    :38
> sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228 RANK(16)    :43
> sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os229 RANK(17)    :35
> sec
> TOTAL CORRELATION TIME: 48 sec
>
>
> or another test:
>
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - RANK(0) Printing
> Times...
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(1)
> :170 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os222 RANK(2)
> :161 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os224 RANK(3)
> :158 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os228 RANK(4)
> :142 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os229 RANK(5)
> :256 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os223 RANK(6)
> :156 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(7)
> :162 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(8)
> :159 sec
> 2010-08-09 15:28:10,947 272905 DEBUG 

[OMPI users] Hyper-thread architecture effect on MPI jobs

2010-08-11 Thread Saygin Arkan
Hello,

I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines have the
following properties, os221, os222, os223, os224:

vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
stepping: 7
cache size  : 3072 KB
physical id : 0
siblings: 4
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est
tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips: 4999.40
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual

and the problematic, hyper-threaded 2 machines are as follows, os228 and
os229:

vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping: 5
cache size  : 8192 KB
physical id : 0
siblings: 8
core id : 3
cpu cores   : 4
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida
bogomips: 5396.88
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual


The problem is: those 2 machines appear to have 8 cores (virtually; the actual
core count is 4).
When I submitted an MPI job, I calculated the comparison times across the
cluster and got strange results.

I'm running the job on 6 nodes, 3 cores per node. Sometimes (in about 1/3 of
the tests) os228 or os229 returns strange results: 2 cores are slow (slower
than the cores on the first 4 machines) but the 3rd core is extremely fast.

2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing Times...
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)    :37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)    :34 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)    :39 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)    :37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)    :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(10)   :*48 sec*
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(11)   :35 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(12)   :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(13)   :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222 RANK(14)   :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224 RANK(15)   :38 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228 RANK(16)   :*43 sec*
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os229 RANK(17)   :35 sec
TOTAL CORRELATION TIME: 48 sec


or another test:

2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - RANK(0) Printing Times...
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(1)    :170 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os222 RANK(2)    :161 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os224 RANK(3)    :158 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os228 RANK(4)    :142 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os229 RANK(5)    :*256 sec*
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os223 RANK(6)    :156 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(7)    :162 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(8)    :159 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(9)    :168 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(10)   :141 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os229 RANK(11)   :136 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os223 RANK(12)   :173 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os221 RANK(13)   :164 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(14)   :171 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(15)   :156 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(16)   :136 sec
2010-08-09 

Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Ashley Pittman

On 11 Aug 2010, at 05:10, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.

Going off-topic here but I think it's worth saying:

If you have a dataset that requires collective communication then use the 
function call that best matches what you are trying to do.  Far too many people 
try to re-implement the collectives in their own code, and it nearly always 
goes badly; as someone who's spent many years implementing collectives I've 
lost count of the number of times I've made someone's code go faster by 
replacing 500+ lines of code with a single call to MPI_Gather().

In the rare case that you find that some collectives are slower than they 
should be for your specific network and message size then the best thing to do 
is to work with the Open-MPI developers to tweak the thresholds so a better 
algorithm gets picked by the library.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Jeff Squyres
On Aug 11, 2010, at 12:10 AM, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.  

Just to be totally clear: MPI_BCAST is defined to be "reliable", in the sense 
that it will complete or invoke an error (vs. unreliable data streams like UDP, 
where a sent packet may or may not arrive at the receiver).  

I think you're saying that something in your setup does not appear to be 
functioning properly -- possibly an OMPI bug, possibly TCP timeouts, possibly 
incorrect use of MPI, possibly ...etc.  But I just wanted to disambiguate the 
meaning of the word "reliable" here.

> Jeff says that all OpenMPI calls are implemented with point to point B-tree 
> style communications of log N transmissions

Just to clarify so that I'm not mis-quoted, I said: "All of Open MPI's 
network-based collectives use point-to-point communications underneath (shared 
memory may not, but that's not the issue here)".  

1. "Collectives" means a very different thing than "all Open MPI calls".
2. Some of our algorithms are not based on binary (or binomial -- it's not 
clear what you meant) trees.

Sorry to be so pedantic -- but mis-quotes like this have been the source of 
huge misunderstandings in the past.

It is also worth noting that Open MPI's collectives are implemented with 
plugins -- there's nothing preventing a new plugin that does *not* use 
point-to-point communication calls (like the shared memory collective 
implementations, or multicast, or some other kind of hardware collective 
offload, or ...).

Indeed, I should point out that my statement was not entirely correct because 
Voltaire just recently committed the "fca" plugin to the OMPI development trunk 
(to be introduced in OMPI v1.5) that uses IB hardware offloading for MPI 
collective implementations -- see their press releases and marketing material 
for how this stuff works.  Mellanox has slightly different MPI collective IB 
hardware offloading technology for Open MPI, too.

> So I guess that altoall would be N log N

I'm not sure of the complexity of OMPI's alltoall algorithms offhand.  I see at 
least 3 algorithms after *quick* look in the OMPI source code.  They probably 
all have their own complexities, but need to be viewed in the context of when 
those algorithms allow themselves to be used (e.g., O(N) may not matter if 
there's a small number of peers with small messages).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Jeff Squyres
On Aug 11, 2010, at 9:54 AM, Jeff Squyres wrote:

> (I'll say that OMPI's ALLGATHER algorithm is probably not well optimized for 
> massive data transfers like you describe)

Wrong wrong wrong -- I should have checked the code before sending.  I made the 
incorrect assumption that OMPI still only had a trivial gather implementation.  
It does not; there are several different algorithms that may be used for 
ALLGATHER (as determined on the fly at run time, blah, blah, blah).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Jeff Squyres
On Aug 10, 2010, at 10:09 PM, Randolph Pullen wrote:

> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data movement 
> program to effect an N to N solution.  The actual mechanism I am using is to 
> spawn N copies of mpirun from PVM across the cluster. So each 1 to N MPI 
> application starts at the same time with a different node as root.

You mentioned that each root has a large amount of data to broadcast.  How 
large?  

Have you done back-of-the-envelope kinds of calculations to determine if you're 
hitting link contention kinds of limits -- e.g., would running a series of N/M 
broadcasts sequentially actually result in a net speedup (vs. running all N 
broadcasts simultaneously) because of lack of network congestion / contention?

If the messages are as large as you imply, then link contention must be taken 
into account in overall performance, particularly if you're using more than 
just a handful of nodes.
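
(A purely illustrative back-of-the-envelope, with every number invented: with
N = 16 roots each broadcasting B = 1 GB over ~120 MB/s gigabit links, every
node must eventually receive (N-1) * B = 15 GB, so no node can finish in much
under 15 GB / 120 MB/s ~ 125 s whether the broadcasts run simultaneously or
back-to-back -- running them all at once only adds contention on the shared
links.)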

> Yes I know this is a bit odd…  It was an attempt to be lazy and not re-write 
> the code (again) and this appears to be a potential log N solution.

I'm not sure I understand that statement -- why would this be a log(n) solution 
if everyone is broadcasting simultaneously? (and therefore each root is 
assumedly using most/all available send bandwidth from its link)

> My thoughts are that the problem must be either:
> 
> 1)Some bug in my code that does not occur normally (this seems unlikely 
> because it halts in Bcast and runs in the normal 1 to N manner)
> 2)Something in MPI is fouling the bcast call
> 3)Something in PVM is fouling the bcast call
> 
> Obviously, this is not the PVM forum, but have I missed anything?

A fourth possibility is that the network is dropping something that it 
shouldn't be (with high link contention, this is possible).  You haven't 
mentioned it, but I'm assuming that you're running over ethernet -- perhaps you're 
running into TCP drops and therefore (very long) TCP retransmit timeouts.

If you want to remove PVM from the equation, you could mpirun a trivial 
bootstrap application across all your nodes that, on each MCW rank process, 
calls MPI_COMM_SPAWN on MPI_COMM_SELF for the broadcast that is supposed to be 
rooted on that node.
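
A rough sketch of that bootstrap, assuming a separate "bcast_app" binary and 6
processes per spawned job (both made up here):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Init(&argc, &argv);
    /* each rank of the bootstrap job independently launches the broadcast
       job that is rooted on its own node                                  */
    MPI_Comm_spawn("bcast_app", MPI_ARGV_NULL, 6 /* made-up count */,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF, &child,
                   MPI_ERRCODES_IGNORE);
    /* ... optionally exchange a completion token over 'child' ... */
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}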

> BTW: Implementing Bcast with Multicast or a combination of both multicasts 
> and p2p transfers is another option and described by Hoefler et. al. in their 
> paper “A practically constant-time MPI Broadcast Algorithm for large-scale 
> InfiniBand Clusters with Multicast”.

Yep; I've read it.  Torsten's a smart guy.  :-)  I'd love to see a plugin 
contributed that implements this algorithm, or one of other reliable multicast 
algorithms.

Keep in mind that if N (where N is large) roots are all transmitting very large 
multicast messages simultaneously, this is a situation where networks are free 
to drop packets.  In a pathological case like yours, N simultaneous multicasts may not 
perform as well as you would expect.

> From here I need to decide to:
> 
> 1)Generate a minimal example but given that this will require PVM, it is 
> unlikely to see much use.

I think if you can write a small MPI-only example, that would be most helpful.

> 2)Write a N to N transfer system in MPI using inter-communicators, 
> however this may not scale with only p2p transfers and is probably N Log N at 
> best.

Intercommunicators are a red herring here.  They were mentioned earlier in the 
thread because people thought you were using the MPI accept/connect model of 
joining multiple MPI processes together.  If you aren't doing that, intercomms 
are likely unnecessary.

> 3)Write the N to N transfer system in PVM, Open Fabric calls or something 
> that supports broadcast/multicast calls.

I'm not sure if OpenFabrics verbs support multicast.  Mellanox ConnectX cards 
were supposed to do this eventually, but I don't know if that capability ever 
was finished (Cisco left the IB business a while ago, so I've stopped paying 
attention to IB developments).

> My application must transfer a large (potentially huge) amount of tuples from 
> a table distributed across the cluster to a table replicated on each node.  
> The similar (1 to N) system compresses tuples into 64k pages and sends these. 
>  The same method would be used and the page size could be varied for 
> efficiency.
> 
> What are your thoughts?  Can OpenMPI do this in under N log N time?

(Open) MPI is just a message passing library -- in terms of raw bandwidth 
transfer, it can pretty much do anything that your underlying network can do.  
Whether MPI_BCAST or MPI_ALLGATHER is the right mechanism or not is a different 
issue.

(I'll say that OMPI's ALLGATHER algorithm is probably not well optimized for 
massive data transfers like you describe)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Richard Treumann
Randolph,

I am confused about using multiple, concurrent mpirun operations.  If 
there are M uses of mpirun and each starts N tasks (carried out under pvm 
or any other way) I would expect you to have M completely independent MPI 
jobs with N tasks (processes) each.  You could have some root in each of 
the M MPI jobs do an MPI_Bcast to the other (N-1) in that job, but there is 
no way in MPI (without using accept/connect) to get tasks of job 0 to give 
data to tasks of jobs 1-(M-1).
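
(For concreteness, the per-job broadcast is just the usual collective
pattern -- a minimal sketch, with the buffer size and root rank invented:)

#include <mpi.h>

int main(int argc, char **argv)
{
    double buf[4096];                  /* made-up payload size */
    int rank, root = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == root) {
        for (int i = 0; i < 4096; i++) buf[i] = i;   /* root fills the data */
    }
    /* every rank of THIS job calls the broadcast collectively; ranks in the
       other M-1 jobs have their own MPI_COMM_WORLD and are untouched        */
    MPI_Bcast(buf, 4096, MPI_DOUBLE, root, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}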

With M uses of mpirun, you have M worlds that are forever isolated from 
the other M-1 worlds (again, unless you do accept/connect).

In what sense are you treating this as a single MxN application?  (I 
use M & N to keep them distinct. I assume if M == N, we have your case.)


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Randolph Pullen
Sure, but broadcasts are faster - less reliable apparently, but much faster for 
large clusters.  Jeff says that all OpenMPI calls are implemented with 
point-to-point B-tree style communications of log N transmissions, so I guess 
that alltoall would be N log N.

--- On Wed, 11/8/10, Terry Frankcombe  wrote:

From: Terry Frankcombe 
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" 
Received: Wednesday, 11 August, 2010, 1:57 PM

On Tue, 2010-08-10 at 19:09 -0700, Randolph Pullen wrote:
> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data
> movement program to effect an N to N solution.

I'm no MPI guru, nor do I completely understand what you are doing, but
isn't this an allgather (or possibly an alltoall)?




Re: [OMPI users] MPI_Bcast issue

2010-08-11 Thread Terry Frankcombe
On Tue, 2010-08-10 at 19:09 -0700, Randolph Pullen wrote:
> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data
> movement program to effect an N to N solution.

I'm no MPI guru, nor do I completely understand what you are doing, but
isn't this an allgather (or possibly an alltoall)?
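
(For what it's worth, that N-to-N replication maps naturally onto
MPI_Allgather -- or MPI_Allgatherv when each node contributes a different
amount. A minimal fixed-size sketch, with the per-rank payload size invented:)

#include <mpi.h>
#include <stdlib.h>

#define PAGE_INTS 16384                /* ~64 KB of ints per rank, made up */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *local = malloc(PAGE_INTS * sizeof(int));                 /* my block */
    int *table = malloc((size_t)nprocs * PAGE_INTS * sizeof(int));
    for (int i = 0; i < PAGE_INTS; i++)
        local[i] = rank;               /* stand-in for the local tuples */

    /* every rank ends up with every rank's block: the N-to-N replication */
    MPI_Allgather(local, PAGE_INTS, MPI_INT,
                  table, PAGE_INTS, MPI_INT, MPI_COMM_WORLD);

    free(local);
    free(table);
    MPI_Finalize();
    return 0;
}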