Re: [OMPI users] MPI_Bcast issue
Interesting point.

--- On Thu, 12/8/10, Ashley Pittman wrote:
[quoted text snipped; see Ashley's post below]

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_Bcast issue
I (a single user) am running N separate MPI applications doing 1-to-N broadcasts over PVM; each MPI application is started on each machine simultaneously by PVM (the reasons are back in the post history). The problem is that they somehow collide. Yes, I know this should not happen; the question is why.

--- On Wed, 11/8/10, Richard Treumann wrote:
[quoted text snipped; see Dick Treumann's post below]
Re: [OMPI users] Hyper-thread architecture effect on MPI jobs
The way MPI processes are being assigned to hardware threads is perhaps neither controlled nor optimal. On the HT nodes, two processes may end up sharing the same core, with poorer performance. Try submitting your job like this:

% cat myrankfile1
rank 0=os223 slot=0
rank 1=os221 slot=0
rank 2=os222 slot=0
rank 3=os224 slot=0
rank 4=os228 slot=0
rank 5=os229 slot=0
rank 6=os223 slot=1
rank 7=os221 slot=1
rank 8=os222 slot=1
rank 9=os224 slot=1
rank 10=os228 slot=1
rank 11=os229 slot=1
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=2
rank 17=os229 slot=2
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile myrankfile1 ./a.out

You can also try:

% cat myrankfile2
rank 0=os223 slot=0
rank 1=os221 slot=0
rank 2=os222 slot=0
rank 3=os224 slot=0
rank 4=os228 slot=0
rank 5=os229 slot=0
rank 6=os223 slot=1
rank 7=os221 slot=1
rank 8=os222 slot=1
rank 9=os224 slot=1
rank 10=os228 slot=2
rank 11=os229 slot=2
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=4
rank 17=os229 slot=4
% mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile myrankfile2 ./a.out

Which one reproduces your problem and which one avoids it depends on how the BIOS numbers your HTs. Once you can confirm you understand the problem, you (with the help of this list) can devise a solution approach for your situation.

Saygin Arkan wrote: Hello, I'm running MPI jobs in a non-homogeneous cluster.
[remainder of quoted message snipped; see the original post below]
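The round-robin pattern in myrankfile1 above is mechanical enough to script. Here is a small sketch (this generator is hypothetical, not an Open MPI tool; the host order and slots-per-host are taken from the example above):

```python
def make_rankfile(hosts, slots_per_host):
    """Emit rankfile lines, cycling ranks across hosts first and
    then across slots -- the same layout as myrankfile1 above."""
    lines = []
    for rank in range(len(hosts) * slots_per_host):
        host = hosts[rank % len(hosts)]
        slot = rank // len(hosts)
        lines.append("rank %d=%s slot=%d" % (rank, host, slot))
    return "\n".join(lines)

hosts = ["os223", "os221", "os222", "os224", "os228", "os229"]
rankfile = make_rankfile(hosts, 3)
assert rankfile.splitlines()[0] == "rank 0=os223 slot=0"
assert rankfile.splitlines()[17] == "rank 17=os229 slot=2"
```

Redirecting the output to a file gives something mpirun's --rankfile option can consume, assuming the slot numbering matches how the BIOS numbers your hardware threads.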
Re: [OMPI users] Hyper-thread architecture effect on MPI jobs
Hi Saygin

You could:

1) turn off hyperthreading (in the BIOS), or
2) use the mpirun options (you didn't send your mpirun command) to distribute the processes across the nodes, cores, etc. "man mpirun" is a good resource; see the explanations of the -byslot, -bynode, and -loadbalance options.
3) In addition, you can use the MCA parameters to set processor affinity on the mpirun command line: "mpirun -mca mpi_paffinity_alone 1 ...". I don't know how this will play on a hyperthreaded machine, but it works fine on our dual-processor quad-core computers (not hyperthreaded).

Depending on your code, hyperthreading may not help performance anyway.

I hope this helps,
Gus Correa

Saygin Arkan wrote:
[quoted message snipped; see the original post below]
Re: [OMPI users] Hyper-thread architecture effect on MPI jobs
Saygin,

You can use the mpstat tool to see the load on each core at runtime. Do you know exactly which particular calls are taking longer? You can run just those two computations (one at a time) on a different machine and check whether the other machines have similar or lower computation times.

- Pooja

On Wed, Aug 11, 2010 at 10:55 AM, Saygin Arkan wrote:
> [quoted message snipped; see the original post below]
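As a complement to mpstat, per-core load can also be read directly from /proc/stat, whose per-CPU lines carry cumulative jiffy counters in the standard order user, nice, system, idle, iowait, and so on. A minimal sketch (the sample numbers below are invented for illustration):

```python
SAMPLE_PROC_STAT = """\
cpu0 4705 150 1120 16250 520 16 25 0
cpu1 4900 130 1045 30000 400 12 20 0
"""  # invented numbers, for illustration only

def idle_fraction(stat_line):
    """Return (cpu name, fraction of jiffies spent idle)."""
    fields = stat_line.split()
    name, counters = fields[0], [int(x) for x in fields[1:]]
    return name, counters[3] / sum(counters)  # 4th counter is idle

for line in SAMPLE_PROC_STAT.splitlines():
    name, frac = idle_fraction(line)
    print("%s is %.0f%% idle" % (name, frac * 100))
```

In practice you would sample the file twice and diff the counters, since the values are cumulative since boot; a core that stays busy between samples is a candidate for the slow ranks above.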
[OMPI users] Hyper-thread architecture effect on MPI jobs
Hello,

I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines have the following properties, os221, os222, os223, os224:

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 4999.40
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

and the problematic, hyper-threaded 2 machines are as follows, os228 and os229:

vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
stepping : 5
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida
bogomips : 5396.88
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

The problem is: those 2 machines seem to have 8 cores (virtually; the actual core count is 4). When I submit an MPI job, I calculated the comparison times in the cluster. I got strange results.

I'm running the job on 6 nodes, 3 cores per node. And sometimes (I can say in 1/3 of the tests) os228 or os229 returns strange results: 2 cores are slow (slower than the first 4 nodes) but the 3rd core is extremely fast.

2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing Times...
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4) :37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5) :34 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7) :39 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8) :37 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(10) :*48 sec*
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(11) :35 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(12) :38 sec
2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(13) :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222 RANK(14) :37 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224 RANK(15) :38 sec
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228 RANK(16) :*43 sec*
2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os229 RANK(17) :35 sec
TOTAL CORRELATION TIME: 48 sec

or another test:

2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - RANK(0) Printing Times...
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(1) :170 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os222 RANK(2) :161 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os224 RANK(3) :158 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os228 RANK(4) :142 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os229 RANK(5) :*256 sec*
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os223 RANK(6) :156 sec
2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(7) :162 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(8) :159 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(9) :168 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(10) :141 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os229 RANK(11) :136 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os223 RANK(12) :173 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os221 RANK(13) :164 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(14) :171 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(15) :156 sec
2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(16) :136 sec
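The siblings vs. cpu cores mismatch in the /proc/cpuinfo output above is the standard way to detect hyper-threading: siblings counts hardware threads per physical package, cpu cores counts physical cores. A minimal sketch using the two values shown above:

```python
def is_hyperthreaded(cpuinfo_text):
    """A package is hyper-threaded when it reports more hardware
    threads (siblings) than physical cores (cpu cores)."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return int(fields["siblings"]) > int(fields["cpu cores"])

# values taken from the /proc/cpuinfo excerpts above
q9300 = "siblings : 4\ncpu cores : 4"   # os221-os224
i7_920 = "siblings : 8\ncpu cores : 4"  # os228, os229

assert not is_hyperthreaded(q9300)
assert is_hyperthreaded(i7_920)
```

So os228 and os229 expose 8 logical CPUs backed by only 4 physical cores, which is why two ranks pinned to sibling logical CPUs can run slowly while a rank on an uncontended core runs fast.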
Re: [OMPI users] MPI_Bcast issue
On 11 Aug 2010, at 05:10, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster for large clusters.

Going off-topic here, but I think it's worth saying: if you have a dataset that requires collective communication, then use the function call that best matches what you are trying to do. Far too many people try to re-implement the collectives in their own code, and it nearly always goes badly. As someone who's spent many years implementing collectives, I've lost count of the number of times I've made someone's code go faster by replacing 500+ lines of code with a single call to MPI_Gather().

In the rare case that you find that some collectives are slower than they should be for your specific network and message size, the best thing to do is to work with the Open MPI developers to tweak the thresholds so that a better algorithm gets picked by the library.

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] MPI_Bcast issue
On Aug 11, 2010, at 12:10 AM, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster for large clusters.

Just to be totally clear: MPI_BCAST is defined to be "reliable", in the sense that it will complete or invoke an error (vs. unreliable data streams like UDP, where a sent packet may or may not arrive at the receiver). I think you're saying that something in your setup does not appear to be functioning properly -- possibly an OMPI bug, possibly TCP timeouts, possibly incorrect use of MPI, possibly something else. But I just wanted to disambiguate the meaning of the word "reliable" here.

> Jeff says that all OpenMPI calls are implemented with point to point B-tree style communications of log N transmissions

Just to clarify so that I'm not mis-quoted, I said: "All of Open MPI's network-based collectives use point-to-point communications underneath (shared memory may not, but that's not the issue here)".

1. "Collectives" means a very different thing than "all Open MPI calls".
2. Some of our algorithms are not based on binary (or binomial -- it's not clear what you meant) trees.

Sorry to be so pedantic -- but mis-quotes like this have been the source of huge misunderstandings in the past.

It is also worth noting that Open MPI's collectives are implemented with plugins -- there's nothing preventing a new plugin that does *not* use point-to-point communication calls (like the shared memory collective implementations, or multicast, or some other kind of hardware collective offload, or ...). Indeed, I should point out that my statement was not entirely correct, because Voltaire just recently committed the "fca" plugin to the OMPI development trunk (to be introduced in OMPI v1.5) that uses IB hardware offloading for MPI collective implementations -- see their press releases and marketing material for how this stuff works. Mellanox has slightly different MPI collective IB hardware offloading technology for Open MPI, too.
> So I guess that altoall would be N log N

I'm not sure of the complexity of OMPI's alltoall algorithms offhand. I see at least 3 algorithms after a *quick* look in the OMPI source code. They probably all have their own complexities, but they need to be viewed in the context of when those algorithms allow themselves to be used (e.g., O(N) may not matter if there's a small number of peers with small messages).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
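For intuition about the log N claim under discussion: in a binomial-tree broadcast, the set of ranks holding the data doubles every round, so all N ranks are reached in ceil(log2 N) rounds. A toy simulation (plain Python, not Open MPI's actual implementation):

```python
import math

def binomial_bcast_rounds(n_ranks):
    """Count rounds for a binomial-tree broadcast: every rank that
    already holds the data forwards it to one new rank per round."""
    have = {0}  # only the root holds the data initially
    rounds = 0
    while len(have) < n_ranks:
        missing = [r for r in range(n_ranks) if r not in have]
        # each current holder sends to at most one rank still missing it
        for _, receiver in zip(range(len(have)), missing):
            have.add(receiver)
        rounds += 1
    return rounds

# the set of holders doubles each round, so rounds == ceil(log2(N))
for n in (2, 8, 18, 1024):
    assert binomial_bcast_rounds(n) == math.ceil(math.log2(n))
```

Note this counts rounds for a single broadcast; it says nothing about the bandwidth each message gets when many broadcasts run at once, which is the separate contention question discussed below in the thread.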
Re: [OMPI users] MPI_Bcast issue
On Aug 11, 2010, at 9:54 AM, Jeff Squyres wrote:

> (I'll say that OMPI's ALLGATHER algorithm is probably not well optimized for massive data transfers like you describe)

Wrong wrong wrong -- I should have checked the code before sending. I made the incorrect assumption that OMPI still only had a trivial gather implementation. It does not; there are several different algorithms that may be used for ALLGATHER (as determined on the fly at run time, blah, blah, blah).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] MPI_Bcast issue
On Aug 10, 2010, at 10:09 PM, Randolph Pullen wrote:

> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data movement program to affect an N to N solution. The actual mechanism I am using is to spawn N copies of mpirun from PVM across the cluster. So each 1 to N MPI application starts at the same time with a different node as root.

You mentioned that each root has a large amount of data to broadcast. How large? Have you done back-of-the-envelope kinds of calculations to determine if you're hitting link-contention limits -- e.g., would running a series of N/M broadcasts sequentially actually result in a net speedup (vs. running all N broadcasts simultaneously) because of the lack of network congestion/contention? If the messages are as large as you imply, then link contention must be taken into account in overall performance, particularly if you're using more than just a handful of nodes.

> Yes I know this is a bit odd… It was an attempt to be lazy and not re-write the code (again) and this appears to be a potential log N solution.

I'm not sure I understand that statement -- why would this be a log(N) solution if everyone is broadcasting simultaneously? (and therefore each root is assumedly using most/all available send bandwidth from its link)

> My thoughts are that the problem must be either:
>
> 1) Some bug in my code that does not occur normally (this seems unlikely because it halts in Bcast and runs in the normal 1 to N manner)
> 2) Something in MPI is fouling the bcast call
> 3) Something in PVM is fouling the bcast call
>
> Obviously, this is not the PVM forum, but have I missed anything?

A fourth possibility is that the network is dropping something that it shouldn't be (with high link contention, this is possible). You haven't mentioned it, but I'm assuming that you're running over Ethernet -- perhaps you're running into TCP drops and therefore (very long) TCP retransmit timeouts.

If you want to remove PVM from the equation, you could mpirun a trivial bootstrap application across all your nodes that, on each MCW rank process, calls MPI_COMM_SPAWN on MPI_COMM_SELF for the broadcast that is supposed to be rooted on that node.

> BTW: Implementing Bcast with Multicast or a combination of both multicasts and p2p transfers is another option, as described by Hoefler et al. in their paper “A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast”.

Yep; I've read it. Torsten's a smart guy. :-) I'd love to see a plugin contributed that implements this algorithm, or one of the other reliable multicast algorithms. Keep in mind that if N (where N is large) roots are all transmitting very large multicast messages simultaneously, this is a situation where networks are free to drop. In a pathological case like yours, N simultaneous multicasts may not perform as well as you would expect.

> From here I need to decide to:
>
> 1) Generate a minimal example, but given that this will require PVM, it is unlikely to see much use.

I think if you can write a small MPI-only example, that would be most helpful.

> 2) Write an N to N transfer system in MPI using inter-communicators; however, this may not scale with only p2p transfers and is probably N log N at best.

Intercommunicators are a red herring here. They were mentioned earlier in the thread because people thought you were using the MPI accept/connect model of joining multiple MPI processes together. If you aren't doing that, intercomms are likely unnecessary.

> 3) Write the N to N transfer system in PVM, Open Fabric calls or something that supports broadcast/multicast calls.

I'm not sure if OpenFabrics verbs support multicast. Mellanox ConnectX cards were supposed to do this eventually, but I don't know if that capability was ever finished (Cisco left the IB business a while ago, so I've stopped paying attention to IB developments).

> My application must transfer a large (potentially huge) amount of tuples from a table distributed across the cluster to a table replicated on each node. The similar (1 to N) system compresses tuples into 64k pages and sends these. The same method would be used and the page size could be varied for efficiency.
>
> What are your thoughts? Can OpenMPI do this in under N log N time?

(Open) MPI is just a message passing library -- in terms of raw bandwidth transfer, it can pretty much do anything that your underlying network can do. Whether MPI_BCAST or MPI_ALLGATHER is the right mechanism is a different issue. (I'll say that OMPI's ALLGATHER algorithm is probably not well optimized for massive data transfers like you describe.)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
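The back-of-the-envelope check Jeff suggests can be sketched with a crude serialization model (the link speed and data size below are purely illustrative assumptions): if a node's inbound link is shared fairly by N concurrent broadcasts, each broadcast sees 1/N of the bandwidth, so running all N at once takes no less wall time than running them one after another -- and any packet drops or TCP retransmit timeouts under contention only make the simultaneous case worse.

```python
def bcast_time_seconds(data_bytes, link_bandwidth_bps, concurrent_roots):
    """Crude model: a shared link splits its bandwidth evenly
    across all broadcasts running at once."""
    per_bcast_bandwidth = link_bandwidth_bps / concurrent_roots
    return data_bytes / per_bcast_bandwidth

GIG_E_BYTES_PER_SEC = 125e6  # ~1 Gb/s in bytes/sec (assumed link)
DATA = 1e9                   # 1 GB per root (illustrative)
N_ROOTS = 18

sequential = N_ROOTS * bcast_time_seconds(DATA, GIG_E_BYTES_PER_SEC, 1)
simultaneous = bcast_time_seconds(DATA, GIG_E_BYTES_PER_SEC, N_ROOTS)

# even in this ideal model, overlapping the broadcasts buys nothing;
# real networks add drops and retransmit timeouts on top of this
assert abs(sequential - simultaneous) < 1e-6
```

The model ignores topology, protocol overhead, and tree-based pipelining, but it is enough to show why "all N roots at once" is not automatically faster than a staggered schedule.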
Re: [OMPI users] MPI_Bcast issue
Randolph

I am confused about using multiple, concurrent mpirun operations. If there are M uses of mpirun and each starts N tasks (carried out under PVM or any other way), I would expect you to have M completely independent MPI jobs with N tasks (processes) each. You could have some root in each of the M MPI jobs do an MPI_Bcast to the other (N-1) in that job, but there is no way in MPI (without using accept/connect) to get tasks of job 0 to give data to tasks of jobs 1..(M-1).

With M uses of mpirun, you have M worlds that are forever isolated from the other M-1 worlds (again, unless you do accept/connect). In what sense are you treating this as a single MxN application? (I use M & N to keep them distinct. I assume if M == N, we have your case.)

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
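Dick's point can be illustrated with a toy model (plain Python, not MPI code): M mpirun invocations create M disjoint sets of tasks, and a broadcast rooted in one job can only ever reach that job's own tasks.

```python
def launch_jobs(m_jobs, n_tasks):
    """Model M independent mpiruns: each job is its own isolated
    'world' containing only its own N tasks."""
    return [{(job, rank) for rank in range(n_tasks)} for job in range(m_jobs)]

def bcast(world, data, mailbox):
    """A broadcast delivers data to every task in one world only."""
    for task in world:
        mailbox[task] = data

M = N = 3
worlds = launch_jobs(M, N)
mailbox = {}
bcast(worlds[0], "payload", mailbox)

# job 0's tasks all got the data; jobs 1..M-1 are unreachable
assert all((0, r) in mailbox for r in range(N))
assert all((j, r) not in mailbox for j in range(1, M) for r in range(N))
```

Crossing between the worlds would require the MPI_Comm_accept / MPI_Comm_connect mechanism Dick mentions, or launching all M*N tasks under a single mpirun in the first place.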
Re: [OMPI users] MPI_Bcast issue
Sure, but broadcasts are faster - less reliable apparently, but much faster for large clusters.

Jeff says that all OpenMPI calls are implemented with point to point B-tree style communications of log N transmissions. So I guess that altoall would be N log N.

--- On Wed, 11/8/10, Terry Frankcombe wrote:
[quoted text snipped; see Terry's post below]
Re: [OMPI users] MPI_Bcast issue
On Tue, 2010-08-10 at 19:09 -0700, Randolph Pullen wrote:

> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data movement program to affect an N to N solution.

I'm no MPI guru, nor do I completely understand what you are doing, but isn't this an allgather (or possibly an alltoall)?
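Terry's observation can be checked with a toy model (plain Python, not MPI): N concurrent 1-to-N broadcasts, one rooted at each rank, leave every rank holding every rank's contribution -- exactly the end state of a single MPI_Allgather.

```python
def n_concurrent_bcasts(local_data):
    """Every rank broadcasts its own item to all ranks."""
    n = len(local_data)
    received = [[None] * n for _ in range(n)]
    for root in range(n):
        for dest in range(n):
            received[dest][root] = local_data[root]
    return received

def allgather(local_data):
    """The buffer MPI_Allgather leaves on every rank."""
    return [list(local_data) for _ in local_data]

data = ["tuples-from-rank-%d" % r for r in range(6)]
assert n_concurrent_bcasts(data) == allgather(data)
```

This is why a single MPI_Allgather call (or MPI_Alltoall, when each destination should receive distinct data) is the natural replacement for the N concurrent mpirun invocations discussed earlier in the thread.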