Re: [OMPI users] Problems with mpirun
On Friday 03 September 2010, Alexander Kalinin wrote:
> Hello!
>
> I have a problem to run an MPI program. My command line is:
> $ mpirun -np 1 ./ksurf
>
> But I got an error:
> [0,0,0] mca_oob_tcp_init: invalid address '' returned for selected oob
> interfaces.
> [0,0,0] ORTE_ERROR_LOG: Error in file oob_tcp.c at line 880
>
> My working environment is: Fedora 7, openmpi-1.1
>
> Is it possible to treat this problem?

Both Fedora 7 and Open MPI 1.1 are ancient. I'd suggest you upgrade to current versions before you invest time debugging this.

/Peter
Re: [OMPI users] Low Open MPI performance on InfiniBand and shared memory?
On Friday 09 July 2010, Andreas Schäfer wrote:
> Thanks, those were good suggestions.
>
> On 11:53 Fri 09 Jul , Peter Kjellstrom wrote:
> > On an E5520 (nehalem) node I get ~5 GB/s ping-pong for >64K sizes.
>
> I just tried a Core i7 system which maxes at 6550 MB/s for the
> ping-pong test.

It makes quite some difference whether the ranks end up on the same socket or on different sockets (on an i7 you only have one).

> > On QDR IB on similar nodes I get ~3 GB/s ping-pong for >256K.
>
> I'll try to find an Intel system to repeat the tests. Maybe it's AMD's
> different memory subsystem/cache architecture which is slowing Open
> MPI? Or are my systems just badly configured?

8x PCI-Express gen2 5 GT/s should show figures like mine. If it's PCI-Express gen1, or gen2 2.5 GT/s, or 4x, or if the IB link only came up with two lanes, then 1500 MB/s is expected.

/Peter
Re: [OMPI users] Low Open MPI performance on InfiniBand and shared memory?
On Friday 09 July 2010, Andreas Schäfer wrote:
> Hi,
>
> I'm evaluating Open MPI 1.4.2 on one of our BladeCenters and I'm
> getting about 1550 MB/s via InfiniBand and about 1770 MB/s via shared
> memory for the PingPong benchmark in Intel's MPI benchmark. (That
> benchmark is just an example, I'm seeing similar numbers for my own
> codes.)

Two factors make a big difference: the size of the operations and the type of node (CPU model).

On an E5520 (nehalem) node I get ~5 GB/s ping-pong for >64K sizes. On QDR IB on similar nodes I get ~3 GB/s ping-pong for >256K.

Numbers are for 1.4.1, YMMV. I couldn't find an AMD node similar to yours, sorry.

/Peter

> Each node has two AMD hex-cores and two 40 Gbps InfiniBand ports, so I
> wonder if I shouldn't be getting a significantly higher throughput on
> InfiniBand. Considering the CPUs' memory bandwidth, I believe that
> shared memory throughput should be much higher as well.
>
> Are those numbers what is to be expected? If not: any ideas how to
> debug this or tune Open MPI?
>
> Thanks in advance
> -Andreas
>
> ps: if it's any help, this is what iblinkinfo is telling me
> (tests were run on faui36[bc])
Re: [OMPI users] (no subject)
On Friday 11 June 2010, asmae.elbahlo...@mpsa.com wrote:
> Hello
> i have a problem with parFoam, when i type in the terminal paraFoam, it
> launches nothing but in the terminal i have :

This is the Open MPI mailing list, not OpenFOAM. I suggest you contact the team behind OpenFOAM.

I also suggest that you post plain text to mailing lists in the future and not HTML (and while you're at it, do use a descriptive subject line).

/Peter

> tta201@linux-qv31:/media/OpenFoam/FOAMpro/FOAMpro-1.5-2.2/FOAM-1.5-2.2/tutorials/icoFoam/cavity> paraFoam
> Xlib: extension "GLX" missing on display ":0.0".
> [the line above repeated many times]
> ERROR: In /home/kitware/Dashboard/MyTests/ParaView-3-8/ParaView-3.8/ParaView/VTK/Rendering/vtkXOpenGLRenderWindow.cxx, line 404
> vtkXOpenGLRenderWindow (0x117b3d0): Could not find a decent visual
> ERROR: In /home/kitware/Dashboard/MyTests/ParaView-3-8/ParaView-3.8/ParaView/VTK/Rendering/vtkXOpenGLRenderWindow.cxx, line 611
> vtkXOpenGLRenderWindow (0x117b3d0): GLX not found.
> Aborting.
>
> /media/OpenFoam/FOAMpro/FOAMpro-1.5-2.2/FOAM-1.5-2.2/bin/paraFoam: line 81:
> 15497 Aborted paraview --data=$caseFile
>
> I don't understand the problem, can someone help me please?
> thanks

--
Peter Kjellström | E-mail: c...@nsc.liu.se
National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Re: [OMPI users] Trouble building openmpi 1.2.8 with intel compilers 10.0.23
On Monday 05 April 2010, Steve Swanekamp (L3-Titan Contractor) wrote:
> When I try to run the configure utility I get the message that the c++
> compiler can not compile simple c programs. Any ideas?

(At least some) Intel compilers need the gcc-c++ distribution package. Have you tested icpc with a simple c++ program?

/Peter
Re: [OMPI users] mpi error?
On Thursday 11 March 2010, Matthew MacManes wrote:
> Can anybody tell me if this is an error associated with openmpi, versus an
> issue with the program I am running (MRBAYES,
> https://sourceforge.net/projects/mrbayes/)
>
> We are trying to run a large simulated dataset using 1,000,000 bases
> divided up into 1000 genes, 5 taxa. An error is occurring, but we are not
> sure why. We are using the MPI version of MRBAYES v3.2-cvs on a linux
> 16-core 24GB RAM machine. It does not appear as if the program runs out
> of memory (max memory usage is 13gb). Maybe this is an OpenMPI problem
> and not related to MrBayes...
>
> See snippet of error message below. Can anybody give me any hints about
> the source of the problem?
>
> I am using OPENMPI version 1.4.1.
>
> ...
> Defining charset called gene997
> Defining charset called gene998
> Defining charset called gene999
> Defining charset called gene1000
> Defining partition called Genes
> [macmanes:02546] *** Process received signal ***
> [macmanes:02546] Signal: Segmentation fault (11)
> [macmanes:02546] Signal code: Address not mapped (1)
> [macmanes:02546] Failing at address: (nil)
> [macmanes:02546] [ 0] /lib/libpthread.so.0 [0x7ffd0f322190]
> [macmanes:02546] *** End of error message ***
> --
> mpirun noticed that process rank 13 with PID 2546 on node macmanes exited
> on signal 11 (Segmentation fault).

One of the ranks got a "Segmentation fault". This would typically indicate a problem with the app, not the MPI. Maybe you ran out of stack space? (ulimit -s). Have you tried a different/lower number of ranks?

/Peter
Re: [OMPI users] Problems compiling OpenMPI 1.4 with PGI 9.0-3
On Wednesday 06 January 2010, Tim Miller wrote:
> Hi All,
>
> I am trying to compile OpenMPI 1.4 with PGI 9.0-3 and am getting the
> following error in configure:
>
> checking for functional offsetof macro... no
> configure: WARNING: Your compiler does not support offsetof macro
> configure: error: Configure: Cannot continue
>
> I have searched around and found that this error occurs because of a
> problem in the configure scripts when PGI 10 is used, but I'm using 9.0-3
> which should not have the configure script issue. Here is the output of
> pgcc -V:
>
> pgcc 9.0-3 64-bit target on x86-64 Linux -tp k8-64e
> Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
> Copyright 2000-2009, STMicroelectronics, Inc. All Rights Reserved.
>
> I'm not sure what's wrong here as other people have reported being able
> to build OpenMPI with PGI 9. Does anyone have any ideas?

Maybe a late enough PGI-9 behaves like PGI-10. You could try the 1.4.1-rc1, which should work with PGI-10, and see if it fixes your problem too.

/Peter

> Thanks,
> Tim Miller
Re: [OMPI users] (no subject)
On Friday 30 October 2009, Konstantinos Angelopoulos wrote:
> good part of the day,
>
> I am trying to run a parallel program (that used to run in a cluster) in
> my double core pc. Could openmpi simulate the distribution of the
> parallel jobs to my 2 processors

If your program is an MPI program then, yes, Open MPI on your PC would allow you to use both cores (assuming your job can fit on the PC, of course).

> meaning will qsub work even if it is not a real
> cluster?

qsub has nothing to do with MPI; it belongs to the workload management system or batch queue system. You could install one of these on your PC as well (see for example Torque, SGE or Slurm).

/Peter

> thank you for reading my message and for any answer.
>
> Konstantinos Angelopoulos
Re: [OMPI users] Openmpi setup with intel compiler.
On Wednesday 30 September 2009, vighn...@aero.iitb.ac.in wrote:
...
> during configuring with Intel 9.0 compiler the installation gives the
> following error.
>
> [root@test_node openmpi-1.3.3]# make all install
...
> make[3]: Entering directory `/tmp/openmpi-1.3.3/orte'
> test -z "/share/apps/mpi/openmpi/intel/lib" || /bin/mkdir -p
> "/share/apps/mpi/openmpi/intel/lib"
> /bin/sh ../libtool --mode=install /usr/bin/install -c 'libopen-rte.la'
> '/share/apps/mpi/openmpi/intel/lib/libopen-rte.la'
> libtool: install: error: cannot install `libopen-rte.la' to a directory
> not ending in /share/apps/mpi/openmpi/pgi/lib

The line above indicates that you've somehow attempted this from a dirty tree and/or environment (dirty from the previous pgi installation...).

Try a clean environment and a clean build tree. Source the icc/ifort vars.sh files from your Intel install dir, set CC, CXX, FC, F77 and do:

  ./configure --prefix=... && make && make install

/Peter
Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?
On Tuesday 29 September 2009, Rahul Nabar wrote:
> On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh wrote:
> > to know. It sounds like you want to be able to watch some % utilization
> > of a hardware interface as the program is running. I *think* these
> > tools (the ones on the FAQ, including MPE, Vampir, and Sun Studio) are
> > not of that class.
>
> You are correct. A real time tool would be best that sniffs at the MPI
> traffic. Post mortem profilers would be the next best option I assume.
> I was trying to compile MPE but gave up. Too many errors. Trying to
> decide if I should prod on or look at another tool.

Not MPI-aware, but you could watch network traffic in real time with a tool such as collectl.

/Peter
Re: [OMPI users] very bad parallel scaling of vasp using openmpi
On Wednesday 23 September 2009, Rahul Nabar wrote:
> On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote:
> > Most of that bandwidth is in marketing... Sorry, but it's not a high
> > performance switch.
>
> Well, how does one figure out what exactly is a "high performance
> switch"?

IMHO 1G Ethernet won't be enough ("high performance" or not). Get yourself some cheap IB HCAs and a switch. The only chance you have with Ethernet is to run some sort of bypass protocol (OpenMX etc.) and tune your NICs.

/Peter

> I've found this an exceedingly hard task. Like the OP posted,
> the Dell 6248 is rated to give more than a fully subscribed backbone
> capacity. I do not know of any good third-party test lab, nor do I
> know any switch load testing benchmarks that'd take a switch through
> its paces.
>
> So, how does one go about selecting a good switch? "The most expensive
> the better" is somewhat an unsatisfying option!
Re: [OMPI users] Help: Infiniband interface hang
Could you guys please trim your e-mails. No one wants to scroll past 100K-200K of old context to see the update (not to mention wasting storage space for people).

/Peter
Re: [OMPI users] scaling problem with openmpi
On Wednesday 20 May 2009, Roman Martonak wrote:
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works, the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected. It was clearly the basic linear algorithm that
> was causing the problem. I will check the performance of bruck and
> pairwise on my HW. It would be nice if it could be tuned further.

I'm guessing you'll see even better performance if you change 8192 to 131072 in that config file. That moves up the crossover point between "bruck" and "pairwise".

/Peter

> Thanks
>
> Roman
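For readers who have not seen one, a coll_tuned dynamic rules file implementing the change discussed above might look roughly like this. This is a hedged sketch: the exact layout, the collective ID (3 for alltoall) and the algorithm numbers (2 = pairwise, 3 = bruck) are assumptions based on the coll_tuned framework of that era and should be checked against the Open MPI version in use:

```text
1             # number of collectives described in this file
3             # collective ID (assumed: 3 == alltoall)
1             # number of communicator sizes
64            # the rules below apply to 64-rank communicators
2             # number of message-size rules
0 3 0 0       # from 0 bytes: algorithm 3 (bruck)
131072 2 0 0  # from 131072: algorithm 2 (pairwise)
```

The file would then be passed in via the coll_tuned_dynamic_rules_filename MCA parameter together with coll_tuned_use_dynamic_rules=1 (parameter names as used by the coll_tuned component of that era).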
Re: [OMPI users] scaling problem with openmpi
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > Disabling basic_linear seems like a good idea but your config file sets
> > the cut-off at 128 Bytes for 64 ranks (the field you set to 8192 seems
> > to result in a message size of that value divided by the number of
> > ranks).
> >
> > In my testing bruck seems to win clearly (at least for 64 ranks on my
> > IB) up to 2048. Hence, the following line may be better:
> >
> > 131072 2 0 0 # switch to pair wise for size 128K/nranks
> >
> > Disclaimer: I guess this could differ quite a bit for nranks!=64 and
> > different btls.
>
> Sounds strange to me. From the code it looks like we take the threshold
> as is, without dividing by the number of ranks.

Interesting. I may have had too little or too much coffee, but the figures in my previous e-mail (3rd run, bruckto2k_pair) were run with the above line. And it very much looks like it switched at 128K/64=2K, not at 128K (which would have been above my largest size of 3000 and as such equivalent to all_bruck).

I also ran tests with:

  8192 2 0 0 # ...

And it seemed to switch between 10 Bytes and 500 Bytes (most likely then at 8192/64=128).

My test program calls MPI_Alltoall like this:

  time1 = MPI_Wtime();
  for (i = 0; i < repetitions; i++) {
      MPI_Alltoall(sbuf, message_size, MPI_CHAR,
                   rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
  }
  time2 = MPI_Wtime();

/Peter
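The division behaviour Peter observes can be stated as simple arithmetic. A small sketch (my own illustration, not code from the thread) of the relationship between the rule-file threshold and the per-rank message size at which the switch showed up:

```python
def observed_switch_size(threshold_bytes: int, nranks: int) -> int:
    """Per-rank message size at which the algorithm switch was observed,
    assuming the rule-file field is divided by the number of ranks."""
    return threshold_bytes // nranks

# 131072 in the rule file, 64 ranks -> switch observed near 2048 bytes (2K)
print(observed_switch_size(131072, 64))  # 2048
# 8192 in the rule file, 64 ranks -> switch between 10 and 500 bytes
print(observed_switch_size(8192, 64))    # 128
```

Both values line up with the switch points Peter reports seeing in his benchmark runs.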
Re: [OMPI users] scaling problem with openmpi
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > With the file Pavel has provided things have changed to the following.
> > (maybe someone can confirm)
> >
> > If message size < 8192
> >    bruck
> > else
> >    pairwise
> > end
>
> You are right here. The target of my conf file is to disable
> basic_linear for medium message sizes.

Disabling basic_linear seems like a good idea, but your config file sets the cut-off at 128 Bytes for 64 ranks (the field you set to 8192 seems to result in a message size of that value divided by the number of ranks).

In my testing bruck seems to win clearly (at least for 64 ranks on my IB) up to 2048. Hence, the following line may be better:

  131072 2 0 0 # switch to pairwise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and different btls.

Here are some figures for this part of the package size range:

all_bruck
  bw for 10 x   10 B :  13.7 Mbytes/s   time was: 922.0 us
  bw for 10 x  500 B :  45.9 Mbytes/s   time was:  13.7 ms
  bw for 10 x 1000 B : 122.7 Mbytes/s   time was:  10.3 ms
  bw for 10 x 1500 B :  86.9 Mbytes/s   time was:  21.8 ms
  bw for 10 x 2000 B : 120.1 Mbytes/s   time was:  21.0 ms
  bw for 10 x 2047 B :  92.6 Mbytes/s   time was:  27.9 ms
  bw for 10 x 2048 B : 107.3 Mbytes/s   time was:  24.1 ms
  bw for 10 x 2400 B :  93.7 Mbytes/s   time was:  32.3 ms
  bw for 10 x 2800 B :  73.0 Mbytes/s   time was:  48.3 ms
  bw for 10 x 2900 B :  79.5 Mbytes/s   time was:  45.9 ms
  bw for 10 x 2925 B :  89.3 Mbytes/s   time was:  41.3 ms
  bw for 10 x 2950 B :  72.7 Mbytes/s   time was:  51.1 ms
  bw for 10 x 2975 B :  75.2 Mbytes/s   time was:  49.8 ms
  bw for 10 x 3000 B :  74.9 Mbytes/s   time was:  50.5 ms
  bw for 10 x 3100 B :  95.9 Mbytes/s   time was:  40.7 ms
  total time was: 479.5 ms

all_pair
  bw for 10 x   10 B : 414.2 kbytes/s   time was:  30.4 ms
  bw for 10 x  500 B :  19.8 Mbytes/s   time was:  31.9 ms
  bw for 10 x 1000 B :  43.3 Mbytes/s   time was:  29.1 ms
  bw for 10 x 1500 B :  63.3 Mbytes/s   time was:  29.9 ms
  bw for 10 x 2000 B :  81.2 Mbytes/s   time was:  31.0 ms
  bw for 10 x 2047 B :  82.3 Mbytes/s   time was:  31.3 ms
  bw for 10 x 2048 B :  83.0 Mbytes/s   time was:  31.1 ms
  bw for 10 x 2400 B :  93.6 Mbytes/s   time was:  32.3 ms
  bw for 10 x 2800 B : 105.0 Mbytes/s   time was:  33.6 ms
  bw for 10 x 2900 B : 107.7 Mbytes/s   time was:  33.9 ms
  bw for 10 x 2925 B : 108.1 Mbytes/s   time was:  34.1 ms
  bw for 10 x 2950 B : 109.6 Mbytes/s   time was:  33.9 ms
  bw for 10 x 2975 B : 111.1 Mbytes/s   time was:  33.7 ms
  bw for 10 x 3000 B : 112.1 Mbytes/s   time was:  33.7 ms
  bw for 10 x 3100 B : 114.5 Mbytes/s   time was:  34.1 ms
  total time was: 484.1 ms

bruckto2k_pair
  bw for 10 x   10 B :  11.9 Mbytes/s   time was:   1.1 ms
  bw for 10 x  500 B : 100.3 Mbytes/s   time was:   6.3 ms
  bw for 10 x 1000 B : 115.9 Mbytes/s   time was:  10.9 ms
  bw for 10 x 1500 B : 117.2 Mbytes/s   time was:  16.1 ms
  bw for 10 x 2000 B :  95.7 Mbytes/s   time was:  26.3 ms
  bw for 10 x 2047 B :  96.6 Mbytes/s   time was:  26.7 ms
  bw for 10 x 2048 B :  82.2 Mbytes/s   time was:  31.4 ms
  bw for 10 x 2400 B :  94.1 Mbytes/s   time was:  32.1 ms
  bw for 10 x 2800 B : 105.6 Mbytes/s   time was:  33.4 ms
  bw for 10 x 2900 B : 108.4 Mbytes/s   time was:  33.7 ms
  bw for 10 x 2925 B : 108.3 Mbytes/s   time was:  34.0 ms
  bw for 10 x 2950 B : 109.9 Mbytes/s   time was:  33.8 ms
  bw for 10 x 2975 B : 111.5 Mbytes/s   time was:  33.6 ms
  bw for 10 x 3000 B : 108.3 Mbytes/s   time was:  34.9 ms
  bw for 10 x 3100 B : 114.7 Mbytes/s   time was:  34.0 ms
  total time was: 388.4 ms

These figures were run on a freshly compiled OpenMPI-1.3.2. The numbers for bruck at small package sizes vary a bit from run to run.

/Peter

> Pasha.
Re: [OMPI users] scaling problem with openmpi
On Wednesday 20 May 2009, Rolf Vandevaart wrote:
...
> If I am understanding what is happening, it looks like the original
> MPI_Alltoall made use of three algorithms. (You can look in
> coll_tuned_decision_fixed.c)
>
> If message size < 200 or communicator size > 12
>    bruck
> else if message size < 3000
>    basic linear
> else
>    pairwise
> end

And 3000 was the observed threshold for bad behaviour, so it seems very likely that "basic linear" was the culprit. My testing would suggest that "pairwise" was a good choice for ~3000 (but maybe bruck, as configured by Pavel, is good too).

/Peter

> With the file Pavel has provided things have changed to the following.
> (maybe someone can confirm)
>
> If message size < 8192
>    bruck
> else
>    pairwise
> end
>
> Rolf
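Rolf's two decision sequences can be written out as tiny functions. This is a sketch of the paraphrase in this thread, not the actual code in coll_tuned_decision_fixed.c:

```python
def alltoall_algorithm_fixed(message_size: int, comm_size: int) -> str:
    """Original fixed decision, exactly as paraphrased by Rolf."""
    if message_size < 200 or comm_size > 12:
        return "bruck"
    elif message_size < 3000:
        return "basic linear"
    else:
        return "pairwise"

def alltoall_algorithm_dynamic(message_size: int) -> str:
    """Decision with Pavel's dynamic rules file, as paraphrased by Rolf."""
    return "bruck" if message_size < 8192 else "pairwise"

# The problematic region: sizes just under 3000 on a small communicator
# used basic linear before Pavel's file, bruck after.
print(alltoall_algorithm_fixed(2900, 12))   # basic linear
print(alltoall_algorithm_dynamic(2900))     # bruck
```

Note that under this paraphrase any communicator with more than 12 ranks always selects bruck in the fixed decision, so the paraphrase is likely a simplification of the real selection logic.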
Re: [OMPI users] scaling problem with openmpi
On Tuesday 19 May 2009, Roman Martonak wrote:
> On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> > On Tuesday 19 May 2009, Roman Martonak wrote:
> > ...
> >> openmpi-1.3.2 time per one MD step is 3.66 s
> >> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
> >> = ALL TO ALL COMM   102033. BYTES   4221. =
> >> = ALL TO ALL COMM   7.802 MB/S   55.200 SEC =
...
> With TASKGROUP=2 the summary looks as follows
...
> = ALL TO ALL COMM   231821. BYTES   4221. =
> = ALL TO ALL COMM   82.716 MB/S   11.830 SEC =

Wow, according to this it takes 1/5th the time to do the same number (4221) of alltoalls if the size is (roughly) doubled... (ten times better performance with the larger transfer size).

Something is not quite right. Could you possibly try to run just the alltoalls like I suggested in my previous e-mail?

/Peter
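The ratios Peter points out can be checked directly against the quoted counters. A quick sanity check (numbers taken verbatim from the summaries above; my own illustration, not code from the thread):

```python
# CPMD alltoall summaries: default run vs TASKGROUP=2 (values quoted above)
small = {"bytes": 102033.0, "mb_per_s": 7.802, "seconds": 55.200}
large = {"bytes": 231821.0, "mb_per_s": 82.716, "seconds": 11.830}

time_ratio = small["seconds"] / large["seconds"]  # ~4.7x less time...
bw_ratio = large["mb_per_s"] / small["mb_per_s"]  # ...at ~10.6x throughput
size_ratio = large["bytes"] / small["bytes"]      # ~2.3x the message size

print(round(time_ratio, 2), round(bw_ratio, 2), round(size_ratio, 2))
```

So roughly doubling the message size really did cut the total alltoall time to about a fifth, which is what makes the smaller-message performance look suspicious.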
Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution
On Thursday 07 May 2009, nee...@crlindia.com wrote:
> Thanks Pasha for sharing IB Roadmaps with us. But i am more interested in
> finding out latency figures since they often matter more than bit rate.
>
> Could there be rough if not accurate latency figures being targeted in
> the IB World?

The (low-level verbs) latency has AFAIR changed only a few times:

1) started at 5-6 us with PCI-X Infinihost3
2) dropped to 3-4 us with PCI-Express Infinihost3
3) dropped to ~1 us with PCI-Express ConnectX

Disclaimer: rough figures and only for Mellanox chips.

/Peter

> Regards
>
> Neeraj Chourasia
Re: [OMPI users] Factor of 10 loss in performance with 1.3.x
On Tuesday 07 April 2009, Eugene Loh wrote:
> Iain Bason wrote:
> > But maybe Steve should try 1.3.2 instead? Does that have your
> > improvements in it?
>
> 1.3.2 has the single-queue implementation and automatic sizing of the sm
> mmap file, both intended to fix problems at large np. At np=2, you
> shouldn't expect to see much difference.
>
> >> And the slowdown doesn't seem to be observed by anyone other than
> >> Steve and his colleague?
> >
> > It would be useful to know who else has compared these two revisions.
>
> I just ran Netpipe and found that it gave a comparable sm latency as
> other pingpong tests. So, in my mind, the question is why Steve sees
> latencies that are about 10 usec on a platform that can give 1 usec.
> There seems to be something tricky about reproducing that 10-usec
> slowdown. I have trouble buying that it's just, "sm latency degraded
> from 1 usec to 10 usec when we went from 1.2 to 1.3". If it were as
> simple as that, we would all have been aware of the performance
> regression. There is some other special ingredient here (other than
> OMPI rev) that we're missing.

Maybe it's not btl-layer related at all. It could be something completely different, like messed up processor affinity.

/Peter
Re: [OMPI users] MPI can not open file?
On Tuesday 07 April 2009, Bernhard Knapp wrote:
> Hi
>
> I am trying to get a parallel job of the gromacs software started. MPI
> seems to boot fine but unfortunately it seems not to be able to open a
> specified file although it is definitely in the directory where the job
> is started.

Do all the nodes (in your machinefile) see the same filesystem(s)? Have you tried a trivial MPI program (like MPI_Init, open("...), MPI_Finalize)? I have compiled and executed gromacs (4.0.2) successfully with several Open MPI versions.

/Peter

> I also changed the file permissions to 777 but it does not
> affect the result. Any suggestions?
>
> cheers
> Bernhard
...
> Program mdrun, VERSION 4.0.3
> Source code file: gmxfio.c, line: 736
>
> Can not open file:
> 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr
Re: [OMPI users] Heterogeneous OpenFabrics hardware
On Tuesday 27 January 2009, Jeff Squyres wrote:
> It is worth clarifying a point in this discussion that I neglected to
> mention in my initial post: although Open MPI may not work *by
> default* with heterogeneous HCAs/RNICs, it is quite possible/likely
> that if you manually configure Open MPI to use the same verbs/hardware
> settings across all your HCAs/RNICs (assuming that you use a set of
> values that is compatible with all your hardware) that MPI jobs
> spanning multiple different kinds of HCAs or RNICs will work fine.
>
> See this post on the devel list for a few more details:
>
> http://www.open-mpi.org/community/lists/devel/2009/01/5314.php

So is it correct that each rank will check its HCA model and then pick up suitable settings for that HCA? If so, maybe Open MPI could fall back to very conservative settings if more than one HCA model was detected among the ranks. Or would this require communication at a stage where that would be complicated and/or ugly?

/Peter
[OMPI users] MPI_Gather bug with reproducer code attached
Problem description: Elements from all ranks are gathered correctly except for the element belonging to the root/target rank of the gather operation when using certain custom MPI datatypes (see reproducer code). The bug can be toggled by commenting/uncommenting line 11 in the .F90 file. Even though all this is for MPI_Gather, the same seems to go for MPI_Gatherv too.

I have verified the bad behaviour with several OpenMPI versions from 1.2.3 to 1.3b2. Correct behaviour has been observed on mvapich2 and PlatformMPI. Both gfortran and ifort have been tried.

Attached files:
  BUILD                      Build instructions
  RUN                        Run instructions
  mpi_gather_test.F90        Reproducer source code
  4rank_bad_output.txt       Bad output
  4rank_expected_output.txt  Good output

/Peter

BUILD:
  mpif90.openmpi -o mpi_gather_test.local_ompils mpi_gather_test.F90

RUN:
  mpirun.openmpi -np 4 ./mpi_gather_test.local_ompils | sort -nk 2

mpi_gather_test.F90:

  Module global
    implicit none
    include 'mpif.h'

    ! Handle for MPI_Type_create_struct
    Integer :: my_mpi_struct

    Type my_fortran_struct
       ! With the following line the bug bites, with it commented out the
       ! behaviour is as expected
       Integer :: unused_data
       Integer :: used_data
    End Type my_fortran_struct
  End Module global

  ! ------------------------------------------------------------------

  Program mpi_gather_test
    use global

    Integer            :: i
    Integer            :: nranks
    Integer, Parameter :: gather_target = 1
    Integer            :: rank
    Integer            :: ierror
    Type (my_fortran_struct), Pointer :: source_vector (:)
    Type (my_fortran_struct), Pointer :: dest_vector (:)

    call MPI_Init ( ierror )
    call MPI_Comm_rank ( MPI_COMM_WORLD, rank, ierror )
    call MPI_Comm_size ( MPI_COMM_WORLD, nranks, ierror )

    Allocate (source_vector(1), STAT = ierror)
    Allocate (dest_vector(1:nranks), STAT = ierror)

    ! Each rank initializes the data to be gathered to its rank number
    ! for tracing purposes (So we can see what goes where)
    source_vector(:)%used_data = rank

    ! Each rank initializes the target buffer with tracing data. The
    ! expectation is that on the root rank this will be completely over-
    ! written while on the rest of the ranks it will be unchanged.
    do i = 1, nranks
       dest_vector(i)%used_data = 10 * i + rank * 100 + 1000
    enddo

    ! Call the subroutine below that creates the MPI-datatype.
    call create_datatype ( ierror )

    ! Run the actual gather.
    call MPI_Gather (source_vector, 1, my_mpi_struct, &
                     dest_vector,   1, my_mpi_struct, &
                     gather_target, MPI_COMM_WORLD, ierror)

    ! Output the content of the used_data part of the dest_vector on
    ! all ranks. On the root-rank of the gather it is expected that the
    ! initial data is overwritten with the data from the source_vectors
    ! gathered from all ranks.
    do i = 1, nranks
       print *, 'rank:', rank, 'element:', i, 'dest_vector%used_data: ', &
            dest_vector(i)%used_data
    enddo

    call MPI_Finalize (ierror)
  end program mpi_gather_test

  ! ------------------------------------------------------------------

  subroutine create_datatype (ierror)
    use global

    integer, Intent (Out) :: ierror
    integer (kind=MPI_ADDRESS_KIND) :: start, loc_used_data, loc_ub
    integer (kind=MPI_ADDRESS_KIND) :: disp (3)
    integer :: lengths (3), types (3), ext_size
    Type (my_fortran_struct) :: template (2)

    ierror = 0

    ! Get the offsets (displacements) from the template vector of
    ! my_fortran_struct type
    call MPI_Get_address (template(1), start, ierror)
    call MPI_Get_address (template(1)%used_data, loc_used_data, ierror)
    call MPI_Get_address (template(2), loc_ub, ierror)

    disp (1) = 0
    disp (2) = loc_used_data - start
    disp (3) = loc_ub - start

    lengths (1:3) = 1
    types (1) = MPI_LB
    types (2) = MPI_INTEGER
    types (3) = MPI_UB

    ! Create the MPI-type
    call MPI_Type_create_struct (3, lengths, disp, types, &
                                 my_mpi_struct, ierror)
    call MPI_Type_commit (my_mpi_struct, ierror)
  end subroutine create_datatype

4rank_bad_output.txt:
  rank: 0 element: 1 dest_vector%used_data: 1010
  rank: 0 element: 2 dest_vector%used_data: 1020
  rank: 0 element: 3 dest_vector%used_data: 1030
  rank: 0 element: 4 dest_vector%used_data: 1040
  rank: 1 element: 1 dest_vector%used_data: 0
  rank: 1 element: 2 dest_vector%used_data: 1120
  rank: 1 element: 3 dest_vector%used_data: 2
  rank: 1 element: 4 dest_vector%used_data: 3
  rank: 2 element: 1 dest_vector%used_data: 1210
  rank: 2 element: 2 dest_vector%used_data: 1220
  rank: 2 element:
Re: [OMPI users] SLURM vs. Torque? [OT]
On Monday 22 October 2007, Bill Johnstone wrote:
> Hello All.
>
> We are starting to need resource/scheduling management for our small
> cluster, and I was wondering if any of you could provide comments on
> what you think about Torque vs. SLURM? On the basis of the appearance
> of active development as well as the documentation, SLURM seems to be
> superior, but can anyone shed light on how they compare in use?

I won't attempt a full analysis but here are two small (random) crumbs of information.

1) Slurm keeps the name of stuff separate from the contact address (ControlMachine=hostname, ControlAddr=IP/whatever). This alone wins my heart any day of the week.

2) The scheduler can be a weak point for slurm. If you can live with the built-in trivial one then great. If you need more and happen to find something that is free and works (or write one yourself) then let me know ;-)

/Peter
Re: [OMPI users] Performance tuning: focus on latency
On Wednesday 25 July 2007, Jeff Squyres wrote:
> On Jul 25, 2007, at 7:45 AM, Biagio Cosenza wrote:
> > Jeff, I did what you suggested
> >
> > However no noticeable changes seem to happen. Same peaks and same
> > latency times.
>
> Ok. This suggests that Nagle may not be the issue here.

My guess would be that there are some nasty dead animals buried in the network. The OP mentioned 200 ms; that's enough time to cross continents, not a time you'd expect in the same sentence as "cluster" and "latency".

...on the other hand, we don't really know what the benchmark case was.

/Peter
Re: [OMPI users] IB bandwidth vs. kernels
On Thursday 18 January 2007 13:08, Scott Atchley wrote:
...
> The best uni-directional performance I have heard of for PCIe 8x IB
> DDR is ~1,400 MB/s (11.2 Gb/s)

This is on par with what I have seen.

> with Lustre, which is about 55% of the
> theoretical 20 Gb/s advertised speed.

I think this should be calculated against 16 Gbps, not 20 Gbps.

> The ~900 MB/s (7.2 Gb/s)
> mentioned above is, of course, ~72% of advertised speed. If any IB
> folks have any better numbers, please correct me.

Using MPI (over a non-idle multi-level switch) I get 940 * 10^6 Bytes/s, which is 94% of peak for that IB 4x SDR.

> The data throughput limit for 8x PCIe is ~12 Gb/s. The theoretical
> limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte
> overhead. If the adapter uses 64 byte packets, then you see 1/3 of
> the throughput go to overhead.

AFAIK the data field of a PCI-Express packet is 0-4096 bytes and the header a bit more than 20 bytes (including things such as start/stop frame bytes, LCRC/ECRC...). This gives a maximum speed over 4x PCIe of 993.3 * 10^6 Bytes/s (8 Gbps after coding, minus header waste, for a full 4096-byte payload).

In short, the SDR IB equipment I have seen has easily reached 90%+ while PCI-Express on the platforms I've tried has been limited to ~75%. Current IB DDR HCAs are probably limited by (at least) PCI-Express 8x.

/Peter
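The encoding and packet-overhead arithmetic in this exchange can be made explicit. A back-of-the-envelope sketch (my own illustration; the ~28-byte per-packet PCIe overhead is an assumed figure chosen to be consistent with Peter's 993.3 * 10^6 Bytes/s number):

```python
def data_rate_gbps(signalling_gbps: float) -> float:
    """8b/10b encoding: 20% of the signalling rate is coding overhead."""
    return signalling_gbps * 8.0 / 10.0

# IB 4x DDR: advertised 20 Gb/s -> 16 Gb/s of data (Peter's point above)
print(data_rate_gbps(20.0))                  # 16.0

# Lustre's ~1400 MB/s is 11.2 Gb/s, i.e. 70% of the 16 Gb/s data rate
observed_gbps = 1400 * 8 / 1000.0
print(observed_gbps / data_rate_gbps(20.0))  # 0.7

# 4x PCIe gen1: 8 Gb/s of data; with 4096-byte payloads and ~28 bytes of
# per-packet overhead (assumed) this lands near Peter's 993.3e6 Bytes/s
payload, overhead = 4096.0, 28.0
pcie_4x_Bps = data_rate_gbps(10.0) / 8 * 1e9 * payload / (payload + overhead)
print(round(pcie_4x_Bps / 1e6, 1))
```

With 64-byte payloads instead, the same formula reproduces Scott's point that roughly a third of the raw throughput goes to per-packet overhead.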
Re: [OMPI users] IB bandwidth vs. kernels
On Thursday 18 January 2007 09:52, Robin Humble wrote:
...
> is ~10Gbit the best I can expect from 4x DDR IB with MPI?
> some docs @HP suggest up to 16Gbit (data rate) should be possible, and
> I've heard that 13 or 14 has been achieved before. but those might be
> verbs numbers, or maybe horsepower >> 4 cores of 2.66GHz core2 is
> required?

The 16 Gbit/s number is the theoretical peak; IB is coded 8/10, so out of the 20 Gbit/s, 16 is what you get. On SDR this number is (of course) 8 Gbit/s achievable (which is ~1000 MB/s) and I've seen well above 900 with MPI (this on 8x PCIe, 2x margin).

The same setup on 4x PCIe stops at a bit over 700 MB/s (for a certain PCIe chipset), so it makes some sense that an IB 4x DDR on 8x PCIe would be limited to about 1500 MB/s (on that platform).

All this ignoring possible MPI bottlenecks above 900 MB/s and assuming the IB fabric can reach 95%+ of peak on DDR as it does on SDR...

/Peter
Re: [OMPI users] OpenMPI on HPUX?
On Tuesday 16 January 2007 15:37, Brian W. Barrett wrote:
> Open MPI will not run on PA-RISC processors.

HPUX runs on IA-64 too.

/Peter