Mark, Exciting.. SOLVED.. There is an open ticket, #2043, regarding the Nehalem/OpenMPI hang problem (https://svn.open-mpi.org/trac/ompi/ticket/2043). It seems the problem might be specific to gcc 4.4.x and OMPI <1.3.2. Apparently there is a group of us with dual-socket Nehalems trying to use OMPI without much luck (or at least not without headaches)..
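(For anyone else hitting this: the two workarounds that surfaced in this thread are both plain mpirun-level MCA settings; the -np value and program name below are only illustrations.)

    mpirun -np 8 -mca btl_sm_num_fifos 7 ./connectivity_c    # more shared-memory FIFOs (the setting noted below)
    mpirun -np 8 -mca btl ^sm ./connectivity_c               # Mark's workaround: disable the sm BTL entirely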
Of note, -mca btl_sm_num_fifos 7 seems to work as well.. now off to see if I can get some real code to work... Thanks, Mark, Gus, and the rest of the OMPI Users Group!

On Dec 10, 2009, at 7:42 AM, Mark Bolstad wrote:

> Just a quick interjection: I also have a dual-quad Nehalem system, HT on, 24GB RAM, with a hand-compiled 1.3.4 built with the options --enable-mpi-threads --enable-mpi-f77=no --with-openib=no.
>
> With v1.3.4 I see roughly the same behavior: hello and ring work; connectivity fails randomly with np >= 8. Turning on -v increased the success rate, but it still hangs. np = 16 fails more often, and which pair of processes hang while communicating is random.
>
> However, it seems to be related to a problem in the shared-memory layer. Running with -mca btl ^sm works consistently through np = 128.
>
> Hope this helps.
>
> Mark
>
> On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
> Hi Matthew
>
> Save for any misinterpretation I may have made of the code:
>
> hello_c has no real communication, except for a final Barrier synchronization. Each process prints "hello world" and that's it.
>
> ring_c probes a little more, with processes Send(ing) and Recv(ing) messages. Ring just passes a message sequentially along all process ranks, then back to rank 0, and repeats the game 10 times. Rank 0 is in charge of counting turns, decrementing the counter, and printing it (nobody else prints). With 4 processes: 0->1->2->3->0->1... 10 times.
>
> In connectivity_c every pair of processes exchanges a message, so it probes all pairwise connections. In verbose mode you can see that.
>
> These programs shouldn't hang at all if the system were sane. Actually, they should even run with a significant level of oversubscription; say, -np 128 should work easily for all three programs on a powerful machine like yours.
>
> **
>
> Suggestions
>
> 1) Stick to the OpenMPI you compiled.
>
> **
>
> 2) You can run connectivity_c in verbose mode:
>
> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
>
> (Note the trailing "-v".) It should tell you more about who's talking to whom.
>
> **
>
> 3) I wonder if there are any BIOS settings that may be required (and perhaps not in place) to make Nehalem hyperthreading work properly on your computer.
>
> You reach the BIOS settings by typing <DEL> or <F2> when the computer boots up. The key varies by BIOS and computer vendor, but it shows quickly on the boot-up screen.
>
> You may ask the computer vendor about the recommended BIOS settings. If you haven't done this before, be careful to change and save only what really needs to change (if anything does), or the result may be worse. (Overclocking is for gamers, not for genome researchers ... :) )
>
> **
>
> 4) What I read about Nehalem DDR3 memory is that it is optimal in configurations that are multiples of 3GB per CPU. Common configs in dual-CPU machines like yours are 6, 12, 24 and 48GB. The sockets where you install the memory modules also matter.
>
> Your computer has 20GB. Did you build the computer or upgrade the memory yourself? Do you know how the memory is installed, and in which memory sockets? What does the vendor have to say about it?
>
> See this: http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
>
> **
>
> 5) As I said before, typing "f" then "j" in "top" will add a column (labeled "P") that shows which core each process is running on.
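(A small MPI program can report the same placement information from inside each rank; this is a minimal sketch using glibc's sched_getcpu(), written for this note rather than taken from the Open MPI examples:)

    /* whereami.c - print which core each MPI rank is currently running on */
    #define _GNU_SOURCE          /* needed for sched_getcpu() with glibc */
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* sched_getcpu() returns the CPU this process is on right now */
        printf("rank %d is running on core %d\n", rank, sched_getcpu());
        MPI_Finalize();
        return 0;
    }

(Compile and run it like the other examples, e.g. mpicc whereami.c -o whereami and then mpirun -np 8 ./whereami; the file name is just an illustration.)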
> Either way, this will let you observe how the Linux scheduler is distributing the MPI load across the cores. Hopefully it is load-balanced, and different processes go to different cores.
>
> ***
>
> It is very disconcerting when MPI processes hang. You are not alone. The reasons are not always obvious. At least in your case there is no network involved to troubleshoot.
>
> **
>
> I hope it helps,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Matthew MacManes wrote:
>
> Hi Gus and List,
>
> First of all, Gus, I want to say thanks.. you have been a huge help, and when I get this fixed, I owe you big time!
>
> However, the problems continue...
>
> I formatted the HD and reinstalled the OS to make sure that I was working from scratch. I did your step A, which seemed to go fine:
>
> macmanes@macmanes:~$ which mpicc
> /home/macmanes/apps/openmpi1.4/bin/mpicc
> macmanes@macmanes:~$ which mpirun
> /home/macmanes/apps/openmpi1.4/bin/mpirun
>
> Good stuff there...
>
> I then compiled the example files:
>
> macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
> Process 2 exiting
> Process 3 exiting
> Process 4 exiting
> Process 5 exiting
> Process 6 exiting
> Process 7 exiting
> macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
> Connectivity test on 8 processes PASSED.
> macmanes@macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
> ..HANGS..NO OUTPUT
>
> This is maddening, because ring_c works.. and connectivity_c worked the 1st time, but not the second... I did it 10 times, and it worked twice.. here is the TOP screenshot:
>
> http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394
>
> What is the difference between connectivity_c and ring_c? Under what circumstances should one fail and not the other?
>
> I'm off to the Linux forums to see about the Nehalem kernel issues..
>
> Matt
>
> On Wed, Dec 9, 2009 at 13:25, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
> Hi Matthew
>
> There is no point in trying to troubleshoot MrBayes and ABySS if not even the OpenMPI test programs run properly. You must straighten them out first.
>
> **
>
> Suggestions:
>
> **
>
> A) While you are at OpenMPI, do yourself a favor and install it from source in a separate directory. Who knows if the OpenMPI package distributed with Ubuntu works right on Nehalem? Better to install OpenMPI yourself from source code. It is not a big deal, and it may save you further trouble.
>
> Recipe:
>
> 1) Install gfortran and g++, if you don't have them, using apt-get.
> 2) Put the OpenMPI tarball in, say, /home/matt/downloads/openmpi
> 3) Make another install directory *not in the system directory tree*.
> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z = version) will work.
> 4) cd /home/matt/downloads/openmpi
> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>      --prefix=/home/matt/apps/openmpi-X.Y.Z
>    (Use the prefix flag to install in the directory of item 3.)
> 6) make
> 7) make install
> 8) At the bottom of your /home/matt/.bashrc or .profile file put these lines:
>
> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
> export LD_LIBRARY_PATH=/home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>
> (If you use csh/tcsh, use instead: setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}, etc.)
>
> 9) Log out and log in again to refresh the environment variables.
> 10) Do "which mpicc" to check that it is pointing to your newly installed OpenMPI.
> 11) Recompile and rerun the OpenMPI test programs with 2, 4, 8, 16, ... processes. Use full path names to mpicc and to mpirun if the change of PATH above doesn't work right.
>
> ********
>
> B) Nehalem is quite new hardware. I don't know if the Ubuntu kernel 2.6.31-16 fully supports all of the Nehalem features, particularly hyperthreading and NUMA, which MPI programs use. I am not the right person to give you advice about this. I googled around but couldn't find clear information about the minimal kernel requirements for Nehalem to be fully supported. Some Nehalem owner on the list could come forward and tell us.
>
> **
>
> C) On the top screenshot you sent me: please try it again (after you do item A), but type "f" and "j" to show the processors that are running each process.
>
> **
>
> D) Also, the screenshot shows 20GB of memory. This does not sound like an optimal memory configuration for Nehalem, which tends to be 6GB, 12GB, 24GB, or 48GB. Did you put together the system or upgrade the memory yourself, or did you buy the computer as is? This should not break MPI anyway, however.
>
> **
>
> E) Answering your question: it is true that using different flavors of MPI to compile (mpicc) and run (mpiexec) a program would probably break right away, regardless of the number of processes. However, when it comes to different versions of the same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3), I am not sure it will break. I would guess it may run, but not reliably; problems may appear as you stress the system with more cores, etc. But this is just a guess.
>
> **
>
> I hope this helps,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Matthew MacManes wrote:
>
> Hi Gus,
>
> Interestingly, the results for the connectivity_c test... it works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, sometimes I get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke or what. Here is a link to a screenshot of TOP when connectivity_c is hung with -np 14.. I see that 2 processes are at only 50% CPU usage..
> Hmmmm
> http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink
>
> The other tests, ring_c and hello_c, as well as the cxx versions of these guys, work with all values of -np.
>
> Using -mca mpi_paffinity_alone 1 I get the same behavior.
>
> I agree that I should worry about the mismatch between where the libraries are installed and where I am telling my programs to look for them. Would this type of mismatch cause behavior like what I am seeing, i.e. working with a small number of processors but failing with more? It seems like a mismatch would have the same effect regardless of the number of processors used, but maybe I am mistaken. Anyway, to address this: "which mpirun" gives me /usr/local/bin/mpirun.. so to configure, ./configure --with-mpi=/usr/local/bin/mpirun, and to run, /usr/local/bin/mpirun -np X ... This should ...
>
> uname -a gives me: Linux macmanes 2.6.31-16-generic #52-Ubuntu SMP Thu Dec 3 22:07:16 UTC 2009 x86_64 GNU/Linux
>
> Matt
>
> On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:
>
> Hi Matthew
>
> Please see comments/answers inline below.
>
> Matthew MacManes wrote:
>
> Hi Gus, Thanks for your ideas.. I have a few questions, and will try to answer yours in hopes of solving this!!
>
> A simple way to test OpenMPI on your system is to run the test programs that come with the OpenMPI source code: hello_c.c, connectivity_c.c, and ring_c.c (http://www.open-mpi.org/).
>
> Get the tarball from the OpenMPI site, gunzip and untar it, and look in the "examples" directory. Compile with /your/path/to/openmpi/bin/mpicc hello_c.c and run with /your/path/to/openmpi/bin/mpiexec -np X a.out, using X = 2, 4, 8, 16, 32, 64, ...
>
> This will tell you whether your OpenMPI is functional and whether you can run on many Nehalem cores, perhaps even with oversubscription. It will also set the stage for further investigation of your actual programs.
>
> Should I worry about setting things like --num-cores --bind-to-cores? This, I think, gets at your questions about processor affinity.. Am I right? I could not exactly figure out the -mca mpi_paffinity_alone stuff...
>
> I use the simple-minded -mca mpi_paffinity_alone 1. This is probably the easiest way to assign a process to a core. There are more complex ways in OpenMPI, but I haven't tried them. Indeed, -mca mpi_paffinity_alone 1 does improve the performance of our programs here. There is a chance that without it the 16 virtual cores of your Nehalem get confused with more than 3 processes (you reported that -np > 3 breaks).
>
> Did you try adding just -mca mpi_paffinity_alone 1 to your mpiexec command line?
>
> 1. Additional load: nope. Nothing else, most of the time not even firefox.
>
> Good. Turn off firefox, etc., to make it even better. Ideally, use runlevel 3, no X, like a computer cluster node, but this may not be required.
>
> 2. RAM: no problems apparent when monitoring through TOP. Interestingly, I did wonder about oversubscription, so I tried the option --nooversubscription, but this gave me an error message.
> Oversubscription from your program would only happen if you asked for more processes than available cores, i.e., -np > 8 (or "virtual" cores, in the case of Nehalem hyperthreading, -np > 16). Since you have -np=4 there is no oversubscription, unless you have other external load (e.g. Matlab, etc.), but you said you don't.
>
> Yet another possibility would be if your program were threaded (e.g. using OpenMP along with MPI), but considering what you said about OpenMP I would guess the programs don't use it. For instance, you launch the program with 4 MPI processes, and each process decides to start, say, 8 OpenMP threads. You end up with 32 threads on 8 (real) cores (or 16 hyperthreaded ones on Nehalem).
>
> What else does top say? Any hog processes (memory- or CPU-wise) besides your program processes?
>
> 3. I have not tried other MPI flavors.. I've been speaking to the authors of the programs, and they are both using OpenMPI.
>
> I was not trying to convince you to use another MPI. I use MPICH2 also, but OpenMPI reigns here. The idea of trying it with MPICH2 was just to check whether OpenMPI is causing the problem, but I don't think it is.
>
> 4. I don't think that this is a problem, as I'm specifying --with-mpi=/usr/bin/... when I compile the programs. Is there any other way to be sure that this is not a problem?
>
> Hmmm .... I don't know about your Ubuntu (we have CentOS and Fedora on various machines). However, most Linux distributions come with their own MPI flavors, and so do compilers, etc. Oftentimes they install these goodies in unexpected places, and this has caused a lot of frustration. There are tons of postings on this list that eventually boiled down to mismatched versions of MPI in unexpected places.
>
> The easy way is to use full path names to compile and to run. Something like this (in your program configuration script):
>
> /my/openmpi/bin/mpicc
>
> and something like this when you submit the job:
>
> /my/openmpi/bin/mpiexec -np ... bla, bla ...
>
> You can check your version with "which mpicc", "which mpiexec", and (perhaps using full path names) with "ompi_info", "mpicc --showme", "mpiexec --help".
>
> 5. I had not been, and you could see some shuffling when monitoring the load on specific processors. I have tried to use --bind-to-cores to deal with this. I don't understand how to use the -mca options you asked about.
> 6. I am using Ubuntu 9.10, gcc 4.4.1, and g++ 4.4.1.
>
> I am afraid I won't be of much help here, because I don't have a Nehalem. However, I read about Nehalem requiring quite recent kernels to get all of its features working right.
>
> What is the output of "uname -a"? This will tell us the kernel version, etc. Other list subscribers may give you a suggestion if you post the information.
>
> MrBayes is a program for Bayesian phylogenetics: http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
> ABySS is a program for assembly of DNA sequence data: http://www.bcgsc.ca/platform/bioinfo/software/abyss
>
> Thanks for the links! I had found the MrBayes link. I eventually found what your ABySS was about, but no links. Amazing that it is about DNA/gene sequencing. Our abyss here is the deep ocean ... :) Abysmal difference!
>
> Do the programs mix MPI (message passing) with OpenMP (threads)?
>
> I'm honestly not sure what this means..
>
> Some programs mix the two. OpenMP only works in a shared-memory environment (e.g. a single computer like yours), whereas MPI can use both shared memory and work across a network (e.g. in a cluster). There are other differences too.
>
> It is unlikely that you have this hybrid type of parallel program; otherwise there would be some reference to OpenMP in the program configuration files, documentation, etc. Also, in general the configuration scripts of these hybrid programs can turn on MPI only, or OpenMP only, or both, depending on how you configure them.
>
> Even to compile with OpenMP you would need a proper compiler flag, but that one might be hidden in a Makefile too, making it a bit hard to find. "grep -n mp Makefile" may give a clue. Is there anything in the documentation that mentions threads or OpenMP?
>
> FYI, here is OpenMP: http://openmp.org/wp/
>
> Thanks for all your help!
>
> Matt
>
> Well, so far it didn't really help. :( But let's hope to find a clue, maybe with a little help from our list subscriber friends.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Hi Matthew
>
> More guesses/questions than anything else:
>
> 1) Is there any additional load on this machine? We had problems like that (on different machines) when users start listening to streaming video, doing Matlab calculations, etc., while the MPI programs are running. This tends to oversubscribe the cores, and may lead to crashes.
>
> 2) RAM: can you monitor the RAM usage through "top"? (I presume you are on Linux.) It may show unexpected memory leaks, if they exist. In "top", type "1" (one) to see all cores, and type "f" then "j" to see the core number associated with each process.
>
> 3) Do the programs work right with other MPI flavors (e.g. MPICH2)? If not, then it is not OpenMPI's fault.
>
> 4) Any possibility that the MPI versions/flavors of the mpicc and mpirun you are using to compile and launch the program are not the same?
>
> 5) Are you setting processor affinity on mpiexec?
>
> mpiexec -mca mpi_paffinity_alone 1 -np ... bla, bla ...
>
> Context switching across the cores may also cause trouble, I suppose.
>
> 6) Which Linux are you using (uname -a)? On other mailing lists I read reports that only quite recent kernels support all the Intel Nehalem processor features well. I don't have a Nehalem, so I can't help here, but the information may be useful for other list subscribers trying to help you.
>
> ***
>
> As for the programs: some programs require a specific setup (and even specific compilation) when the number of MPI processes varies. It may help if you send us links to the program sites.
>
> Bayesian statistics is not totally out of our business, but phylogenetic trees are not really my league, hence forgive me any bad guesses, please. Would the programs need specific compilation, or a different set of input parameters, to run correctly on a different number of processors? Do the programs mix MPI (message passing) with OpenMP (threads)?
>
> I found this MrBayes, which seems to do the above:
>
> http://mrbayes.csit.fsu.edu/
> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
>
> As for ABySS, what is it, and where can it be found? It doesn't look like a deep ocean circulation model, as the name suggests.
> My $0.02
> Gus Correa

_________________________________
Matthew MacManes
PhD Candidate
University of California - Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
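(One practical follow-up, not from the thread itself: MCA settings such as the workaround above do not have to be typed on every mpirun command line. Open MPI also reads per-user defaults from an MCA parameter file, so the setting can be made persistent; the value here is the one reported in this thread:)

    $ cat ~/.openmpi/mca-params.conf
    btl_sm_num_fifos = 7

(Equivalently, a line "btl = ^sm" would disable the shared-memory BTL altogether, at some cost in on-node communication speed.)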