Re: [OMPI users] Program deadlocks, on simple send/recv loop
vasilis gkanis wrote: I had a similar problem with the Portland Fortran compiler. I knew that this was not caused by a network problem (I ran the code on a single node with 4 CPUs). After I tested pretty much everything, I decided to change the compiler. I used the Intel Fortran compiler and everything is running fine. It could be PGI compiler voodoo :) There were some thoughts on this e-mail thread that the problem could be related to trac ticket 2043. Note that there has been progress on this ticket. See https://svn.open-mpi.org/trac/ompi/ticket/2043#comment:18 . The shared-memory (on-node) communications were subject to race conditions that could be exposed by optimizing compilers. Some signals could have gotten lost in inter-process communication, quite possibly leading to hangs. If you think you got bitten by this bug, please try the revisions mentioned in the trac ticket and report your success (or, alas, failure) via the trac ticket or as appropriate.
[OMPI users] VampirTrace: time not increasing
I got the following problem while trying to run a vt-enabled HPL benchmark on a single 8-core Linux node:

OTF ERROR in function OTF_WBuffer_setTimeAndProcess, file: OTF_WBuffer.c, line: 308: time not increasing. (t= 2995392288, p= 2)
vtunify: Error: Could not read events of OTF stream [namestub ./a__ufy.tmp id 1]
vtunify: An error occurred during unifying events - Terminating ...

Sometimes, instead of the above message, I get this:

vtunify: vt_unify_events_hdlr.cc:37: int Handle_Enter(OTF_WStream*, uint64_t, uint32_t, uint32_t, uint32_t): Assertion `global_statetoken != 0' failed.

The program is automatically instrumented (all I did was change mpicc to mpicc-vt), compiled, and run with the latest SVN version of Open MPI, with the command: mpirun -np 8 --mca btl self,sm hpl. The same problem occurs with the latest release version. When I run it with fewer processes, it works fine. Any ideas? -- Roman I. Cheplyaka
Re: [OMPI users] MTT -trivial :All tests are not getting passed
Hi Vishal, This is an MTT question for mtt-us...@open-mpi.org (see comments below). On Tue, Dec/22/2009 03:54:08PM, vishal shorghar wrote:
>Hi All,
>
>I have one issue with the MTT trivial tests: not all tests are passing. Please read below for a detailed description.
>
>Today I ran the MTT trivial tests with the latest OFED package
>OFED-1.5-20091217-0600 (ompi-1.4) between two machines. I was able to run
>the MTT trivial tests manually but not through the MTT framework. I think we
>are missing some configuration steps, since it is unable to find the test
>executables in the test run phase of MTT.
>
>-> When we ran it through MTT, it gave us an error and exited.
>I ran the test as "cat developer.ini trivial.ini | ../client/mtt
>--verbose - "
>
>-> When we analyzed the error from
>/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt, we
>found that it is not finding the executables of the different tests to
>execute.
>
>-> Then we found that those executables were being generated on only one
>of the two machines. So we manually copied the tests from
>/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring
>to the other machine.
>
>-> And we ran it manually as shown below and it worked fine:
>mpirun --host 102.77.77.64,102.77.77.68 -np 2 --mca btl openib,sm,self
>--prefix
>
> /usr/mpi/gcc/openmpi-1.4/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring
>
>-> I am attaching the files trivial.ini, developer.ini, and
>/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt.
>
>Let us know if I am missing some configuration steps.
>
You need to set your scratch directory (via the --scratch option) to an NFS share that is accessible to all nodes in your hostlist. MTT won't copy local files onto each node for you.
Regards, Ethan

>NOTE:
>
>It gave me the following output at the end of the test command, and the
>same is saved in /root/mtt-svn/samples/All_phase-summary.txt
>
>hostname: nizam
>uname: Linux nizam 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
>who am i:
>
>| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                      |
>| MPI Install | my installation | 1.4         | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-1.4.html |
>| Test Build  | trivial         | 1.4         | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-1.4.html          |
>| Test Run    | trivial         | 1.4         | 00:10    |      | 8    |          |      | Test_Run-trivial-my_installation-1.4.html            |
>
>Total Tests: 10
>Total Failures: 8
>Total Passed: 2
>Total Duration: 11 secs. (00:11)
>
>Thanks & Regards,
>
>Vishal shorghar
>MTS
>Chelsio Communication
>
> #
> # Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
> #
> # Template MTT configuration file for Open MPI developers. The intent
> # for this template file is to establish at least some loose
> # guidelines for what Open MPI core developers should be running
> # before committing changes to the OMPI repository. This file is not
> # intended to be an exhaustive sample of all possible fields and
> # values that MTT offers. Each developer will undoubtedly have to
> # edit this template for their own needs (e.g., pick compilers to use,
> # etc.), but this file provides a baseline set of configurations that
> # we intend for you to run.
> #
> # Sample usage:
> #   cat developer.ini intel.ini | client/mtt - alreadyinstalled_dir=/your/install
> #   cat developer.ini trivial.ini | client/mtt - alreadyinstalled_dir=/your/install
> #
>
> [MTT]
> # No overrides to defaults
>
> # Fill this field in
> #hostlist = 102.77.77.63 102.77.77.54 102.77.77.64 102.77.77.68
> #hostlist = 102.77.77.66 102.77.77.68 102.77.77.63 102.77.77.64 102.77.77.53 102.77.77.54 102.77.77.243 102.77.77.65
> hostlist = 102.77.77.64 102.77.77.68
> hostlist_max_np = 2
> max_np = 2
> force = 1
> #prefix = /usr/mpi/gcc/openmpi-1.3.4/bin
>
> #--
>
> [MPI Details: Open MPI]
>
> exec = mpirun @hosts@ -np &test_np() @mca@ --prefix &test_prefix() &test_executable() &test_
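Following Ethan's advice, the fix amounts to pointing MTT's scratch area at a shared filesystem so that the test executables built in the Test Build phase are visible on every node. A sketch (the NFS path is illustrative, not from the thread):

```shell
# Put the scratch directory on an NFS share mounted on all nodes in
# the hostlist; MTT will not copy locally built tests to other nodes.
# /nfs/shared/mtt-scratch is an illustrative path.
mkdir -p /nfs/shared/mtt-scratch
cat developer.ini trivial.ini | ../client/mtt --verbose --scratch /nfs/shared/mtt-scratch -
```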
Re: [OMPI users] Torque 2.4.3 fails with OpenMPI 1.3.4; no startup at all
Hi Ralph, Somehow I did not receive your last answer as mail, so I am replying to myself... Thanks for the explanation. I thought that the prefix issue would be handled by the OMPI configure parameter "--enable-mpirun-prefix-by-default", but now I see your point. Anyway, I did not find any further information regarding that issue in the Torque FAQ, and since the rsh launcher works I will stick to that and not spend more time experimenting with Torque... Thanks again for your help! Greetings Johann Johann Knechtel schrieb: > Ralph, thank you very much for your input! The parameter "mca plm rsh" > did it. I am just curious about the reason for that behavior. > You can find the complete output of the different commands embedded in > your mail below. The first line states the successful load of the OMPI > environment; we use the modules package on our cluster. > > Greetings > Johann > > > Ralph Castain schrieb: >> Sorry - hit "send" and then saw the version sitting right there in the >> subject! Doh... >> >> First, let's try verifying what components are actually getting used. 
Run >> this: >> >> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted >> > OpenMPI with PPU-GCC was loaded > [node1:00706] mca: base: components_open: Looking for plm components > [node1:00706] mca: base: components_open: opening plm components > [node1:00706] mca: base: components_open: found loaded component rsh > [node1:00706] mca: base: components_open: component rsh has no register > function > [node1:00706] mca: base: components_open: component rsh open function > successful > [node1:00706] mca: base: components_open: found loaded component slurm > [node1:00706] mca: base: components_open: component slurm has no > register function > [node1:00706] mca: base: components_open: component slurm open function > successful > [node1:00706] mca: base: components_open: found loaded component tm > [node1:00706] mca: base: components_open: component tm has no register > function > [node1:00706] mca: base: components_open: component tm open function > successful > [node1:00706] mca:base:select: Auto-selecting plm components > [node1:00706] mca:base:select:( plm) Querying component [rsh] > [node1:00706] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [node1:00706] mca:base:select:( plm) Querying component [slurm] > [node1:00706] mca:base:select:( plm) Skipping component [slurm]. 
Query > failed to return a module > [node1:00706] mca:base:select:( plm) Querying component [tm] > [node1:00706] mca:base:select:( plm) Query of component [tm] set > priority to 75 > [node1:00706] mca:base:select:( plm) Selected component [tm] > [node1:00706] mca: base: close: component rsh closed > [node1:00706] mca: base: close: unloading component rsh > [node1:00706] mca: base: close: component slurm closed > [node1:00706] mca: base: close: unloading component slurm > [node1:00706] mca: base: components_open: Looking for ras components > [node1:00706] mca: base: components_open: opening ras components > [node1:00706] mca: base: components_open: found loaded component slurm > [node1:00706] mca: base: components_open: component slurm has no > register function > [node1:00706] mca: base: components_open: component slurm open function > successful > [node1:00706] mca: base: components_open: found loaded component tm > [node1:00706] mca: base: components_open: component tm has no register > function > [node1:00706] mca: base: components_open: component tm open function > successful > [node1:00706] mca:base:select: Auto-selecting ras components > [node1:00706] mca:base:select:( ras) Querying component [slurm] > [node1:00706] mca:base:select:( ras) Skipping component [slurm]. 
Query > failed to return a module > [node1:00706] mca:base:select:( ras) Querying component [tm] > [node1:00706] mca:base:select:( ras) Query of component [tm] set > priority to 100 > [node1:00706] mca:base:select:( ras) Selected component [tm] > [node1:00706] mca: base: close: unloading component slurm > /opt/openmpi_1.3.4_gcc_ppc/bin/orted > [node1:00706] mca: base: close: unloading component tm > [node1:00706] mca: base: close: component tm closed > [node1:00706] mca: base: close: unloading component tm > >> Then get an allocation and run >> >> mpirun -pernode which orted >> > OpenMPI with PPU-GCC was loaded > -- > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to > launch so we are aborting. > > There may be more information reported by the environment (see above). > > This may be because the daemon was unable to find all the needed shared > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the > location of the shared libraries on the remote nodes and this will > automatically be forwarded to the remote nodes. > -- > ---
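For reference, forcing the rsh/ssh launcher as Johann did can be done per-run or persistently via the user-level MCA parameter file; a sketch (the prefix path is the one from this thread, so adjust for your install):

```shell
# Per-run: bypass the Torque (tm) launcher and use rsh/ssh instead
mpirun --mca plm rsh -np 2 --prefix /opt/openmpi_1.3.4_gcc_ppc hostname

# Or persistently for this user, so every mpirun picks it up:
mkdir -p ~/.openmpi
echo "plm = rsh" >> ~/.openmpi/mca-params.conf
```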
Re: [OMPI users] Is OpenMPI's orted = MPICH2's smpd?
On Tue, 2009-12-22 at 09:59 +0530, Sangamesh B wrote: > Hi, > > MPICH2 has different process managers: MPD, SMPD, GFORKER etc. It also has Hydra. > Is the Open MPI's startup daemon orted similar to MPICH2's smpd? Or > something else? My understanding is that SMPD is for launching on Windows, which isn't something I know about. orte is similar to MPD, although without the requirement that you start the ring beforehand. A quick summary of orte: orte takes a list of nodes and a process count; given these, it will start a job of the given size on the given nodes. No prior configuration or starting of daemons is required. No effort is made to prevent multiple jobs from starting on the same nodes, and no effort is made to maintain a "queue" of jobs waiting for nodes to become free. Each job is independent, and runs where you tell it to immediately. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
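As a concrete illustration of the difference Ashley describes (host names and executable are placeholders):

```shell
# MPICH2 with MPD: a daemon ring must exist before you can launch
mpdboot -n 2 -f mpd.hosts     # start the MPD ring first
mpiexec -n 8 ./a.out

# Open MPI: no prior daemon setup; mpirun starts orted on the
# remote nodes itself and tears the daemons down when the job ends
mpirun --host nodeA,nodeB -np 8 ./a.out
```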
Re: [OMPI users] Is OpenMPI's orted = MPICH2's smpd?
Afraid I don't know enough about MPICH2's different process managers (and why they need more than one) to answer that question. On Dec 21, 2009, at 9:29 PM, Sangamesh B wrote: > Hi, > > MPICh2 has different process managers: MPD, SMPD, GFORKER etc. Is the > Open MPI's startup daemon orted similar to MPICH2's smpd? Or something else? > > Thanks, > Sangamesh > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Memory corruption?
Hi. We have started to scale up one of our codes, and sometimes we get messages like this: [c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from the registration cache. Possible memory corruption. It seems like the application runs normally, and it does not crash because of this. Should we be worried? We have tested the code with up to 1700 cores, and the message becomes more frequent as we scale up. System details: Rocks 5.2 (aka CentOS 5.3) x86_64 INTEL Compiler 11.1 OFED 1.4.1 OpenMPI 1.3.3 Best regards and Merry Christmas to all, r. -- The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.drags...@uit.no
[OMPI users] MTT -trivial :All tests are not getting passed
*Hi All,* I have one issue with the MTT trivial tests: not all tests are passing. Please read below for a detailed description. Today I ran the MTT trivial tests with the latest OFED package OFED-1.5-20091217-0600 (ompi-1.4) between two machines. I was able to run the MTT trivial tests manually but not through the MTT framework. I think we are missing some configuration steps, since it is unable to find the test executables in the test run phase of MTT. -> When we ran it through MTT, it gave us an error and exited. I ran the test as "cat developer.ini trivial.ini | ../client/mtt --verbose - " -> When we analyzed the error from /root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt, we found that it is not finding the executables of the different tests to execute. -> Then we found that those executables were being generated on only one of the two machines. So we manually copied the tests from /root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring to the other machine. -> And we ran it manually as shown below and it worked fine: mpirun --host 102.77.77.64,102.77.77.68 -np 2 --mca btl openib,sm,self --prefix /usr/mpi/gcc/openmpi-1.4/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring -> I am attaching the files trivial.ini, developer.ini, and /root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt. Let us know if I am missing some configuration steps. 
NOTE: It gave me the following output at the end of the test command, and the same is saved in /root/mtt-svn/samples/All_phase-summary.txt

hostname: nizam
uname: Linux nizam 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
who am i:

| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                      |
| MPI Install | my installation | 1.4         | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-1.4.html |
| Test Build  | trivial         | 1.4         | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-1.4.html          |
| Test Run    | trivial         | 1.4         | 00:10    |      | 8    |          |      | Test_Run-trivial-my_installation-1.4.html            |

Total Tests: 10
Total Failures: 8
Total Passed: 2
Total Duration: 11 secs. (00:11)

Thanks & Regards,

Vishal shorghar
MTS
Chelsio Communication

#
# Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
#
# Template MTT configuration file for Open MPI developers. The intent
# for this template file is to establish at least some loose
# guidelines for what Open MPI core developers should be running
# before committing changes to the OMPI repository. This file is not
# intended to be an exhaustive sample of all possible fields and
# values that MTT offers. Each developer will undoubtedly have to
# edit this template for their own needs (e.g., pick compilers to use,
# etc.), but this file provides a baseline set of configurations that
# we intend for you to run.
#
# Sample usage:
#   cat developer.ini intel.ini | client/mtt - alreadyinstalled_dir=/your/install
#   cat developer.ini trivial.ini | client/mtt - alreadyinstalled_dir=/your/install
#

[MTT]
# No overrides to defaults

# Fill this field in
#hostlist = 102.77.77.63 102.77.77.54 102.77.77.64 102.77.77.68
#hostlist = 102.77.77.66 102.77.77.68 102.77.77.63 102.77.77.64 102.77.77.53 102.77.77.54 102.77.77.243 102.77.77.65
hostlist = 102.77.77.64 102.77.77.68
hostlist_max_np = 2
max_np = 2
force = 1
#prefix = /usr/mpi/gcc/openmpi-1.3.4/bin

#--

[MPI Details: Open MPI]

exec = mpirun @hosts@ -np &test_np() @mca@ --prefix &test_prefix() &test_executable() &test_argv()
mca = --mca btl openib,sm,self
hosts = <

+--+---+
| Field       | Value |
+--+---+
| description |       |
| environment |       |
| exit_signal | -1