The example you list below has all MPICH paths -- I don't see any Open MPI setups in there.
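The PATH-ordering pitfall, and why absolute paths sidestep it, is easy to demonstrate with throwaway stand-in scripts (the temp-dir "installs" below are fabricated for the demo; no real MPI install is touched):

```shell
# Demo with fake mpicc stand-ins: whichever install dir comes first in
# PATH wins a bare "mpicc", while an absolute path is unambiguous.
tmp=$(mktemp -d)
mkdir -p "$tmp/mpich/bin" "$tmp/openmpi/bin"
printf '#!/bin/sh\necho mpich\n'   > "$tmp/mpich/bin/mpicc"
printf '#!/bin/sh\necho openmpi\n' > "$tmp/openmpi/bin/mpicc"
chmod +x "$tmp/mpich/bin/mpicc" "$tmp/openmpi/bin/mpicc"

PATH="$tmp/mpich/bin:$tmp/openmpi/bin:$PATH"

mpicc                      # prints "mpich" -- first match on PATH wins
"$tmp/openmpi/bin/mpicc"   # prints "openmpi" -- absolute path, no ambiguity
```

The same reasoning applies to mpirun/mpiexec: a bare invocation silently picks whichever implementation your PATH happens to favor.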
What I was suggesting was that if you absolutely need to have both Open MPI and MPICH installed and in your PATH / LD_LIBRARY_PATH / MANPATH, then you can use the full, absolute path name to each of the Open MPI executables -- e.g., /path/to/openmpi/install/bin/mpicc, etc. That way, you can use Open MPI's mpicc without having it in your PATH.

Additionally, per https://www.open-mpi.org/faq/?category=running#mpirun-prefix, if you specify the absolute path name to mpirun (or mpiexec -- they're identical in Open MPI) and you're using the rsh/ssh launcher in Open MPI, then Open MPI will set the right PATH / LD_LIBRARY_PATH on remote servers for you. See the FAQ link for more detail.

> On Jun 1, 2016, at 8:41 AM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> 
> Hi!
> 
> Thank you, Jeff, for your suggestion. But I am still not able to understand
> what you mean by using absolute path names for mpicc/mpifort/mpirun/mpiexec.
> 
> This is what my .bashrc looks like:
> 
> source /opt/openfoam30/etc/bashrc
> 
> export PATH=/home/Desktop/mpich/bin:$PATH
> export LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib/:$LD_LIBRARY_PATH"
> export MPICH_F90=gfortran
> export MPICH_CC=/opt/intel/bin/icc
> export MPICH_CXX=/opt/intel/bin/icpc
> export MPICH_LINK_CXX="-L/home/Desktop/mpich/lib/ -Wl,-rpath
>   -Wl,/home/islem/Desktop/mpich/lib -lmpichcxx -lmpich -lopa -lmpl -lrt
>   -lpthread"
> 
> export PATH=$PATH:/opt/intel/bin/
> LD_LIBRARY_PATH="/opt/intel/lib/intel64:$LD_LIBRARY_PATH"
> export LD_LIBRARY_PATH
> source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/mpivars.sh intel64
> 
> alias startEMPIRE=". /home/islem/software/empire/EMPIRE-Core/etc/bashrc.sh ICC"
> 
> mpirun --version gives mpich 3.0.4
> 
> This is how I run one example that couples 2 clients through the server EMPIRE.
> I use three terminals; in each I write one of these command lines:
> 
> mpiexec -np 1 Emperor emperorInput.xml (I got a message in the terminal saying that Empire started)
> 
> mpiexec -np 1 dummyCSM dummyCSMInput (I got a message that Emperor acknowledged the connection)
> mpiexec -np 1 pimpleDyMFoam -case OF (I got no message in the terminal, which means no connection)
> 
> How can I use mpirun, and where do I write any modifications?
> 
> Regards,
> Islem
> 
> 
> On Friday, 27 May 2016 at 17:00, "users-requ...@open-mpi.org" <users-requ...@open-mpi.org> wrote:
> 
> 
> Today's Topics:
> 
>   1. Re: users Digest, Vol 3510, Issue 2 (Jeff Squyres (jsquyres))
>   2. Re: segmentation fault for slot-list and openmpi-1.10.3rc2 (Siegmar Gross)
>   3. OpenMPI virtualization aware (Marco D'Amico)
>   4. Re: OpenMPI virtualization aware (Ralph Castain)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 26 May 2016 23:28:17 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Megdich Islem <megdich_is...@yahoo.fr>, "Open MPI User's List" <us...@open-mpi.org>
> Cc: Dave Love <d.l...@liverpool.ac.uk>
> Subject: Re: [OMPI users] users Digest, Vol 3510, Issue 2
> Message-ID: <441f803d-fdbb-443d-82aa-74ff3845a...@cisco.com>
> Content-Type: text/plain; charset="utf-8"
> 
> You're still intermingling your Open MPI and MPICH installations.
> 
> You need to ensure that you use the wrapper compilers and mpirun/mpiexec from the same MPI implementation.
> For example, if you use mpicc/mpifort from Open MPI to build your program,
> then you must use Open MPI's mpirun/mpiexec.
> 
> If you absolutely need to have both MPI implementations in your PATH /
> LD_LIBRARY_PATH, you might want to use absolute path names for
> mpicc/mpifort/mpirun/mpiexec.
> 
> 
> > On May 26, 2016, at 3:46 PM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> > 
> > Thank you all for your suggestions!
> > 
> > I found an answer to a similar case in the Open MPI FAQ (question 15 of
> > "FAQ: Running MPI jobs" on www.open-mpi.org), which suggests using
> > mpirun's --prefix command line option or using the mpirun wrapper.
> > 
> > I modified my command to the following:
> > 
> > mpirun --prefix /opt/openfoam30/platforms/linux64GccDPInt32Opt/lib/Openmpi-system -np 1 pimpleDyMFoam -case OF
> > 
> > But I got an error (see attached picture). Is the syntax correct? How can
> > I solve the problem? That first method seems to be easier than using the
> > mpirun wrapper.
> > 
> > Otherwise, how can I use the mpirun wrapper?
> > 
> > Regards,
> > Islem
> > 
> > 
> > On Wednesday, 25 May 2016 at 16:40, Dave Love <d.l...@liverpool.ac.uk> wrote:
> > 
> > 
> > I wrote:
> > 
> > > You could wrap one (set of) program(s) in a script to set the
> > > appropriate environment before invoking the real program.
> > 
> > I realize I should have said something like "program invocations",
> > i.e. if you have no control over something invoking mpirun for programs
> > using different MPIs, then an mpirun wrapper needs to check what it's
> > being asked to run.
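The wrapper Dave describes can be sketched roughly as follows. The two install prefixes are made-up placeholders, and the sketch assumes that MPICH 3.0.x binaries link libmpich.so while Open MPI binaries link libmpi.so, and (crudely) that the program to launch is the last argument; adjust all of that for a real setup:

```shell
#!/bin/sh
# Sketch of an mpirun wrapper that inspects what it is asked to run and
# dispatches to the matching MPI's real mpirun.
# OMPI_MPIRUN / MPICH_MPIRUN are placeholder paths, not real installs.
OMPI_MPIRUN=/path/to/openmpi/bin/mpirun
MPICH_MPIRUN=/path/to/mpich/bin/mpirun

# Classify a program from its `ldd` output: MPICH 3.0.x links
# libmpich.so, Open MPI links libmpi.so.
mpi_flavor() {
    case "$1" in
        *libmpich*)  echo mpich ;;
        *libmpi.so*) echo openmpi ;;
        *)           echo unknown ;;
    esac
}

if [ $# -gt 0 ]; then
    # Crude assumption for the sketch: the program is the last argument
    # (true for invocations like "mpirun -np 1 prog").
    for prog in "$@"; do :; done
    case "$(mpi_flavor "$(ldd "$prog" 2>/dev/null)")" in
        mpich)   exec "$MPICH_MPIRUN" "$@" ;;
        openmpi) exec "$OMPI_MPIRUN" "$@" ;;
        *)       echo "cannot tell which MPI built $prog" >&2; exit 1 ;;
    esac
fi
```

A real wrapper would need to find the program among arbitrary mpirun options rather than assuming it comes last, but the dispatch idea is the same.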
> > 
> > 
> > <mpirun-error.png><path-to-open-mpi.png>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29317.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 27 May 2016 08:16:41 +0200
> From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] segmentation fault for slot-list and openmpi-1.10.3rc2
> Message-ID: <f5653a5c-174f-4569-c730-082a9db82...@informatik.hs-fulda.de>
> Content-Type: text/plain; charset=windows-1252; format=flowed
> 
> Hi Ralph,
> 
> 
> On 26.05.2016 at 17:38, Ralph Castain wrote:
> > I'm afraid I honestly can't make any sense of it. It seems
> > you at least have a simple workaround (use a hostfile instead
> > of -host), yes?
> 
> Only the combination "--host" and "--slot-list" breaks.
> Everything else works as expected. One more remark: as you
> can see below, this combination worked when using gdb and "next"
> after the breakpoint. The process blocks if I keep the
> enter key pressed down, and I have to kill simple_spawn in
> another window to get control back in gdb (<Ctrl-c> or
> anything else didn't work). I got this error yesterday
> evening.
> 
> ...
> (gdb) > ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbc0c) > at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738 > 738 if (OMPI_SUCCESS != (ret = ompi_file_init())) { > (gdb) > 744 if (OMPI_SUCCESS != (ret = ompi_win_init())) { > (gdb) > 750 if (OMPI_SUCCESS != (ret = ompi_attr_init())) { > (gdb) > 758 if (OMPI_SUCCESS != (ret = ompi_proc_complete_init())) { > (gdb) > 764 ret = MCA_PML_CALL(enable(true)); > (gdb) > 765 if( OMPI_SUCCESS != ret ) { > (gdb) > 771 if (NULL == (procs = ompi_proc_world(&nprocs))) { > (gdb) > 775 ret = MCA_PML_CALL(add_procs(procs, nprocs)); > (gdb) > 776 free(procs); > (gdb) > 780 if (OMPI_ERR_UNREACH == ret) { > (gdb) > 785 } else if (OMPI_SUCCESS != ret) { > (gdb) > 790 MCA_PML_CALL(add_comm(&ompi_mpi_comm_world.comm)); > (gdb) > 791 MCA_PML_CALL(add_comm(&ompi_mpi_comm_self.comm)); > (gdb) > 796 if (ompi_mpi_show_mca_params) { > (gdb) > 803 ompi_rte_wait_for_debugger(); > (gdb) > 807 if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) { > (gdb) > 817 coll = OBJ_NEW(ompi_rte_collective_t); > (gdb) > 818 coll->id = ompi_process_info.peer_init_barrier; > (gdb) > 819 coll->active = true; > (gdb) > 820 if (OMPI_SUCCESS != (ret = ompi_rte_barrier(coll))) { > (gdb) > 825 OMPI_WAIT_FOR_COMPLETION(coll->active); > (gdb) > > > > > > > > > > > > > > > Program received signal SIGTERM, Terminated. > 0x00007ffff7a7acd0 in opal_progress@plt () > from /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12 > (gdb) > Single stepping until exit from function opal_progress@plt, > which has no line number information. > [Thread 0x7ffff491b700 (LWP 19602) exited] > > Program terminated with signal SIGTERM, Terminated. > The program no longer exists. > (gdb) > The program is not being run. > (gdb) > ... 
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> >> On May 26, 2016, at 5:48 AM, Siegmar Gross
> >> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >> 
> >> Hi Ralph and Gilles,
> >> 
> >> it's strange that the program works with "--host" and "--slot-list"
> >> in your environment and not in mine. I get the following output if
> >> I run the program in gdb without a breakpoint.
> >> 
> >> 
> >> loki spawn 142 gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> >> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.9.1
> >> ...
> >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> (gdb) run
> >> Starting program: /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1
> >> --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> Detaching after fork from child process 18031.
> >> [pid 18031] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 18031] about to spawn!
> >> Detaching after fork from child process 18033.
> >> Detaching after fork from child process 18034.
> >> [pid 18033] starting up!
> >> [pid 18034] starting up!
> >> [loki:18034] *** Process received signal ***
> >> [loki:18034] Signal: Segmentation fault (11)
> >> ...
> >> 
> >> 
> >> 
> >> I get a different output if I run the program in gdb with
> >> a breakpoint.
> >> 
> >> gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> >> (gdb) set follow-fork-mode child
> >> (gdb) break ompi_proc_self
> >> (gdb) run
> >> (gdb) next
> >> 
> >> Repeating "next" very often results in the following output.
> >> 
> >> ...
> >> Starting program:
> >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> [pid 13277] starting up!
> >> [New Thread 0x7ffff42ef700 (LWP 13289)] > >> > >> Breakpoint 1, ompi_proc_self (size=0x7fffffffc060) > >> at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 > >> 413 ompi_proc_t **procs = (ompi_proc_t**) > >> malloc(sizeof(ompi_proc_t*)); > >> (gdb) n > >> 414 if (NULL == procs) { > >> (gdb) > >> 423 OBJ_RETAIN(ompi_proc_local_proc); > >> (gdb) > >> 424 *procs = ompi_proc_local_proc; > >> (gdb) > >> 425 *size = 1; > >> (gdb) > >> 426 return procs; > >> (gdb) > >> 427 } > >> (gdb) > >> ompi_comm_init () at > >> ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138 > >> 138 group->grp_my_rank = 0; > >> (gdb) > >> 139 group->grp_proc_count = (int)size; > >> ... > >> 193 ompi_comm_reg_init(); > >> (gdb) > >> 196 ompi_comm_request_init (); > >> (gdb) > >> 198 return OMPI_SUCCESS; > >> (gdb) > >> 199 } > >> (gdb) > >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffc21c) > >> at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738 > >> 738 if (OMPI_SUCCESS != (ret = ompi_file_init())) { > >> (gdb) > >> 744 if (OMPI_SUCCESS != (ret = ompi_win_init())) { > >> (gdb) > >> 750 if (OMPI_SUCCESS != (ret = ompi_attr_init())) { > >> ... > >> 988 ompi_mpi_initialized = true; > >> (gdb) > >> 991 if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) { > >> (gdb) > >> 999 return MPI_SUCCESS; > >> (gdb) > >> 1000 } > >> (gdb) > >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94 > >> 94 if (MPI_SUCCESS != err) { > >> (gdb) > >> 104 return MPI_SUCCESS; > >> (gdb) > >> 105 } > >> (gdb) > >> 0x0000000000400d0c in main () > >> (gdb) > >> Single stepping until exit from function main, > >> which has no line number information. > >> 0 completed MPI_Init > >> Parent [pid 13277] about to spawn! > >> [New process 13472] > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". 
> >> process 13472 is executing new program: > >> /usr/local/openmpi-1.10.3_64_gcc/bin/orted > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> [New process 13474] > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> process 13474 is executing new program: > >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn > >> [pid 13475] starting up! > >> [pid 13476] starting up! > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> [pid 13474] starting up! > >> [New Thread 0x7ffff491b700 (LWP 13480)] > >> [Switching to Thread 0x7ffff7ff1740 (LWP 13474)] > >> > >> Breakpoint 1, ompi_proc_self (size=0x7fffffffba30) > >> at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 > >> 413 ompi_proc_t **procs = (ompi_proc_t**) > >> malloc(sizeof(ompi_proc_t*)); > >> (gdb) > >> 414 if (NULL == procs) { > >> ... > >> 426 return procs; > >> (gdb) > >> 427 } > >> (gdb) > >> ompi_comm_init () at > >> ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138 > >> 138 group->grp_my_rank = 0; > >> (gdb) > >> 139 group->grp_proc_count = (int)size; > >> (gdb) > >> 140 OMPI_GROUP_SET_INTRINSIC (group); > >> ... > >> 193 ompi_comm_reg_init(); > >> (gdb) > >> 196 ompi_comm_request_init (); > >> (gdb) > >> 198 return OMPI_SUCCESS; > >> (gdb) > >> 199 } > >> (gdb) > >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbbec) > >> at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738 > >> 738 if (OMPI_SUCCESS != (ret = ompi_file_init())) { > >> (gdb) > >> 744 if (OMPI_SUCCESS != (ret = ompi_win_init())) { > >> (gdb) > >> 750 if (OMPI_SUCCESS != (ret = ompi_attr_init())) { > >> ... 
> >> 863 if (OMPI_SUCCESS != (ret = ompi_pubsub_base_select())) { > >> (gdb) > >> 869 if (OMPI_SUCCESS != (ret = > >> mca_base_framework_open(&ompi_dpm_base_framework, 0))) { > >> (gdb) > >> 873 if (OMPI_SUCCESS != (ret = ompi_dpm_base_select())) { > >> (gdb) > >> 884 if ( OMPI_SUCCESS != > >> (gdb) > >> 894 if (OMPI_SUCCESS != > >> (gdb) > >> 900 if (OMPI_SUCCESS != > >> (gdb) > >> 911 if (OMPI_SUCCESS != (ret = ompi_dpm.dyn_init())) { > >> (gdb) > >> Parent done with spawn > >> Parent sending message to child > >> 2 completed MPI_Init > >> Hello from the child 2 of 3 on host loki pid 13476 > >> 1 completed MPI_Init > >> Hello from the child 1 of 3 on host loki pid 13475 > >> 921 if (OMPI_SUCCESS != (ret = ompi_cr_init())) { > >> (gdb) > >> 931 opal_progress_event_users_decrement(); > >> (gdb) > >> 934 opal_progress_set_yield_when_idle(ompi_mpi_yield_when_idle); > >> (gdb) > >> 937 if (ompi_mpi_event_tick_rate >= 0) { > >> (gdb) > >> 946 if (OMPI_SUCCESS != (ret = ompi_mpiext_init())) { > >> (gdb) > >> 953 if (ret != OMPI_SUCCESS) { > >> (gdb) > >> 972 OBJ_CONSTRUCT(&ompi_registered_datareps, opal_list_t); > >> (gdb) > >> 977 OBJ_CONSTRUCT( &ompi_mpi_f90_integer_hashtable, > >> opal_hash_table_t); > >> (gdb) > >> 978 opal_hash_table_init(&ompi_mpi_f90_integer_hashtable, 16 /* why > >> not? 
*/); > >> (gdb) > >> 980 OBJ_CONSTRUCT( &ompi_mpi_f90_real_hashtable, opal_hash_table_t); > >> (gdb) > >> 981 opal_hash_table_init(&ompi_mpi_f90_real_hashtable, > >> FLT_MAX_10_EXP); > >> (gdb) > >> 983 OBJ_CONSTRUCT( &ompi_mpi_f90_complex_hashtable, > >> opal_hash_table_t); > >> (gdb) > >> 984 opal_hash_table_init(&ompi_mpi_f90_complex_hashtable, > >> FLT_MAX_10_EXP); > >> (gdb) > >> 988 ompi_mpi_initialized = true; > >> (gdb) > >> 991 if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) { > >> (gdb) > >> 999 return MPI_SUCCESS; > >> (gdb) > >> 1000 } > >> (gdb) > >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94 > >> 94 if (MPI_SUCCESS != err) { > >> (gdb) > >> 104 return MPI_SUCCESS; > >> (gdb) > >> 105 } > >> (gdb) > >> 0x0000000000400d0c in main () > >> (gdb) > >> Single stepping until exit from function main, > >> which has no line number information. > >> 0 completed MPI_Init > >> Hello from the child 0 of 3 on host loki pid 13474 > >> > >> Child 2 disconnected > >> Child 1 disconnected > >> Child 0 received msg: 38 > >> Parent disconnected > >> 13277: exiting > >> > >> Program received signal SIGTERM, Terminated. > >> 0x0000000000400f0a in main () > >> (gdb) > >> Single stepping until exit from function main, > >> which has no line number information. > >> [tcsetpgrp failed in terminal_inferior: No such process] > >> [Thread 0x7ffff491b700 (LWP 13480) exited] > >> > >> Program terminated with signal SIGTERM, Terminated. > >> The program no longer exists. > >> (gdb) > >> The program is not being run. > >> (gdb) > >> The program is not being run. 
> >> (gdb) info break > >> Num Type Disp Enb Address What > >> 1 breakpoint keep y 0x00007ffff7aa35c7 in ompi_proc_self > >> at > >> ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 inf 8, 7, 6, 5, 4, 3, 2, 1 > >> breakpoint already hit 2 times > >> (gdb) delete 1 > >> (gdb) r > >> Starting program: > >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> [pid 16708] starting up! > >> 0 completed MPI_Init > >> Parent [pid 16708] about to spawn! > >> [New process 16720] > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> process 16720 is executing new program: > >> /usr/local/openmpi-1.10.3_64_gcc/bin/orted > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> [New process 16722] > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> process 16722 is executing new program: > >> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn > >> [pid 16723] starting up! > >> [pid 16724] starting up! > >> [Thread debugging using libthread_db enabled] > >> Using host libthread_db library "/lib64/libthread_db.so.1". > >> [pid 16722] starting up! 
> >> Parent done with spawn > >> Parent sending message to child > >> 1 completed MPI_Init > >> Hello from the child 1 of 3 on host loki pid 16723 > >> 2 completed MPI_Init > >> Hello from the child 2 of 3 on host loki pid 16724 > >> 0 completed MPI_Init > >> Hello from the child 0 of 3 on host loki pid 16722 > >> Child 0 received msg: 38 > >> Child 0 disconnected > >> Parent disconnected > >> Child 1 disconnected > >> Child 2 disconnected > >> 16708: exiting > >> 16724: exiting > >> 16723: exiting > >> [New Thread 0x7ffff491b700 (LWP 16729)] > >> > >> Program received signal SIGTERM, Terminated. > >> [Switching to Thread 0x7ffff7ff1740 (LWP 16722)] > >> __GI__dl_debug_state () at dl-debug.c:74 > >> 74 dl-debug.c: No such file or directory. > >> (gdb) > >> -------------------------------------------------------------------------- > >> WARNING: A process refused to die despite all the efforts! > >> This process may still be running and/or consuming resources. > >> > >> Host: loki > >> PID: 16722 > >> > >> -------------------------------------------------------------------------- > >> > >> > >> The following simple_spawn processes exist now. > >> > >> loki spawn 171 ps -aef | grep simple_spawn > >> fd1026 11079 11053 0 14:00 pts/0 00:00:00 > >> /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host loki --slot-list > >> 0:0-1,1:0-1 simple_spawn > >> fd1026 11095 11079 29 14:01 pts/0 00:09:37 [simple_spawn] <defunct> > >> fd1026 16722 1 0 14:31 ? 00:00:00 [simple_spawn] <defunct> > >> fd1026 17271 29963 0 14:33 pts/2 00:00:00 grep simple_spawn > >> loki spawn 172 > >> > >> > >> Is it possible that there is a race condition? How can I help > >> to get a solution for my problem? 
> >> > >> > >> Kind regards > >> > >> Siegmar > >> > >> Am 24.05.2016 um 16:54 schrieb Ralph Castain: > >>> Works perfectly for me, so I believe this must be an environment issue - > >>> I am using gcc 6.0.0 on CentOS7 with x86: > >>> > >>> $ mpirun -n 1 -host bend001 --slot-list 0:0-1,1:0-1 --report-bindings > >>> ./simple_spawn > >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: > >>> [BB/BB/../../../..][BB/BB/../../../..] > >>> [pid 17601] starting up! > >>> 0 completed MPI_Init > >>> Parent [pid 17601] about to spawn! > >>> [pid 17603] starting up! > >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: > >>> [BB/BB/../../../..][BB/BB/../../../..] > >>> [bend001:17599] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket > >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: > >>> [BB/BB/../../../..][BB/BB/../../../..] > >>> [bend001:17599] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket > >>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: > >>> [BB/BB/../../../..][BB/BB/../../../..] > >>> [pid 17604] starting up! > >>> [pid 17605] starting up! 
> >>> Parent done with spawn > >>> Parent sending message to child > >>> 0 completed MPI_Init > >>> Hello from the child 0 of 3 on host bend001 pid 17603 > >>> Child 0 received msg: 38 > >>> 1 completed MPI_Init > >>> Hello from the child 1 of 3 on host bend001 pid 17604 > >>> 2 completed MPI_Init > >>> Hello from the child 2 of 3 on host bend001 pid 17605 > >>> Child 0 disconnected > >>> Child 2 disconnected > >>> Parent disconnected > >>> Child 1 disconnected > >>> 17603: exiting > >>> 17605: exiting > >>> 17601: exiting > >>> 17604: exiting > >>> $ > >>> > >>>> On May 24, 2016, at 7:18 AM, Siegmar Gross > >>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: > >>>> > >>>> Hi Ralph and Gilles, > >>>> > >>>> the program breaks only, if I combine "--host" and "--slot-list". > >>>> Perhaps this > >>>> information is helpful. I use a different machine now, so that you can > >>>> see that > >>>> the problem is not restricted to "loki". > >>>> > >>>> > >>>> pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler > >>>> absolute:" > >>>> OPAL repo revision: v1.10.2-201-gd23dda8 > >>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc > >>>> > >>>> > >>>> pc03 spawn 116 uname -a > >>>> Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 > >>>> (4354e1d) x86_64 x86_64 x86_64 GNU/Linux > >>>> > >>>> > >>>> pc03 spawn 117 cat host_pc03.openmpi > >>>> pc03.informatik.hs-fulda.de slots=12 max_slots=12 > >>>> > >>>> > >>>> pc03 spawn 118 mpicc simple_spawn.c > >>>> > >>>> > >>>> pc03 spawn 119 mpiexec -np 1 --report-bindings a.out > >>>> [pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: > >>>> [BB/../../../../..][../../../../../..] > >>>> [pid 3713] starting up! > >>>> 0 completed MPI_Init > >>>> Parent [pid 3713] about to spawn! 
> >>>> [pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket > >>>> 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt > >>>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > >>>> [../../../../../..][BB/BB/BB/BB/BB/BB] > >>>> [pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > >>>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: > >>>> [BB/BB/BB/BB/BB/BB][../../../../../..] > >>>> [pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket > >>>> 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt > >>>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > >>>> [../../../../../..][BB/BB/BB/BB/BB/BB] > >>>> [pid 3715] starting up! > >>>> [pid 3716] starting up! > >>>> [pid 3717] starting up! > >>>> Parent done with spawn > >>>> Parent sending message to child > >>>> 0 completed MPI_Init > >>>> Hello from the child 0 of 3 on host pc03 pid 3715 > >>>> 1 completed MPI_Init > >>>> Hello from the child 1 of 3 on host pc03 pid 3716 > >>>> 2 completed MPI_Init > >>>> Hello from the child 2 of 3 on host pc03 pid 3717 > >>>> Child 0 received msg: 38 > >>>> Child 0 disconnected > >>>> Child 2 disconnected > >>>> Parent disconnected > >>>> Child 1 disconnected > >>>> 3713: exiting > >>>> 3715: exiting > >>>> 3716: exiting > >>>> 3717: exiting > >>>> > >>>> > >>>> pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list > >>>> 0:0-1,1:0-1 --report-bindings a.out > >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pid 3731] starting up! > >>>> 0 completed MPI_Init > >>>> Parent [pid 3731] about to spawn! 
> >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pid 3733] starting up! > >>>> [pid 3734] starting up! > >>>> [pid 3735] starting up! > >>>> Parent done with spawn > >>>> Parent sending message to child > >>>> 2 completed MPI_Init > >>>> Hello from the child 2 of 3 on host pc03 pid 3735 > >>>> 1 completed MPI_Init > >>>> Hello from the child 1 of 3 on host pc03 pid 3734 > >>>> 0 completed MPI_Init > >>>> Hello from the child 0 of 3 on host pc03 pid 3733 > >>>> Child 0 received msg: 38 > >>>> Child 0 disconnected > >>>> Child 2 disconnected > >>>> Child 1 disconnected > >>>> Parent disconnected > >>>> 3731: exiting > >>>> 3734: exiting > >>>> 3733: exiting > >>>> 3735: exiting > >>>> > >>>> > >>>> pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1 > >>>> --report-bindings a.out > >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pid 3746] starting up! > >>>> 0 completed MPI_Init > >>>> Parent [pid 3746] about to spawn! > >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] 
> >>>> [pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket > >>>> 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt > >>>> 0-1]]: [BB/BB/../../../..][BB/BB/../../../..] > >>>> [pid 3748] starting up! > >>>> [pid 3749] starting up! > >>>> [pc03:03749] *** Process received signal *** > >>>> [pc03:03749] Signal: Segmentation fault (11) > >>>> [pc03:03749] Signal code: Address not mapped (1) > >>>> [pc03:03749] Failing at address: 0x8 > >>>> [pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870] > >>>> [pc03:03749] [ 1] > >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0] > >>>> [pc03:03749] [ 2] > >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08] > >>>> [pc03:03749] [ 3] > >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a] > >>>> [pc03:03749] [ 4] > >>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae] > >>>> [pc03:03749] [ 5] a.out[0x400d0c] > >>>> [pc03:03749] [ 6] > >>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe6f0989b05] > >>>> [pc03:03749] [ 7] a.out[0x400bf9] > >>>> [pc03:03749] *** End of error message *** > >>>> -------------------------------------------------------------------------- > >>>> mpiexec noticed that process rank 2 with PID 3749 on node pc03 exited on > >>>> signal 11 (Segmentation fault). > >>>> -------------------------------------------------------------------------- > >>>> pc03 spawn 122 > >>>> > >>>> > >>>> > >>>> Kind regards > >>>> > >>>> Siegmar > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> On 05/24/16 15:44, Ralph Castain wrote: > >>>>> > >>>>>> On May 24, 2016, at 6:21 AM, Siegmar Gross > >>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: > >>>>>> > >>>>>> Hi Ralph, > >>>>>> > >>>>>> I copy the relevant lines to this place, so that it is easier to see > >>>>>> what > >>>>>> happens. 
"a.out" is your program, which I compiled with mpicc. > >>>>>> > >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C > >>>>>>>> compiler > >>>>>>>> absolute:" > >>>>>>>> OPAL repo revision: v1.10.2-201-gd23dda8 > >>>>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc > >>>>>>>> loki spawn 154 mpicc simple_spawn.c > >>>>>> > >>>>>>>> loki spawn 155 mpiexec -np 1 a.out > >>>>>>>> [pid 24008] starting up! > >>>>>>>> 0 completed MPI_Init > >>>>>> ... > >>>>>> > >>>>>> "mpiexec -np 1 a.out" works. > >>>>>> > >>>>>> > >>>>>> > >>>>>>> I don?t know what ?a.out? is, but it looks like there is some memory > >>>>>>> corruption there. > >>>>>> > >>>>>> "a.out" is still your program. I get the same error on different > >>>>>> machines, so that it is not very likely, that the (hardware) memory > >>>>>> is corrupted. > >>>>>> > >>>>>> > >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out > >>>>>>>> [pid 24102] starting up! > >>>>>>>> 0 completed MPI_Init > >>>>>>>> Parent [pid 24102] about to spawn! > >>>>>>>> [pid 24104] starting up! > >>>>>>>> [pid 24105] starting up! > >>>>>>>> [loki:24105] *** Process received signal *** > >>>>>>>> [loki:24105] Signal: Segmentation fault (11) > >>>>>>>> [loki:24105] Signal code: Address not mapped (1) > >>>>>> ... > >>>>>> > >>>>>> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a > >>>>>> segmentation > >>>>>> faUlt. Can I do something, so that you can find out, what happens? > >>>>> > >>>>> I honestly have no idea - perhaps Gilles can help, as I have no access > >>>>> to that kind of environment. We aren?t seeing such problems elsewhere, > >>>>> so it is likely something local. 
> >>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> Kind regards
> >>>>>> 
> >>>>>> Siegmar
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> On 05/24/16 15:07, Ralph Castain wrote:
> >>>>>>> 
> >>>>>>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
> >>>>>>>> <siegmar.gr...@informatik.hs-fulda.de
> >>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
> >>>>>>>> 
> >>>>>>>> Hi Ralph,
> >>>>>>>> 
> >>>>>>>> thank you very much for your answer and your example program.
> >>>>>>>> 
> >>>>>>>> On 05/23/16 17:45, Ralph Castain wrote:
> >>>>>>>>> I cannot replicate the problem - both scenarios work fine for me.
> >>>>>>>>> I'm not convinced your test code is correct, however, as you call
> >>>>>>>>> Comm_free on the inter-communicator but didn't call Comm_disconnect.
> >>>>>>>>> Check out the attached for a correct code and see if it works for you.
> >>>>>>>> 
> >>>>>>>> I thought that I only needed MPI_Comm_disconnect if I had
> >>>>>>>> established a connection with MPI_Comm_connect before. The man page
> >>>>>>>> for MPI_Comm_free states:
> >>>>>>>> 
> >>>>>>>> "This operation marks the communicator object for deallocation. The
> >>>>>>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
> >>>>>>>> communicator will complete normally; the object is actually
> >>>>>>>> deallocated only if there are no other active references to it."
> >>>>>>>> 
> >>>>>>>> The man page for MPI_Comm_disconnect states:
> >>>>>>>> 
> >>>>>>>> "MPI_Comm_disconnect waits for all pending communication on comm to
> >>>>>>>> complete internally, deallocates the communicator object, and sets
> >>>>>>>> the handle to MPI_COMM_NULL. It is a collective operation."
> >>>>>>>> 
> >>>>>>>> I don't see a difference for my spawned processes, because both
> >>>>>>>> functions will "wait" until all pending operations have finished
> >>>>>>>> before the object is destroyed.
Nevertheless, perhaps my small example program worked all > >>>>>>>> the years > >>>>>>>> by chance. > >>>>>>>> > >>>>>>>> However, I don't understand why my program works with > >>>>>>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and > >>>>>>>> breaks with > >>>>>>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". > >>>>>>>> You are right, > >>>>>>>> my slot-list is equivalent to "-bind-to none". I could also have used > >>>>>>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master" which works > >>>>>>>> as well. > >>>>>>> > >>>>>>> Well, you are only giving us one slot when you specify "-host loki", > >>>>>>> and then > >>>>>>> you are trying to launch multiple processes into it. The "slot-list" > >>>>>>> option only > >>>>>>> tells us what cpus to bind each process to - it doesn't allocate > >>>>>>> process slots. > >>>>>>> So you have to tell us how many processes are allowed to run on this > >>>>>>> node. > >>>>>>> > >>>>>>>> > >>>>>>>> The program breaks with "There are not enough slots available in the > >>>>>>>> system > >>>>>>>> to satisfy ...", if I only use "--host loki" or different host names > >>>>>>>> without mentioning five host names, "slot-list", or > >>>>>>>> "oversubscribe". > >>>>>>>> Unfortunately "--host <host name>:<number of slots>" isn't available > >>>>>>>> in > >>>>>>>> openmpi-1.10.3rc2 to specify the number of available slots. > >>>>>>> > >>>>>>> Correct - we did not backport the new syntax. > >>>>>>> > >>>>>>>> > >>>>>>>> Your program behaves the same way as mine, so > >>>>>>>> MPI_Comm_disconnect > >>>>>>>> will not solve my problem. I had to modify your program in a > >>>>>>>> negligible way > >>>>>>>> to get it compiled.
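[To summarize the slot-allocation point in command form: the first two invocations are the ones from this thread; the host:N form is the newer syntax that, per Ralph, was not backported to 1.10.x. Illustrative only — requires an Open MPI installation and a host named loki.]

```shell
# Grant 5 slots by naming the host once per slot (works on 1.10.x):
mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

# Or allow more processes than allocated slots:
mpiexec -np 1 --host loki --oversubscribe spawn_master

# Newer Open MPI releases accept a slot count directly:
mpiexec -np 1 --host loki:5 spawn_master

# Note: --slot-list only controls which cpus each process is bound to;
# it does not allocate process slots.
```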
> >>>>>>>> > >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C > >>>>>>>> compiler absolute:" > >>>>>>>> OPAL repo revision: v1.10.2-201-gd23dda8 > >>>>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc > >>>>>>>> loki spawn 154 mpicc simple_spawn.c > >>>>>>>> loki spawn 155 mpiexec -np 1 a.out > >>>>>>>> [pid 24008] starting up! > >>>>>>>> 0 completed MPI_Init > >>>>>>>> Parent [pid 24008] about to spawn! > >>>>>>>> [pid 24010] starting up! > >>>>>>>> [pid 24011] starting up! > >>>>>>>> [pid 24012] starting up! > >>>>>>>> Parent done with spawn > >>>>>>>> Parent sending message to child > >>>>>>>> 0 completed MPI_Init > >>>>>>>> Hello from the child 0 of 3 on host loki pid 24010 > >>>>>>>> 1 completed MPI_Init > >>>>>>>> Hello from the child 1 of 3 on host loki pid 24011 > >>>>>>>> 2 completed MPI_Init > >>>>>>>> Hello from the child 2 of 3 on host loki pid 24012 > >>>>>>>> Child 0 received msg: 38 > >>>>>>>> Child 0 disconnected > >>>>>>>> Child 1 disconnected > >>>>>>>> Child 2 disconnected > >>>>>>>> Parent disconnected > >>>>>>>> 24012: exiting > >>>>>>>> 24010: exiting > >>>>>>>> 24008: exiting > >>>>>>>> 24011: exiting > >>>>>>>> > >>>>>>>> > >>>>>>>> Is something wrong with my command line? I didn't use slot-list > >>>>>>>> before, so > >>>>>>>> I'm not sure if I use it in the intended way. > >>>>>>> > >>>>>>> I don't know what "a.out" is, but it looks like there is some memory > >>>>>>> corruption > >>>>>>> there. > >>>>>>> > >>>>>>>> > >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out > >>>>>>>> [pid 24102] starting up! > >>>>>>>> 0 completed MPI_Init > >>>>>>>> Parent [pid 24102] about to spawn! > >>>>>>>> [pid 24104] starting up! > >>>>>>>> [pid 24105] starting up!
> >>>>>>>> [loki:24105] *** Process received signal *** > >>>>>>>> [loki:24105] Signal: Segmentation fault (11) > >>>>>>>> [loki:24105] Signal code: Address not mapped (1) > >>>>>>>> [loki:24105] Failing at address: 0x8 > >>>>>>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870] > >>>>>>>> [loki:24105] [ 1] > >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0] > >>>>>>>> [loki:24105] [ 2] > >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08] > >>>>>>>> [loki:24105] [ 3] *** An error occurred in MPI_Init > >>>>>>>> *** on a NULL communicator > >>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now > >>>>>>>> abort, > >>>>>>>> *** and potentially your MPI job) > >>>>>>>> [loki:24104] Local abort before MPI_INIT completed successfully; not > >>>>>>>> able to > >>>>>>>> aggregate error messages, and not able to guarantee that all other > >>>>>>>> processes > >>>>>>>> were killed! > >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a] > >>>>>>>> [loki:24105] [ 4] > >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae] > >>>>>>>> [loki:24105] [ 5] a.out[0x400d0c] > >>>>>>>> [loki:24105] [ 6] > >>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05] > >>>>>>>> [loki:24105] [ 7] a.out[0x400bf9] > >>>>>>>> [loki:24105] *** End of error message *** > >>>>>>>> ------------------------------------------------------- > >>>>>>>> Child job 2 terminated normally, but 1 process returned > >>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted. 
> >>>>>>>> ------------------------------------------------------- > >>>>>>>> -------------------------------------------------------------------------- > >>>>>>>> mpiexec detected that one or more processes exited with non-zero > >>>>>>>> status, thus > >>>>>>>> causing > >>>>>>>> the job to be terminated. The first process to do so was: > >>>>>>>> > >>>>>>>> Process name: [[49560,2],0] > >>>>>>>> Exit code: 1 > >>>>>>>> -------------------------------------------------------------------------- > >>>>>>>> loki spawn 157 > >>>>>>>> > >>>>>>>> > >>>>>>>> Hopefully, you will find out what happens. Please let me know if I > >>>>>>>> can > >>>>>>>> help you in any way. > >>>>>>>> > >>>>>>>> Kind regards > >>>>>>>> > >>>>>>>> Siegmar > >>>>>>>> > >>>>>>>> > >>>>>>>>> FWIW: I don't know how many cores you have on your sockets, but if > >>>>>>>>> you > >>>>>>>>> have 6 cores/socket, then your slot-list is equivalent to "--bind-to > >>>>>>>>> none" > >>>>>>>>> as the slot-list applies to every process being launched > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross > >>>>>>>>>> <siegmar.gr...@informatik.hs-fulda.de > >>>>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote: > >>>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server > >>>>>>>>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get > >>>>>>>>>> a segmentation fault for "--slot-list" for one of my small > >>>>>>>>>> programs.
> >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C > >>>>>>>>>> compiler > >>>>>>>>>> absolute:" > >>>>>>>>>> OPAL repo revision: v1.10.2-201-gd23dda8 > >>>>>>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki > >>>>>>>>>> spawn_master > >>>>>>>>>> > >>>>>>>>>> Parent process 0 running on loki > >>>>>>>>>> I create 4 slave processes > >>>>>>>>>> > >>>>>>>>>> Parent process 0: tasks in MPI_COMM_WORLD: 1 > >>>>>>>>>> tasks in COMM_CHILD_PROCESSES local group: 1 > >>>>>>>>>> tasks in COMM_CHILD_PROCESSES remote group: 4 > >>>>>>>>>> > >>>>>>>>>> Slave process 0 of 4 running on loki > >>>>>>>>>> Slave process 1 of 4 running on loki > >>>>>>>>>> Slave process 2 of 4 running on loki > >>>>>>>>>> spawn_slave 2: argv[0]: spawn_slave > >>>>>>>>>> Slave process 3 of 4 running on loki > >>>>>>>>>> spawn_slave 0: argv[0]: spawn_slave > >>>>>>>>>> spawn_slave 1: argv[0]: spawn_slave > >>>>>>>>>> spawn_slave 3: argv[0]: spawn_slave > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 > >>>>>>>>>> spawn_master > >>>>>>>>>> > >>>>>>>>>> Parent process 0 running on loki > >>>>>>>>>> I create 4 slave processes > >>>>>>>>>> > >>>>>>>>>> [loki:17326] *** Process received signal *** > >>>>>>>>>> [loki:17326] Signal: Segmentation fault (11) > >>>>>>>>>> [loki:17326] Signal code: Address not mapped (1) > >>>>>>>>>> [loki:17326] Failing at address: 0x8 > >>>>>>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870] > >>>>>>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init > >>>>>>>>>> *** on a NULL communicator > >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now > >>>>>>>>>> abort, > >>>>>>>>>> *** and potentially your MPI job) > >>>>>>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; > 
>>>>>>>>>> not able to > >>>>>>>>>> aggregate error messages, and not able to guarantee that all other > >>>>>>>>>> processes > >>>>>>>>>> were killed! > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0] > >>>>>>>>>> [loki:17326] [ 2] > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08] > >>>>>>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init > >>>>>>>>>> *** on a NULL communicator > >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now > >>>>>>>>>> abort, > >>>>>>>>>> *** and potentially your MPI job) > >>>>>>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; > >>>>>>>>>> not able to > >>>>>>>>>> aggregate error messages, and not able to guarantee that all other > >>>>>>>>>> processes > >>>>>>>>>> were killed! > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a] > >>>>>>>>>> [loki:17326] [ 4] > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e] > >>>>>>>>>> [loki:17326] [ 5] spawn_slave[0x40097e] > >>>>>>>>>> [loki:17326] [ 6] > >>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05] > >>>>>>>>>> [loki:17326] [ 7] spawn_slave[0x400a54] > >>>>>>>>>> [loki:17326] *** End of error message *** > >>>>>>>>>> ------------------------------------------------------- > >>>>>>>>>> Child job 2 terminated normally, but 1 process returned > >>>>>>>>>> a non-zero exit code.. Per user-direction, the job has been > >>>>>>>>>> aborted. > >>>>>>>>>> ------------------------------------------------------- > >>>>>>>>>> -------------------------------------------------------------------------- > >>>>>>>>>> mpiexec detected that one or more processes exited with non-zero > >>>>>>>>>> status, > >>>>>>>>>> thus causing > >>>>>>>>>> the job to be terminated. 
The first process to do so was: > >>>>>>>>>> > >>>>>>>>>> Process name: [[56340,2],0] > >>>>>>>>>> Exit code: 1 > >>>>>>>>>> -------------------------------------------------------------------------- > >>>>>>>>>> loki spawn 122 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I would be grateful, if somebody can fix the problem. Thank you > >>>>>>>>>> very much for any help in advance. > >>>>>>>>>> > >>>>>>>>>> Kind regards > >>>>>>>>>> > >>>>>>>>>> Siegmar > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> users mailing list > >>>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org> > >>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>>>>>>> Link to this post: > >>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29281.php > >>>>>>>> <simple_spawn_modified.c> > > ------------------------------ > > Message: 3 > Date: Fri, 27 May 2016 09:14:42 +0000 > From: "Marco D'Amico" <marco.damic...@gmail.com> > To: us...@open-mpi.org > Subject: [OMPI users] OpenMPI virtualization aware > Message-ID: > <CABi-01XH+vdi2egBD=knen_cyxpecg0j-+3rtvnfnc6mtd+...@mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Hi I'm recently investigating in
virtualization used in the HPC field, and I > found out that MVAPICH has a "Virtualization aware" version that permits > overcoming the big latency problems of using a virtualization environment > for HPC. > > My question is whether there are any similar efforts in OpenMPI, since I would > eventually contribute to it. > > Best regards, > Marco D'Amico > > ------------------------------ > > Message: 4 > Date: Fri, 27 May 2016 06:45:05 -0700 > From: Ralph Castain <r...@open-mpi.org> > To: Open MPI Users <us...@open-mpi.org> > Subject: Re: [OMPI users] OpenMPI virtualization aware > Message-ID: <bbeb8e66-40b0-4688-8284-2113252e1...@open-mpi.org> > Content-Type: text/plain; charset="utf-8" > > Hi Marco > > OMPI has integrated support for the Singularity container: > > http://singularity.lbl.gov/index.html > > https://groups.google.com/a/lbl.gov/forum/#!forum/singularity > > It is in OMPI master now, and an early version is in 2.0 - the full > integration will be in 2.1. Singularity is undergoing changes for its 2.0 > release (so we'll need to do some updating of the OMPI integration), and > there is still plenty that can be done to further optimize its integration - > so contributions would be welcome! > > Ralph > > > > > On May 27, 2016, at 2:14 AM, Marco D'Amico <marco.damic...@gmail.com> wrote: > > > > Hi I'm recently investigating in virtualization used in the HPC field, and I > > found out that MVAPICH has a "Virtualization aware" version that permits > > overcoming the big latency problems of using a virtualization environment > > for HPC. > > > > My question is whether there are any similar efforts in OpenMPI, since I would > > eventually contribute to it.
> > > > Best regards, > > Marco D'Amico > > ------------------------------ > > End of users Digest, Vol 3514, Issue 1 > ************************************** > > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/06/29341.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/