I'm afraid I don't know much about OpenFOAM. You'll likely need to ask the OpenFOAM community for assistance.
> On Jun 1, 2016, at 6:13 PM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> 
> Hi!
> 
> Thank you, Jeff. I was able to run an OpenFOAM case by setting the absolute
> path name for mpiexec. But when I wanted to run a coupled case, in which
> OpenFOAM is coupled with dummyCSM through EMPIRE, using these three command
> lines:
> 
> mpiexec -np 1 Emperor emperorInput.xml
> mpiexec -np 1 dummyCSM dummyCSMInput
> mpiexec -np 1 pimpleDyMFoam -case OF
> 
> OpenFOAM was still not able to connect. The EMPIRE user guide says that
> Emperor (the server) has to recognize its clients, which are dummyCSM and
> OpenFOAM. For some reason Emperor recognizes dummyCSM but not OpenFOAM.
> 
> What could cause a server not to recognize a client?
> 
> Regards,
> Islem
> 
> 
> On Wednesday, June 1, 2016, 17:00, "users-requ...@open-mpi.org"
> <users-requ...@open-mpi.org> wrote:
> 
> Today's Topics:
> 
>    1. Re: Firewall settings for MPI communication (Jeff Squyres (jsquyres))
>    2. Re: users Digest, Vol 3514, Issue 1 (Jeff Squyres (jsquyres))
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 1 Jun 2016 13:02:22 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: "Open MPI User's List" <us...@open-mpi.org>
> Subject: Re: [OMPI users] Firewall settings for MPI communication
> 
> In addition, you might want to consider upgrading to Open MPI v1.10.x
> (v1.6.x is fairly ancient).
> 
> 
> > On Jun 1, 2016, at 7:46 AM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> > 
> > which network are your VMs using for communications?
> > if this is tcp, then you also have to specify a restricted set of allowed
> > ports for the tcp btl.
> > 
> > that would be something like
> > mpirun --mca btl_tcp_dynamic_ports 49990-50010 ...
> > 
> > please double check the Open MPI 1.6.5 parameter and syntax with
> > ompi_info --all
> > (or check the archives; I think I posted the correct command line a few
> > weeks ago)
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > On Wednesday, June 1, 2016, Ping Wang <ping.w...@asc-s.de> wrote:
> > I'm using Open MPI 1.6.5 to run OpenFOAM in parallel on several VMs on a
> > cloud. mpirun hangs without any error messages. I think this is a
> > firewall issue, because when I open all TCP ports (1-65535) in the VMs'
> > security group, mpirun works well. However, I was advised to open as few
> > ports as possible, so I have to limit MPI to a range of ports. I opened
> > the port range 49990-50010 for MPI communication and used the command
> > 
> > mpirun --mca oob_tcp_dynamic_ports 49990-50010 -np 4 --hostfile machines simpleFoam -parallel
> > 
> > But it still hangs. How can I specify a port range that Open MPI will
> > use? I appreciate any help you can provide.
> > 
> > Best,
> > 
> > Ping Wang
> > 
> > ------------------------------------------------------
> > Ping Wang
> > Automotive Simulation Center Stuttgart e.V.
> > Nobelstraße 15
> > D-70569 Stuttgart
> > Telefon: +49 711 699659-14
> > Fax: +49 711 699659-29
> > E-Mail: ping.w...@asc-s.de
> > Web: http://www.asc-s.de
> > ------------------------------------------------------
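> 
> Note that the runtime (OOB) connections and the MPI traffic (TCP BTL) are
> restricted by separate MCA parameters, so you likely need to limit both.
> A sketch of the idea only -- the exact parameter names differ between
> release series, so verify them for your build with
> "ompi_info --all | grep -e oob_tcp -e btl_tcp":
> 
>   # Restrict both the runtime sockets and the MPI traffic sockets to the
>   # 49990-50010 range that is open in the VMs' security group:
>   mpirun --mca oob_tcp_port_min_v4 49990 --mca oob_tcp_port_range_v4 20 \
>          --mca btl_tcp_port_min_v4 49990 --mca btl_tcp_port_range_v4 20 \
>          -np 4 --hostfile machines simpleFoam -parallel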
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Wed, 1 Jun 2016 13:51:55 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Megdich Islem <megdich_is...@yahoo.fr>, "Open MPI User's List"
>     <us...@open-mpi.org>
> Subject: Re: [OMPI users] users Digest, Vol 3514, Issue 1
> 
> The example you list below has all MPICH paths -- I don't see any Open MPI
> setup in there.
> 
> What I was suggesting was that if you absolutely need to have both Open MPI
> and MPICH installed and in your PATH / LD_LIBRARY_PATH / MANPATH, then you
> can use the full, absolute path name to each of the Open MPI executables --
> e.g., /path/to/openmpi/install/bin/mpicc, etc. That way, you can use Open
> MPI's mpicc without having it in your path.
> 
> Additionally, per
> https://www.open-mpi.org/faq/?category=running#mpirun-prefix, if you
> specify the absolute path name to mpirun (or mpiexec -- they're identical
> in Open MPI) and you're using the rsh/ssh launcher in Open MPI, then Open
> MPI will set the right PATH / LD_LIBRARY_PATH on the remote servers for
> you. See the FAQ link for more detail.
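> 
> For your coupled case, that means launching all three components with the
> same Open MPI mpiexec via its full path. A sketch, assuming Open MPI were
> installed under /opt/openmpi-1.10.2 (an illustrative prefix -- adjust it to
> wherever Open MPI actually lives on your machine):
> 
>   /opt/openmpi-1.10.2/bin/mpiexec -np 1 Emperor emperorInput.xml
>   /opt/openmpi-1.10.2/bin/mpiexec -np 1 dummyCSM dummyCSMInput
>   /opt/openmpi-1.10.2/bin/mpiexec -np 1 pimpleDyMFoam -case OF
> 
> Also keep in mind that if the three executables were built against
> different MPI implementations, no choice of mpiexec will make them work
> together.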
> 
> 
> > On Jun 1, 2016, at 8:41 AM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> > 
> > Hi!
> > 
> > Thank you, Jeff, for your suggestion. But I am still not able to
> > understand what you mean by using absolute path names for
> > mpicc/mpifort/mpirun/mpiexec.
> > 
> > This is how my .bashrc looks:
> > 
> > source /opt/openfoam30/etc/bashrc
> > 
> > export PATH=/home/Desktop/mpich/bin:$PATH
> > export LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib/:$LD_LIBRARY_PATH"
> > export MPICH_F90=gfortran
> > export MPICH_CC=/opt/intel/bin/icc
> > export MPICH_CXX=/opt/intel/bin/icpc
> > export MPICH_LINK_CXX="-L/home/Desktop/mpich/lib/ -Wl,-rpath
> > -Wl,/home/islem/Desktop/mpich/lib -lmpichcxx -lmpich -lopa -lmpl -lrt
> > -lpthread"
> > 
> > export PATH=$PATH:/opt/intel/bin/
> > LD_LIBRARY_PATH="/opt/intel/lib/intel64:$LD_LIBRARY_PATH"
> > export LD_LIBRARY_PATH
> > source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/mpivars.sh intel64
> > 
> > alias startEMPIRE=". /home/islem/software/empire/EMPIRE-Core/etc/bashrc.sh ICC"
> > 
> > mpirun --version gives mpich 3.0.4
> > 
> > This is how I run one example that couples two clients through the
> > EMPIRE server. I use three terminals, and in each I write one of these
> > command lines:
> > 
> > mpiexec -np 1 Emperor emperorInput.xml (I got a message in the terminal
> > saying that Emperor started)
> > 
> > mpiexec -np 1 dummyCSM dummyCSMInput (I get a message that Emperor
> > acknowledged the connection)
> > 
> > mpiexec -np 1 pimpleDyMFoam -case OF (I got no message in the terminal,
> > which means no connection)
> > 
> > How can I use mpirun here, and where do I write any modifications?
> > 
> > Regards,
> > Islem
> > 
> > 
> > On Friday, May 27, 2016, 17:00, "users-requ...@open-mpi.org"
> > <users-requ...@open-mpi.org> wrote:
> > 
> > Today's Topics:
> > 
> >    1. Re: users Digest, Vol 3510, Issue 2 (Jeff Squyres (jsquyres))
> >    2. Re: segmentation fault for slot-list and openmpi-1.10.3rc2
> >       (Siegmar Gross)
> >    3. OpenMPI virtualization aware (Marco D'Amico)
> >    4. Re: OpenMPI virtualization aware (Ralph Castain)
> > 
> > ----------------------------------------------------------------------
> > 
> > Message: 1
> > Date: Thu, 26 May 2016 23:28:17 +0000
> > From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> > To: Megdich Islem <megdich_is...@yahoo.fr>, "Open MPI User's List"
> >     <us...@open-mpi.org>
> > Cc: Dave Love <d.l...@liverpool.ac.uk>
> > Subject: Re: [OMPI users] users Digest, Vol 3510, Issue 2
> > 
> > You're still intermingling your Open MPI and MPICH installations.
> > 
> > You need to ensure to use the wrapper compilers and mpirun/mpiexec from
> > the same MPI implementation.
> > 
> > For example, if you use mpicc/mpifort from Open MPI to build your
> > program, then you must use Open MPI's mpirun/mpiexec.
> > 
> > If you absolutely need to have both MPI implementations in your PATH /
> > LD_LIBRARY_PATH, you might want to use absolute path names for
> > mpicc/mpifort/mpirun/mpiexec.
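> > 
> > A quick way to see which implementation a given command resolves to -- a
> > sketch; the exact version banners vary between releases:
> > 
> >   which mpiexec        # shows the first mpiexec found in your PATH
> >   mpiexec --version    # Open MPI's banner mentions "Open MPI"/"OpenRTE";
> >                        # MPICH's mentions "HYDRA"
> > 
> > Since "mpirun --version" reports mpich 3.0.4 for you, all three of your
> > launch commands are currently going through MPICH, not Open MPI.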
> > 
> > 
> > > On May 26, 2016, at 3:46 PM, Megdich Islem <megdich_is...@yahoo.fr> wrote:
> > > 
> > > Thank you all for your suggestions!
> > > 
> > > I found an answer to a similar case in the Open MPI FAQ (question 15 of
> > > "Running MPI jobs"), which suggests using mpirun's --prefix command
> > > line option or using the mpirun wrapper.
> > > 
> > > I modified my command to the following:
> > > 
> > > mpirun --prefix /opt/openfoam30/platforms/linux64GccDPInt32Opt/lib/Openmpi-system -np 1 pimpleDyMFoam -case OF
> > > 
> > > But I got an error (see attached picture). Is the syntax correct? How
> > > can I solve the problem? That first method seems to be easier than
> > > using the mpirun wrapper.
> > > 
> > > Otherwise, how can I use the mpirun wrapper?
> > > 
> > > Regards,
> > > islem
> > > 
> > > 
> > > On Wednesday, May 25, 2016, 16:40, Dave Love <d.l...@liverpool.ac.uk> wrote:
> > > 
> > > I wrote:
> > > 
> > > > You could wrap one (set of) program(s) in a script to set the
> > > > appropriate environment before invoking the real program.
> > > 
> > > I realize I should have said something like "program invocations",
> > > i.e. if you have no control over something invoking mpirun for programs
> > > using different MPIs, then an mpirun wrapper needs to check what it's
> > > being asked to run.
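> > 
> > If you do want the wrapper route Dave describes, a minimal sketch might
> > look like this (the /opt/openmpi-1.10.2 prefix is illustrative -- point
> > it at your real Open MPI installation):
> > 
> >   #!/bin/sh
> >   # ompi-mpirun: always launch with Open MPI, regardless of what is
> >   # first in the caller's PATH
> >   OMPI=/opt/openmpi-1.10.2
> >   PATH="$OMPI/bin:$PATH"; export PATH
> >   LD_LIBRARY_PATH="$OMPI/lib:$LD_LIBRARY_PATH"; export LD_LIBRARY_PATH
> >   exec "$OMPI/bin/mpirun" "$@"
> > 
> > Then "./ompi-mpirun -np 1 pimpleDyMFoam -case OF" uses Open MPI even
> > though MPICH comes first in your .bashrc.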
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > 
> > 
> > ------------------------------
> > 
> > Message: 2
> > Date: Fri, 27 May 2016 08:16:41 +0200
> > From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> > To: Open MPI Users <us...@open-mpi.org>
> > Subject: Re: [OMPI users] segmentation fault for slot-list and
> >     openmpi-1.10.3rc2
> > 
> > Hi Ralph,
> > 
> > 
> > On 26.05.2016 at 17:38, Ralph Castain wrote:
> > > I'm afraid I honestly can't make any sense of it. It seems
> > > you at least have a simple workaround (use a hostfile instead
> > > of -host), yes?
> > 
> > Only the combination "--host" and "--slot-list" breaks.
> > Everything else works as expected. One more remark: as you
> > can see below, this combination worked in gdb with "next"
> > after the breakpoint. The process blocks if I keep the
> > enter key pressed down, and I have to kill simple_spawn in
> > another window to get control back in gdb (<Ctrl-c> or
> > anything else didn't work). I got this error yesterday
> > evening.
> > 
> > ...
> > (gdb)
> > ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbc0c)
> >     at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> > 738       if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> > (gdb)
> > 744       if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> > (gdb)
> > 750       if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> > (gdb)
> > 758       if (OMPI_SUCCESS != (ret = ompi_proc_complete_init())) {
> > (gdb)
> > 764       ret = MCA_PML_CALL(enable(true));
> > (gdb)
> > 765       if( OMPI_SUCCESS != ret ) {
> > (gdb)
> > 771       if (NULL == (procs = ompi_proc_world(&nprocs))) {
> > (gdb)
> > 775       ret = MCA_PML_CALL(add_procs(procs, nprocs));
> > (gdb)
> > 776       free(procs);
> > (gdb)
> > 780       if (OMPI_ERR_UNREACH == ret) {
> > (gdb)
> > 785       } else if (OMPI_SUCCESS != ret) {
> > (gdb)
> > 790       MCA_PML_CALL(add_comm(&ompi_mpi_comm_world.comm));
> > (gdb)
> > 791       MCA_PML_CALL(add_comm(&ompi_mpi_comm_self.comm));
> > (gdb)
> > 796       if (ompi_mpi_show_mca_params) {
> > (gdb)
> > 803       ompi_rte_wait_for_debugger();
> > (gdb)
> > 807       if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> > (gdb)
> > 817       coll = OBJ_NEW(ompi_rte_collective_t);
> > (gdb)
> > 818       coll->id = ompi_process_info.peer_init_barrier;
> > (gdb)
> > 819       coll->active = true;
> > (gdb)
> > 820       if (OMPI_SUCCESS != (ret = ompi_rte_barrier(coll))) {
> > (gdb)
> > 825       OMPI_WAIT_FOR_COMPLETION(coll->active);
> > (gdb)
> > 
> > Program received signal SIGTERM, Terminated.
> > 0x00007ffff7a7acd0 in opal_progress@plt ()
> >     from /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12
> > (gdb)
> > Single stepping until exit from function opal_progress@plt,
> > which has no line number information.
> > [Thread 0x7ffff491b700 (LWP 19602) exited]
> > 
> > Program terminated with signal SIGTERM, Terminated.
> > The program no longer exists.
> > (gdb)
> > The program is not being run.
> > (gdb)
> > ...
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> > 
> > >> On May 26, 2016, at 5:48 AM, Siegmar Gross
> > >> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > >> 
> > >> Hi Ralph and Gilles,
> > >> 
> > >> it's strange that the program works with "--host" and "--slot-list"
> > >> in your environment and not in mine. I get the following output, if
> > >> I run the program in gdb without a breakpoint.
> > >> 
> > >> 
> > >> loki spawn 142 gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> > >> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.9.1
> > >> ...
> > >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> > >> (gdb) run
> > >> Starting program: /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> Detaching after fork from child process 18031.
> > >> [pid 18031] starting up!
> > >> 0 completed MPI_Init
> > >> Parent [pid 18031] about to spawn!
> > >> Detaching after fork from child process 18033.
> > >> Detaching after fork from child process 18034.
> > >> [pid 18033] starting up!
> > >> [pid 18034] starting up!
> > >> [loki:18034] *** Process received signal ***
> > >> [loki:18034] Signal: Segmentation fault (11)
> > >> ...
> > >> 
> > >> 
> > >> I get a different output, if I run the program in gdb with
> > >> a breakpoint.
> > >> 
> > >> gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> > >> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> > >> (gdb) set follow-fork-mode child
> > >> (gdb) break ompi_proc_self
> > >> (gdb) run
> > >> (gdb) next
> > >> 
> > >> Repeating "next" very often results in the following output.
> > >> 
> > >> ...
> > >> Starting program: /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [pid 13277] starting up!
> > >> [New Thread 0x7ffff42ef700 (LWP 13289)]
> > >> 
> > >> Breakpoint 1, ompi_proc_self (size=0x7fffffffc060)
> > >>     at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> > >> 413       ompi_proc_t **procs = (ompi_proc_t**) malloc(sizeof(ompi_proc_t*));
> > >> (gdb) n
> > >> 414       if (NULL == procs) {
> > >> (gdb)
> > >> 423       OBJ_RETAIN(ompi_proc_local_proc);
> > >> (gdb)
> > >> 424       *procs = ompi_proc_local_proc;
> > >> (gdb)
> > >> 425       *size = 1;
> > >> (gdb)
> > >> 426       return procs;
> > >> (gdb)
> > >> 427   }
> > >> (gdb)
> > >> ompi_comm_init () at ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> > >> 138       group->grp_my_rank    = 0;
> > >> (gdb)
> > >> 139       group->grp_proc_count = (int)size;
> > >> ...
> > >> 193       ompi_comm_reg_init();
> > >> (gdb)
> > >> 196       ompi_comm_request_init ();
> > >> (gdb)
> > >> 198       return OMPI_SUCCESS;
> > >> (gdb)
> > >> 199   }
> > >> (gdb)
> > >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffc21c)
> > >>     at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> > >> 738       if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> > >> (gdb)
> > >> 744       if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> > >> (gdb)
> > >> 750       if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> > >> ...
> > >> 988       ompi_mpi_initialized = true;
> > >> (gdb)
> > >> 991       if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> > >> (gdb)
> > >> 999       return MPI_SUCCESS;
> > >> (gdb)
> > >> 1000  }
> > >> (gdb)
> > >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> > >> 94        if (MPI_SUCCESS != err) {
> > >> (gdb)
> > >> 104       return MPI_SUCCESS;
> > >> (gdb)
> > >> 105   }
> > >> (gdb)
> > >> 0x0000000000400d0c in main ()
> > >> (gdb)
> > >> Single stepping until exit from function main,
> > >> which has no line number information.
> > >> 0 completed MPI_Init
> > >> Parent [pid 13277] about to spawn!
> > >> [New process 13472]
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> process 13472 is executing new program: /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [New process 13474]
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> process 13474 is executing new program: /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> > >> [pid 13475] starting up!
> > >> [pid 13476] starting up!
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [pid 13474] starting up!
> > >> [New Thread 0x7ffff491b700 (LWP 13480)]
> > >> [Switching to Thread 0x7ffff7ff1740 (LWP 13474)]
> > >> 
> > >> Breakpoint 1, ompi_proc_self (size=0x7fffffffba30)
> > >>     at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> > >> 413       ompi_proc_t **procs = (ompi_proc_t**) malloc(sizeof(ompi_proc_t*));
> > >> (gdb)
> > >> 414       if (NULL == procs) {
> > >> ...
> > >> 426       return procs;
> > >> (gdb)
> > >> 427   }
> > >> (gdb)
> > >> ompi_comm_init () at ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> > >> 138       group->grp_my_rank    = 0;
> > >> (gdb)
> > >> 139       group->grp_proc_count = (int)size;
> > >> (gdb)
> > >> 140       OMPI_GROUP_SET_INTRINSIC (group);
> > >> ...
> > >> 193       ompi_comm_reg_init();
> > >> (gdb)
> > >> 196       ompi_comm_request_init ();
> > >> (gdb)
> > >> 198       return OMPI_SUCCESS;
> > >> (gdb)
> > >> 199   }
> > >> (gdb)
> > >> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbbec)
> > >>     at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> > >> 738       if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> > >> (gdb)
> > >> 744       if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> > >> (gdb)
> > >> 750       if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> > >> ...
> > >> 863       if (OMPI_SUCCESS != (ret = ompi_pubsub_base_select())) {
> > >> (gdb)
> > >> 869       if (OMPI_SUCCESS != (ret = mca_base_framework_open(&ompi_dpm_base_framework, 0))) {
> > >> (gdb)
> > >> 873       if (OMPI_SUCCESS != (ret = ompi_dpm_base_select())) {
> > >> (gdb)
> > >> 884       if ( OMPI_SUCCESS !=
> > >> (gdb)
> > >> 894       if (OMPI_SUCCESS !=
> > >> (gdb)
> > >> 900       if (OMPI_SUCCESS !=
> > >> (gdb)
> > >> 911       if (OMPI_SUCCESS != (ret = ompi_dpm.dyn_init())) {
> > >> (gdb)
> > >> Parent done with spawn
> > >> Parent sending message to child
> > >> 2 completed MPI_Init
> > >> Hello from the child 2 of 3 on host loki pid 13476
> > >> 1 completed MPI_Init
> > >> Hello from the child 1 of 3 on host loki pid 13475
> > >> 921       if (OMPI_SUCCESS != (ret = ompi_cr_init())) {
> > >> (gdb)
> > >> 931       opal_progress_event_users_decrement();
> > >> (gdb)
> > >> 934       opal_progress_set_yield_when_idle(ompi_mpi_yield_when_idle);
> > >> (gdb)
> > >> 937       if (ompi_mpi_event_tick_rate >= 0) {
> > >> (gdb)
> > >> 946       if (OMPI_SUCCESS != (ret = ompi_mpiext_init())) {
> > >> (gdb)
> > >> 953       if (ret != OMPI_SUCCESS) {
> > >> (gdb)
> > >> 972       OBJ_CONSTRUCT(&ompi_registered_datareps, opal_list_t);
> > >> (gdb)
> > >> 977       OBJ_CONSTRUCT( &ompi_mpi_f90_integer_hashtable, opal_hash_table_t);
> > >> (gdb)
> > >> 978       opal_hash_table_init(&ompi_mpi_f90_integer_hashtable, 16 /* why not? */);
> > >> (gdb)
> > >> 980       OBJ_CONSTRUCT( &ompi_mpi_f90_real_hashtable, opal_hash_table_t);
> > >> (gdb)
> > >> 981       opal_hash_table_init(&ompi_mpi_f90_real_hashtable, FLT_MAX_10_EXP);
> > >> (gdb)
> > >> 983       OBJ_CONSTRUCT( &ompi_mpi_f90_complex_hashtable, opal_hash_table_t);
> > >> (gdb)
> > >> 984       opal_hash_table_init(&ompi_mpi_f90_complex_hashtable, FLT_MAX_10_EXP);
> > >> (gdb)
> > >> 988       ompi_mpi_initialized = true;
> > >> (gdb)
> > >> 991       if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> > >> (gdb)
> > >> 999       return MPI_SUCCESS;
> > >> (gdb)
> > >> 1000  }
> > >> (gdb)
> > >> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> > >> 94        if (MPI_SUCCESS != err) {
> > >> (gdb)
> > >> 104       return MPI_SUCCESS;
> > >> (gdb)
> > >> 105   }
> > >> (gdb)
> > >> 0x0000000000400d0c in main ()
> > >> (gdb)
> > >> Single stepping until exit from function main,
> > >> which has no line number information.
> > >> 0 completed MPI_Init
> > >> Hello from the child 0 of 3 on host loki pid 13474
> > >> 
> > >> Child 2 disconnected
> > >> Child 1 disconnected
> > >> Child 0 received msg: 38
> > >> Parent disconnected
> > >> 13277: exiting
> > >> 
> > >> Program received signal SIGTERM, Terminated.
> > >> 0x0000000000400f0a in main ()
> > >> (gdb)
> > >> Single stepping until exit from function main,
> > >> which has no line number information.
> > >> [tcsetpgrp failed in terminal_inferior: No such process]
> > >> [Thread 0x7ffff491b700 (LWP 13480) exited]
> > >> 
> > >> Program terminated with signal SIGTERM, Terminated.
> > >> The program no longer exists.
> > >> (gdb)
> > >> The program is not being run.
> > >> (gdb) info break
> > >> Num     Type           Disp Enb Address            What
> > >> 1       breakpoint     keep y   0x00007ffff7aa35c7 in ompi_proc_self
> > >>                                 at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 inf 8, 7, 6, 5, 4, 3, 2, 1
> > >>         breakpoint already hit 2 times
> > >> (gdb) delete 1
> > >> (gdb) r
> > >> Starting program: /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [pid 16708] starting up!
> > >> 0 completed MPI_Init
> > >> Parent [pid 16708] about to spawn!
> > >> [New process 16720]
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> process 16720 is executing new program: /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [New process 16722]
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> process 16722 is executing new program: /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> > >> [pid 16723] starting up!
> > >> [pid 16724] starting up!
> > >> [Thread debugging using libthread_db enabled]
> > >> Using host libthread_db library "/lib64/libthread_db.so.1".
> > >> [pid 16722] starting up!
> > >> Parent done with spawn
> > >> Parent sending message to child
> > >> 1 completed MPI_Init
> > >> Hello from the child 1 of 3 on host loki pid 16723
> > >> 2 completed MPI_Init
> > >> Hello from the child 2 of 3 on host loki pid 16724
> > >> 0 completed MPI_Init
> > >> Hello from the child 0 of 3 on host loki pid 16722
> > >> Child 0 received msg: 38
> > >> Child 0 disconnected
> > >> Parent disconnected
> > >> Child 1 disconnected
> > >> Child 2 disconnected
> > >> 16708: exiting
> > >> 16724: exiting
> > >> 16723: exiting
> > >> [New Thread 0x7ffff491b700 (LWP 16729)]
> > >> 
> > >> Program received signal SIGTERM, Terminated.
> > >> [Switching to Thread 0x7ffff7ff1740 (LWP 16722)]
> > >> __GI__dl_debug_state () at dl-debug.c:74
> > >> 74      dl-debug.c: No such file or directory.
> > >> (gdb)
> > >> --------------------------------------------------------------------------
> > >> WARNING: A process refused to die despite all the efforts!
> > >> This process may still be running and/or consuming resources.
> > >> 
> > >> Host: loki
> > >> PID:  16722
> > >> --------------------------------------------------------------------------
> > >> 
> > >> 
> > >> The following simple_spawn processes exist now.
> > >> 
> > >> loki spawn 171 ps -aef | grep simple_spawn
> > >> fd1026 11079 11053  0 14:00 pts/0 00:00:00 /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> > >> fd1026 11095 11079 29 14:01 pts/0 00:09:37 [simple_spawn] <defunct>
> > >> fd1026 16722     1  0 14:31 ?     00:00:00 [simple_spawn] <defunct>
> > >> fd1026 17271 29963  0 14:33 pts/2 00:00:00 grep simple_spawn
> > >> loki spawn 172
> > >> 
> > >> 
> > >> Is it possible that there is a race condition? How can I help
> > >> to get a solution for my problem?
> > >> 
> > >> 
> > >> Kind regards
> > >> 
> > >> Siegmar
> > >> 
> > >> On 24.05.2016 at 16:54, Ralph Castain wrote:
> > >>> Works perfectly for me, so I believe this must be an environment
> > >>> issue - I am using gcc 6.0.0 on CentOS7 with x86:
> > >>> 
> > >>> $ mpirun -n 1 -host bend001 --slot-list 0:0-1,1:0-1 --report-bindings ./simple_spawn
> > >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>> [pid 17601] starting up!
> > >>> 0 completed MPI_Init
> > >>> Parent [pid 17601] about to spawn!
> > >>> [pid 17603] starting up!
> > >>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>> [bend001:17599] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>> [bend001:17599] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>> [pid 17604] starting up!
> > >>> [pid 17605] starting up!
> > >>> Parent done with spawn
> > >>> Parent sending message to child
> > >>> 0 completed MPI_Init
> > >>> Hello from the child 0 of 3 on host bend001 pid 17603
> > >>> Child 0 received msg: 38
> > >>> 1 completed MPI_Init
> > >>> Hello from the child 1 of 3 on host bend001 pid 17604
> > >>> 2 completed MPI_Init
> > >>> Hello from the child 2 of 3 on host bend001 pid 17605
> > >>> Child 0 disconnected
> > >>> Child 2 disconnected
> > >>> Parent disconnected
> > >>> Child 1 disconnected
> > >>> 17603: exiting
> > >>> 17605: exiting
> > >>> 17601: exiting
> > >>> 17604: exiting
> > >>> $
> > >>> 
> > >>>> On May 24, 2016, at 7:18 AM, Siegmar Gross
> > >>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > >>>> 
> > >>>> Hi Ralph and Gilles,
> > >>>> 
> > >>>> the program breaks only if I combine "--host" and "--slot-list".
> > >>>> Perhaps this information is helpful. I use a different machine now,
> > >>>> so that you can see that the problem is not restricted to "loki".
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
> > >>>>   OPAL repo revision: v1.10.2-201-gd23dda8
> > >>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 116 uname -a
> > >>>> Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 (4354e1d) x86_64 x86_64 x86_64 GNU/Linux
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 117 cat host_pc03.openmpi
> > >>>> pc03.informatik.hs-fulda.de slots=12 max_slots=12
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 118 mpicc simple_spawn.c
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
> > >>>> [pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
> > >>>> [pid 3713] starting up!
> > >>>> 0 completed MPI_Init
> > >>>> Parent [pid 3713] about to spawn!
> > >>>> [pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
> > >>>> [pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
> > >>>> [pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
> > >>>> [pid 3715] starting up!
> > >>>> [pid 3716] starting up!
> > >>>> [pid 3717] starting up!
> > >>>> Parent done with spawn
> > >>>> Parent sending message to child
> > >>>> 0 completed MPI_Init
> > >>>> Hello from the child 0 of 3 on host pc03 pid 3715
> > >>>> 1 completed MPI_Init
> > >>>> Hello from the child 1 of 3 on host pc03 pid 3716
> > >>>> 2 completed MPI_Init
> > >>>> Hello from the child 2 of 3 on host pc03 pid 3717
> > >>>> Child 0 received msg: 38
> > >>>> Child 0 disconnected
> > >>>> Child 2 disconnected
> > >>>> Parent disconnected
> > >>>> Child 1 disconnected
> > >>>> 3713: exiting
> > >>>> 3715: exiting
> > >>>> 3716: exiting
> > >>>> 3717: exiting
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list 0:0-1,1:0-1 --report-bindings a.out
> > >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pid 3731] starting up!
> > >>>> 0 completed MPI_Init
> > >>>> Parent [pid 3731] about to spawn!
> > >>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pid 3733] starting up!
> > >>>> [pid 3734] starting up!
> > >>>> [pid 3735] starting up!
> > >>>> Parent done with spawn
> > >>>> Parent sending message to child
> > >>>> 2 completed MPI_Init
> > >>>> Hello from the child 2 of 3 on host pc03 pid 3735
> > >>>> 1 completed MPI_Init
> > >>>> Hello from the child 1 of 3 on host pc03 pid 3734
> > >>>> 0 completed MPI_Init
> > >>>> Hello from the child 0 of 3 on host pc03 pid 3733
> > >>>> Child 0 received msg: 38
> > >>>> Child 0 disconnected
> > >>>> Child 2 disconnected
> > >>>> Child 1 disconnected
> > >>>> Parent disconnected
> > >>>> 3731: exiting
> > >>>> 3734: exiting
> > >>>> 3733: exiting
> > >>>> 3735: exiting
> > >>>> 
> > >>>> 
> > >>>> pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1 --report-bindings a.out
> > >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pid 3746] starting up!
> > >>>> 0 completed MPI_Init
> > >>>> Parent [pid 3746] about to spawn!
> > >>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [BB/BB/../../../..][BB/BB/../../../..]
> > >>>> [pid 3748] starting up!
> > >>>> [pid 3749] starting up!
> > >>>> [pc03:03749] *** Process received signal ***
> > >>>> [pc03:03749] Signal: Segmentation fault (11)
> > >>>> [pc03:03749] Signal code: Address not mapped (1)
> > >>>> [pc03:03749] Failing at address: 0x8
> > >>>> [pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870]
> > >>>> [pc03:03749] [ 1] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0]
> > >>>> [pc03:03749] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08]
> > >>>> [pc03:03749] [ 3] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a]
> > >>>> [pc03:03749] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae]
> > >>>> [pc03:03749] [ 5] a.out[0x400d0c]
> > >>>> [pc03:03749] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe6f0989b05]
> > >>>> [pc03:03749] [ 7] a.out[0x400bf9]
> > >>>> [pc03:03749] *** End of error message ***
> > >>>> --------------------------------------------------------------------------
> > >>>> mpiexec noticed that process rank 2 with PID 3749 on node pc03 exited
> > >>>> on signal 11 (Segmentation fault).
> > >>>> --------------------------------------------------------------------------
> > >>>> pc03 spawn 122
> > >>>> 
> > >>>> 
> > >>>> Kind regards
> > >>>> 
> > >>>> Siegmar
> > >>>> 
> > >>>> 
> > >>>> On 05/24/16 15:44, Ralph Castain wrote:
> > >>>>> 
> > >>>>>> On May 24, 2016, at 6:21 AM, Siegmar Gross
> > >>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > >>>>>> 
> > >>>>>> Hi Ralph,
> > >>>>>> 
> > >>>>>> I copy the relevant lines to this place, so that it is easier to
> > >>>>>> see what happens. "a.out" is your program, which I compiled with
> > >>>>>> mpicc.
> > >>>>>> 
> > >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
> > >>>>>>>>   OPAL repo revision: v1.10.2-201-gd23dda8
> > >>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> > >>>>>>>> loki spawn 154 mpicc simple_spawn.c
> > >>>>>> 
> > >>>>>>>> loki spawn 155 mpiexec -np 1 a.out
> > >>>>>>>> [pid 24008] starting up!
> > >>>>>>>> 0 completed MPI_Init
> > >>>>>> ...
> > >>>>>> 
> > >>>>>> "mpiexec -np 1 a.out" works.
> > >>>>>> 
> > >>>>>> 
> > >>>>>>> I don't know what "a.out" is, but it looks like there is some
> > >>>>>>> memory corruption there.
> > >>>>>> 
> > >>>>>> "a.out" is still your program. I get the same error on different
> > >>>>>> machines, so it is not very likely that the (hardware) memory is
> > >>>>>> corrupted.
> > >>>>>> 
> > >>>>>> 
> > >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> > >>>>>>>> [pid 24102] starting up!
> > >>>>>>>> 0 completed MPI_Init
> > >>>>>>>> Parent [pid 24102] about to spawn!
> > >>>>>>>> [pid 24104] starting up!
> > >>>>>>>> [pid 24105] starting up!
> > >>>>>>>> [loki:24105] *** Process received signal ***
> > >>>>>>>> [loki:24105] Signal: Segmentation fault (11)
> > >>>>>>>> [loki:24105] Signal code: Address not mapped (1)
> > >>>>>> ...
> > >>>>>> 
> > >>>>>> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a
> > >>>>>> segmentation fault. Can I do something so that you can find out
> > >>>>>> what happens?
> > >>>>> 
> > >>>>> I honestly have no idea - perhaps Gilles can help, as I have no
> > >>>>> access to that kind of environment. We aren't seeing such problems
> > >>>>> elsewhere, so it is likely something local.
> > >>>>>> 
> > >>>>>> 
> > >>>>>> Kind regards
> > >>>>>> 
> > >>>>>> Siegmar
> > >>>>>> 
> > >>>>>> 
> > >>>>>> On 05/24/16 15:07, Ralph Castain wrote:
> > >>>>>>> 
> > >>>>>>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
> > >>>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > >>>>>>>> 
> > >>>>>>>> Hi Ralph,
> > >>>>>>>> 
> > >>>>>>>> thank you very much for your answer and your example program.
> > >>>>>>>> 
> > >>>>>>>> On 05/23/16 17:45, Ralph Castain wrote:
> > >>>>>>>>> I cannot replicate the problem - both scenarios work fine for
> > >>>>>>>>> me. I'm not convinced your test code is correct, however, as
> > >>>>>>>>> you call Comm_free on the inter-communicator but didn't call
> > >>>>>>>>> Comm_disconnect. Check out the attached for a correct code and
> > >>>>>>>>> see if it works for you.
> > >>>>>>>> 
> > >>>>>>>> I thought that I only need MPI_Comm_disconnect if I had
> > >>>>>>>> established a connection with MPI_Comm_connect before. The man
> > >>>>>>>> page for MPI_Comm_free states
> > >>>>>>>> 
> > >>>>>>>> "This operation marks the communicator object for deallocation.
> > >>>>>>>> The handle is set to MPI_COMM_NULL. Any pending operations that
> > >>>>>>>> use this communicator will complete normally; the object is
> > >>>>>>>> actually deallocated only if there are no other active
> > >>>>>>>> references to it."
> > >>>>>>>> 
> > >>>>>>>> The man page for MPI_Comm_disconnect states
> > >>>>>>>> 
> > >>>>>>>> "MPI_Comm_disconnect waits for all pending communication on comm
> > >>>>>>>> to complete internally, deallocates the communicator object, and
> > >>>>>>>> sets the handle to MPI_COMM_NULL. It is a collective operation."
> > >>>>>>>> 
> > >>>>>>>> I don't see a difference for my spawned processes, because both
> > >>>>>>>> functions will "wait" until all pending operations have finished
> > >>>>>>>> before the object is destroyed. Nevertheless, perhaps my small
> > >>>>>>>> example program worked all these years by chance.
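> > >>>>>>>> 
> > >>>>>>>> For reference, this is the pattern I take the two man pages to
> > >>>>>>>> describe for spawned processes -- a minimal sketch without error
> > >>>>>>>> handling, not your attached program itself:
> > >>>>>>>> 
> > >>>>>>>> #include <mpi.h>
> > >>>>>>>> 
> > >>>>>>>> int main (int argc, char **argv)
> > >>>>>>>> {
> > >>>>>>>>   MPI_Comm parent, intercomm;
> > >>>>>>>> 
> > >>>>>>>>   MPI_Init (&argc, &argv);
> > >>>>>>>>   MPI_Comm_get_parent (&parent);
> > >>>>>>>>   if (MPI_COMM_NULL == parent) {      /* parent process */
> > >>>>>>>>     MPI_Comm_spawn ("spawn_slave", MPI_ARGV_NULL, 4,
> > >>>>>>>>                     MPI_INFO_NULL, 0, MPI_COMM_SELF,
> > >>>>>>>>                     &intercomm, MPI_ERRCODES_IGNORE);
> > >>>>>>>>     /* ... communicate with the children via intercomm ... */
> > >>>>>>>>     MPI_Comm_disconnect (&intercomm); /* collective: waits for
> > >>>>>>>>                                          pending traffic, then
> > >>>>>>>>                                          sets MPI_COMM_NULL */
> > >>>>>>>>   } else {                            /* spawned child */
> > >>>>>>>>     /* ... communicate with the parent ... */
> > >>>>>>>>     MPI_Comm_disconnect (&parent);
> > >>>>>>>>   }
> > >>>>>>>>   MPI_Finalize ();
> > >>>>>>>>   return 0;
> > >>>>>>>> }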
> > >>>>>>>> 
> > >>>>>>>> However, I don't understand why my program works with
> > >>>>>>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and
> > >>>>>>>> breaks with
> > >>>>>>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master".
> > >>>>>>>> You are right, my slot-list is equivalent to "--bind-to none". I
> > >>>>>>>> could also have used
> > >>>>>>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master", which
> > >>>>>>>> works as well.
> > >>>>>>> 
> > >>>>>>> Well, you are only giving us one slot when you specify "--host
> > >>>>>>> loki", and then you are trying to launch multiple processes into
> > >>>>>>> it. The "slot-list" option only tells us what cpus to bind each
> > >>>>>>> process to - it doesn't allocate process slots. So you have to
> > >>>>>>> tell us how many processes are allowed to run on this node.
> > >>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> The program breaks with "There are not enough slots available in
> > >>>>>>>> the system to satisfy ..." if I only use "--host loki" or
> > >>>>>>>> different host names, without mentioning five host names, using
> > >>>>>>>> "slot-list", or "oversubscribe". Unfortunately
> > >>>>>>>> "--host <host name>:<number of slots>" isn't available in
> > >>>>>>>> openmpi-1.10.3rc2 to specify the number of available slots.
> > >>>>>>> 
> > >>>>>>> Correct - we did not backport the new syntax.
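> > >>>>>>> 
> > >>>>>>> On the 1.10 series you can get the same effect with a hostfile
> > >>>>>>> that declares the slots -- a sketch:
> > >>>>>>> 
> > >>>>>>>   $ cat my_hostfile
> > >>>>>>>   loki slots=12 max_slots=12
> > >>>>>>>   $ mpiexec -np 1 --hostfile my_hostfile --slot-list 0:0-5,1:0-5 spawn_master
> > >>>>>>> 
> > >>>>>>> or by repeating the host name once per slot, as you already do.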
> > >>>>>>>> 
> > >>>>>>>> Your program behaves the same way as mine, so
> > >>>>>>>> MPI_Comm_disconnect will not solve my problem. I had to modify
> > >>>>>>>> your program in a negligible way to get it compiled.
> > >>>>>>>> 
> > >>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
> > >>>>>>>>   OPAL repo revision: v1.10.2-201-gd23dda8
> > >>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> > >>>>>>>> loki spawn 154 mpicc simple_spawn.c
> > >>>>>>>> loki spawn 155 mpiexec -np 1 a.out
> > >>>>>>>> [pid 24008] starting up!
> > >>>>>>>> 0 completed MPI_Init
> > >>>>>>>> Parent [pid 24008] about to spawn!
> > >>>>>>>> [pid 24010] starting up!
> > >>>>>>>> [pid 24011] starting up!
> > >>>>>>>> [pid 24012] starting up!
> > >>>>>>>> Parent done with spawn
> > >>>>>>>> Parent sending message to child
> > >>>>>>>> 0 completed MPI_Init
> > >>>>>>>> Hello from the child 0 of 3 on host loki pid 24010
> > >>>>>>>> 1 completed MPI_Init
> > >>>>>>>> Hello from the child 1 of 3 on host loki pid 24011
> > >>>>>>>> 2 completed MPI_Init
> > >>>>>>>> Hello from the child 2 of 3 on host loki pid 24012
> > >>>>>>>> Child 0 received msg: 38
> > >>>>>>>> Child 0 disconnected
> > >>>>>>>> Child 1 disconnected
> > >>>>>>>> Child 2 disconnected
> > >>>>>>>> Parent disconnected
> > >>>>>>>> 24012: exiting
> > >>>>>>>> 24010: exiting
> > >>>>>>>> 24008: exiting
> > >>>>>>>> 24011: exiting
> > >>>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> Is something wrong with my command line? I didn't use slot-list
> > >>>>>>>> before, so I'm not sure whether I use it in the intended way.
> > >>>>>>> 
> > >>>>>>> I don't know what "a.out" is, but it looks like there is some
> > >>>>>>> memory corruption there.
> > >>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> > >>>>>>>> [pid 24102] starting up!
> > >>>>>>>> 0 completed MPI_Init
> > >>>>>>>> Parent [pid 24102] about to spawn!
> > >>>>>>>> [pid 24104] starting up!
> > >>>>>>>> [pid 24105] starting up!
> > >>>>>>>> [loki:24105] *** Process received signal ***
> > >>>>>>>> [loki:24105] Signal: Segmentation fault (11)
> > >>>>>>>> [loki:24105] Signal code: Address not mapped (1)
> > >>>>>>>> [loki:24105] Failing at address: 0x8
> > >>>>>>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
> > >>>>>>>> [loki:24105] [ 1] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
> > >>>>>>>> [loki:24105] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
> > >>>>>>>> [loki:24105] [ 3] *** An error occurred in MPI_Init
> > >>>>>>>> *** on a NULL communicator
> > >>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > >>>>>>>> *** and potentially your MPI job)
> > >>>>>>>> [loki:24104] Local abort before MPI_INIT completed successfully; not able to
> > >>>>>>>> aggregate error messages, and not able to guarantee that all other processes
> > >>>>>>>> were killed!
> > >>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
> > >>>>>>>> [loki:24105] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
> > >>>>>>>> [loki:24105] [ 5] a.out[0x400d0c]
> > >>>>>>>> [loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
> > >>>>>>>> [loki:24105] [ 7] a.out[0x400bf9]
> > >>>>>>>> [loki:24105] *** End of error message ***
> > >>>>>>>> -------------------------------------------------------
> > >>>>>>>> Child job 2 terminated normally, but 1 process returned
> > >>>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
> > >>>>>>>> -------------------------------------------------------
> > >>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>> mpiexec detected that one or more processes exited with non-zero
> > >>>>>>>> status, thus causing the job to be terminated. The first process
> > >>>>>>>> to do so was:
> > >>>>>>>> 
> > >>>>>>>>   Process name: [[49560,2],0]
> > >>>>>>>>   Exit code:    1
> > >>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>> loki spawn 157
> > >>>>>>>> 
> > >>>>>>>> 
> > >>>>>>>> Hopefully, you will find out what happens. Please let me know if
> > >>>>>>>> I can help you in any way.
> > >>>>>>>> 
> > >>>>>>>> Kind regards
> > >>>>>>>> 
> > >>>>>>>> Siegmar
> > >>>>>>>> 
> > >>>>>>>> 
> > >>>>>>>>> FWIW: I don't know how many cores you have on your sockets, but
> > >>>>>>>>> if you have 6 cores/socket, then your slot-list is equivalent
> > >>>>>>>>> to "--bind-to none", as the slot-list applies to every process
> > >>>>>>>>> being launched
> > >>>>>>>>> 
> > >>>>>>>>> 
> > >>>>>>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross
> > >>>>>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > >>>>>>>>>> 
> > >>>>>>>>>> Hi,
> > >>>>>>>>>> 
> > >>>>>>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
> > >>>>>>>>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
> > >>>>>>>>>> a segmentation fault with "--slot-list" for one of my small
> > >>>>>>>>>> programs.
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
> > >>>>>>>>>>   OPAL repo revision: v1.10.2-201-gd23dda8
> > >>>>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
> > >>>>>>>>>> 
> > >>>>>>>>>> Parent process 0 running on loki
> > >>>>>>>>>>   I create 4 slave processes
> > >>>>>>>>>> 
> > >>>>>>>>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
> > >>>>>>>>>>                   tasks in COMM_CHILD_PROCESSES local group:  1
> > >>>>>>>>>>                   tasks in COMM_CHILD_PROCESSES remote group: 4
> > >>>>>>>>>> 
> > >>>>>>>>>> Slave process 0 of 4 running on loki
> > >>>>>>>>>> Slave process 1 of 4 running on loki
> > >>>>>>>>>> Slave process 2 of 4 running on loki
> > >>>>>>>>>> spawn_slave 2: argv[0]: spawn_slave
> > >>>>>>>>>> Slave process 3 of 4 running on loki
> > >>>>>>>>>> spawn_slave 0: argv[0]: spawn_slave
> > >>>>>>>>>> spawn_slave 1: argv[0]: spawn_slave
> > >>>>>>>>>> spawn_slave 3: argv[0]: spawn_slave
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
> > >>>>>>>>>> 
> > >>>>>>>>>> Parent process 0 running on loki
> > >>>>>>>>>>   I create 4 slave processes
> > >>>>>>>>>> 
> > >>>>>>>>>> [loki:17326] *** Process received signal ***
> > >>>>>>>>>> [loki:17326] Signal: Segmentation fault (11)
> > >>>>>>>>>> [loki:17326] Signal code: Address not mapped (1)
> > >>>>>>>>>> [loki:17326] Failing at address: 0x8
> > >>>>>>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
> > >>>>>>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
> > >>>>>>>>>> *** on a NULL communicator
> > >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > >>>>>>>>>> *** and potentially your MPI job)
> > >>>>>>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; not able to
> > >>>>>>>>>> aggregate error messages, and not able to guarantee that all other processes
> > >>>>>>>>>> were killed!
> > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
> > >>>>>>>>>> [loki:17326] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
> > >>>>>>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
> > >>>>>>>>>> *** on a NULL communicator
> > >>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > >>>>>>>>>> *** and potentially your MPI job)
> > >>>>>>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; not able to
> > >>>>>>>>>> aggregate error messages, and not able to guarantee that all other processes
> > >>>>>>>>>> were killed!
> > >>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
> > >>>>>>>>>> [loki:17326] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
> > >>>>>>>>>> [loki:17326] [ 5] spawn_slave[0x40097e]
> > >>>>>>>>>> [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
> > >>>>>>>>>> [loki:17326] [ 7] spawn_slave[0x400a54]
> > >>>>>>>>>> [loki:17326] *** End of error message ***
> > >>>>>>>>>> -------------------------------------------------------
> > >>>>>>>>>> Child job 2 terminated normally, but 1 process returned
> > >>>>>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
> > >>>>>>>>>> -------------------------------------------------------
> > >>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>> mpiexec detected that one or more processes exited with non-zero
> > >>>>>>>>>> status, thus causing the job to be terminated. The first process
> > >>>>>>>>>> to do so was:
> > >>>>>>>>>> 
> > >>>>>>>>>>   Process name: [[56340,2],0]
> > >>>>>>>>>>   Exit code:    1
> > >>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>> loki spawn 122
> > >>>>>>>>>> 
> > >>>>>>>>>> 
> > >>>>>>>>>> I would be grateful if somebody can fix the problem. Thank you
> > >>>>>>>>>> very much for any help in advance.
> > >>>>>>>>>> 
> > >>>>>>>>>> Kind regards
> > >>>>>>>>>> 
> > >>>>>>>>>> Siegmar
> > 
> > 
> > ------------------------------
> > 
> > Message: 3
> > Date: Fri, 27 May 2016 09:14:42 +0000
> > From: "Marco D'Amico" <marco.damic...@gmail.com>
> > To: us...@open-mpi.org
> > Subject: [OMPI users] OpenMPI virtualization aware
> > 
> > Hi, I'm currently investigating virtualization in the HPC field, and I
> > found that MVAPICH has a "virtualization aware" version that overcomes
> > the large latency penalty of running HPC workloads in a virtualized
> > environment.
> > 
> > My question is whether there is any similar effort in Open MPI, since I
> > would eventually like to contribute to it.
> > 
> > Best regards,
> > Marco D'Amico
> > 
> > ------------------------------
> > 
> > Message: 4
> > Date: Fri, 27 May 2016 06:45:05 -0700
> > From: Ralph Castain <r...@open-mpi.org>
> > To: Open MPI Users <us...@open-mpi.org>
> > Subject: Re: [OMPI users] OpenMPI virtualization aware
> > 
> > Hi Marco
> > 
> > OMPI has integrated support for the Singularity container:
> > 
> > http://singularity.lbl.gov/index.html
> > 
> > https://groups.google.com/a/lbl.gov/forum/#!forum/singularity
> > 
> > It is in OMPI master now, and an early version is in 2.0 - the full
> > integration will be in 2.1. Singularity is undergoing changes for its
> > 2.0 release (so we'll need to do some updating of the OMPI integration),
> > and there is still plenty that can be done to further optimize its
> > integration - so contributions would be welcome!
> > 
> > Ralph
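> > 
> > Roughly, the usage model looks like this sketch -- mpirun stays on the
> > host and each rank runs inside a container instance (the image name is
> > illustrative, and the image must contain an Open MPI build that matches
> > the host's):
> > 
> >   mpirun -np 4 singularity exec ./my_ompi_container.img /opt/app/my_mpi_app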
> > 
> > 
> > > On May 27, 2016, at 2:14 AM, Marco D'Amico <marco.damic...@gmail.com> wrote:
> > > 
> > > Hi, I'm currently investigating virtualization in the HPC field, and I
> > > found that MVAPICH has a "virtualization aware" version that overcomes
> > > the large latency penalty of running HPC workloads in a virtualized
> > > environment.
> > > 
> > > My question is whether there is any similar effort in Open MPI, since I
> > > would eventually like to contribute to it.
> > > 
> > > Best regards,
> > > Marco D'Amico
> > 
> > 
> > ------------------------------
> > 
> > End of users Digest, Vol 3514, Issue 1
> > **************************************
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ------------------------------
> 
> End of users Digest, Vol 3518, Issue 2
> **************************************


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/