Re: [OMPI users] seg fault with intel compiler
First of all, thanks to everyone who took the trouble to offer suggestions. The solution seems to be to upgrade the Intel compilers. However, I'm not the cluster admin, so other crucial changes may have been made as well. For example, I know that ssh was reconfigured over the weekend (but that shouldn't affect OMPI in a Torque environment). In any case, I went from version 12.1.0.233 (Build 20110811) to 12.1.4.319 (Build 20120410) and rebuilt Open MPI 1.6. After that, all tests worked, for any number of tasks.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
Re: [OMPI users] seg fault with intel compiler
On 06/01/2012 05:06 PM, Edmund Sumbar wrote:

  Thanks for the tips Gus. I'll definitely try some of these, particularly
  the nodes:ppn syntax, and report back.

You can check for Torque support with

  mpicc --showme

It should show, among other things, -ltorque (if it has Torque support) and -lrdmacm -libverbs (if it has OpenIB/InfiniBand support).

If Torque is not installed in a standard location (such as /usr or /usr/local), which is often the case, you may need to point configure to the Torque library with:

  --with-tm=/path/to/torque

Likewise for InfiniBand/OpenIB, if you have it:

  --with-openib=/path/to/openib

[I am citing these options from memory. Do a './configure --help' to check the right syntax, please.] Making a log file of your configure run may be helpful, to diagnose problems.

Finally, if I remember right, there was some problem reported on the list regarding Intel compilers 12.1. [I built 1.4.5 with Intel 11 and it works fine.] However, that problem may no longer apply to the latest OpenMPI 1.6.0. [The release notes will tell, or perhaps Jeff.]

I hope this helps,
Gus Correa

  Right now, I'm upgrading the Intel Compilers and rebuilding Open MPI.

  On Fri, Jun 1, 2012 at 2:39 PM, Gus Correa wrote:

  The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome, and may
  not be understood by the scheduler. [It doesn't work correctly with
  Maui, which is what we have here. I read people saying it works with
  pbs_sched and with Moab, but that's hearsay.] This issue comes up very
  often on the Torque mailing list.

  Have you tried instead this alternate syntax?

    -l nodes=2:ppn=24

  [I am assuming here that your nodes have 24 cores, i.e. 24 'ppn', each.]

  Then in the script:

    mpiexec -np 48 ./your_program

  Also, in your PBS script you could print the contents of PBS_NODEFILE:

    cat $PBS_NODEFILE

  A simple troubleshooting test is to launch 'hostname' with mpirun:

    mpirun -np 48 hostname

  Finally, are you sure that the OpenMPI you are using was compiled with
  Torque support? If not, I wonder if clauses like '-bynode' would work
  at all. Jeff may correct me if I am wrong, but if your OpenMPI lacks
  Torque support, you may need to pass to mpirun the $PBS_NODEFILE as
  your hostfile.

  -- 
  Edmund Sumbar
  University of Alberta
  +1 780 492 9360

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
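Gus's mpicc --showme check can be scripted as below. This is a hedged sketch of a build/verification fragment, runnable only where the Open MPI install is on the PATH; the configure paths in the comment are guesses based on this cluster's layout from the thread, not verified values.

```shell
# Check whether this Open MPI build links Torque and OpenIB support,
# per Gus's note: look for -ltorque and -lrdmacm/-libverbs.
mpicc --showme | tr ' ' '\n' | grep -E -e '-ltorque' -e '-lrdmacm' -e '-libverbs'

# If -ltorque is absent, reconfigure and rebuild, pointing configure at
# the Torque install (both paths below are assumptions, adapt to taste):
# ./configure --with-tm=/usr/local/torque \
#             --prefix=/lustre/jasper/software/openmpi/openmpi-1.6-intel \
#             2>&1 | tee configure.log
```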
Re: [OMPI users] seg fault with intel compiler
Thanks for the tips Gus. I'll definitely try some of these, particularly the nodes:ppn syntax, and report back. Right now, I'm upgrading the Intel Compilers and rebuilding Open MPI.

On Fri, Jun 1, 2012 at 2:39 PM, Gus Correa wrote:
> The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome, and may
> not be understood by the scheduler. [It doesn't work correctly with Maui,
> which is what we have here. I read people saying it works with pbs_sched
> and with Moab, but that's hearsay.] This issue comes up very often on the
> Torque mailing list.
>
> Have you tried instead this alternate syntax?
>
>   -l nodes=2:ppn=24
>
> [I am assuming here that your nodes have 24 cores, i.e. 24 'ppn', each.]
>
> Then in the script:
>
>   mpiexec -np 48 ./your_program
>
> Also, in your PBS script you could print the contents of PBS_NODEFILE:
>
>   cat $PBS_NODEFILE
>
> A simple troubleshooting test is to launch 'hostname' with mpirun:
>
>   mpirun -np 48 hostname
>
> Finally, are you sure that the OpenMPI you are using was compiled with
> Torque support? If not, I wonder if clauses like '-bynode' would work at
> all. Jeff may correct me if I am wrong, but if your OpenMPI lacks Torque
> support, you may need to pass to mpirun the $PBS_NODEFILE as your
> hostfile.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
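Gus's nodes:ppn and $PBS_NODEFILE suggestions can be assembled into one troubleshooting script. A sketch only: the resource line mirrors the thread, and the sample-nodefile fallback exists solely so the slot-counting logic can be read (and run) outside a batch job.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=24

# Inside a Torque job, $PBS_NODEFILE lists one line per allocated slot.
# Outside a job it is unset, so fall back to a small sample file.
NODEFILE="${PBS_NODEFILE:-sample_nodefile}"
[ -f "$NODEFILE" ] || printf 'cl2n004\ncl2n004\ncl2n005\ncl2n005\n' > "$NODEFILE"

cat "$NODEFILE"                        # raw slot list, as Gus suggests
sort "$NODEFILE" | uniq -c             # slots per node
NP=$(wc -l < "$NODEFILE" | tr -d ' ')  # total slots -> the -np value
echo "total slots: $NP"

# Sanity check before launching the real program (cluster only):
#   mpirun -np "$NP" hostname
```

With the sample data the script reports 4 total slots across 2 nodes; in a real job the counts come from the scheduler's allocation.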
Re: [OMPI users] seg fault with intel compiler
Hi Edmund,

The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome, and may not be understood by the scheduler. [It doesn't work correctly with Maui, which is what we have here. I read people saying it works with pbs_sched and with Moab, but that's hearsay.] This issue comes up very often on the Torque mailing list.

Have you tried instead this alternate syntax?

  -l nodes=2:ppn=24

[I am assuming here that your nodes have 24 cores, i.e. 24 'ppn', each.]

Then in the script:

  mpiexec -np 48 ./your_program

Also, in your PBS script you could print the contents of PBS_NODEFILE:

  cat $PBS_NODEFILE

A simple troubleshooting test is to launch 'hostname' with mpirun:

  mpirun -np 48 hostname

Finally, are you sure that the OpenMPI you are using was compiled with Torque support? If not, I wonder if clauses like '-bynode' would work at all. Jeff may correct me if I am wrong, but if your OpenMPI lacks Torque support, you may need to pass to mpirun the $PBS_NODEFILE as your hostfile.

I hope this helps,
Gus Correa

On 06/01/2012 11:26 AM, Edmund Sumbar wrote:

  On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres wrote:

    It's been a long time since I've run under PBS, so I don't remember if
    your script's environment is copied out to the remote nodes where your
    application actually runs. Can you verify that PATH and
    LD_LIBRARY_PATH are the same on all nodes in your PBS allocation
    after you module load?

  I compiled the following program and invoked it with
  "mpiexec -bynode ./test-env" in a PBS script.
#include "mpi.h" #include #include #include int main (int argc, char *argv[]) { int i, rank, size, namelen; MPI_Status stat; MPI_Init (, ); MPI_Comm_size (MPI_COMM_WORLD, ); MPI_Comm_rank (MPI_COMM_WORLD, ); printf("rank: %d: ld_library_path: %s\n", rank, getenv("LD_LIBRARY_PATH")); MPI_Finalize (); return (0); } I submitted the script with "qsub -l procs=24 job.pbs", and got rank: 4: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64 rank: 3: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64 ...more of the same... 
When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2af788a98113]
[cl2n004:11617] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2af788a9a8a9]
[cl2n004:11617] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2af788a9a596]
[cl2n004:11617] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited on signal 11 (Segmentation fault).
--

It seems that failures happen for arbitrary reasons. When I added a line in the PBS script to print out the node allocation, the procs=24 case failed, but then it worked a few seconds later, with the same list of allocated nodes.
Re: [OMPI users] seg fault with intel compiler
On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres wrote:
> It's been a long time since I've run under PBS, so I don't remember if
> your script's environment is copied out to the remote nodes where your
> application actually runs.
>
> Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in
> your PBS allocation after you module load?

I compiled the following program and invoked it with "mpiexec -bynode ./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[])
{
  int i, rank, size, namelen;
  MPI_Status stat;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  printf ("rank: %d: ld_library_path: %s\n", rank, getenv ("LD_LIBRARY_PATH"));
  MPI_Finalize ();
  return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

rank: 3: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...

When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2af788a98113]
[cl2n004:11617] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2af788a9a8a9]
[cl2n004:11617] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2af788a9a596]
[cl2n004:11617] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited on signal 11 (Segmentation fault).
--

It seems that failures happen for arbitrary reasons.
When I added a line in the PBS script to print out the node allocation, the procs=24 case failed, but then it worked a few seconds later, with the same list of allocated nodes. So there's definitely something amiss with the cluster, although I wouldn't know where to start investigating. Perhaps there is a pre-installed OMPI somewhere that's interfering, but I'm doubtful.

By the way, thanks for all the support.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
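Jeff's PATH/LD_LIBRARY_PATH question can be answered mechanically rather than by eyeball. A hedged sketch: the ssh loop in the comment is the assumed collection step on the cluster, and the sample data written here (with matching values) exists only so the mismatch check itself can run anywhere.

```shell
#!/bin/sh
# envs.txt is assumed to hold one "node <value>" line per node, e.g. from:
#   for n in $(sort -u "$PBS_NODEFILE"); do
#     echo "$n $(ssh "$n" 'echo $LD_LIBRARY_PATH')"
#   done > envs.txt
# Sample data stands in for real cluster output here.
[ -f envs.txt ] || printf 'cl2n004 /opt/ompi/lib\ncl2n005 /opt/ompi/lib\n' > envs.txt

# Count distinct LD_LIBRARY_PATH values; more than one means a mismatch.
distinct=$(cut -d' ' -f2- envs.txt | sort -u | wc -l | tr -d ' ')
if [ "$distinct" = "1" ]; then
  echo "LD_LIBRARY_PATH consistent across nodes"
else
  echo "WARNING: $distinct different LD_LIBRARY_PATH values"
fi
```

The same loop works for `which mpirun` or `ldd` output, which is what the thread ends up comparing by hand.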
Re: [OMPI users] seg fault with intel compiler
On Fri, Jun 1, 2012 at 5:00 AM, Jeff Squyres wrote:
> Try running:
>
>   which mpirun
>   ssh cl2n022 which mpirun
>   ssh cl2n010 which mpirun
>
> and
>
>   ldd your_mpi_executable
>   ssh cl2n022 ldd your_mpi_executable
>   ssh cl2n010 ldd your_mpi_executable
>
> Compare the results and ensure that you're finding the same mpirun on all
> nodes, and the same libmpi.so on all nodes. There may well be another Open
> MPI installed in some non-default location of which you're unaware.

I'll try that Jeff (results given below). However, I suspect there must be something goofy about this (brand new) cluster itself, because among the countless jobs that failed, I got one job that ran without error, and all I ever did was rearrange the echo and which commands. We've also observed some peculiar behaviour on this cluster using Intel MPI that seemed to be associated with the number of tasks requested. And after more experimentation, the Open MPI version of the program also seems to be sensitive to the number of tasks (e.g., works with 48, fails with 64). Thanks for the feedback Jeff, but I think the ball is firmly in my court.

I ran the following PBS script with "qsub -l procs=128 job.pbs". Environment variables are set using the Environment Modules package.
echo $HOSTNAME
which mpiexec
module load library/openmpi/1.6-intel
which mpiexec
echo $PATH
echo $LD_LIBRARY_PATH
ldd test-ompi16
mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel ./test-ompi16

Standard output gave

cl2n011
/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin/mpiexec
/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64:/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
linux-vdso.so.1 => (0x7fffb5358000)
libmpi.so.1 => /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 (0x2b3968d1d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00329ce0)
libimf.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libimf.so (0x2b3969137000)
libm.so.6 => /lib64/libm.so.6 (0x00329d20)
librt.so.1 => /lib64/librt.so.1 (0x00329da0)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0032a640)
libutil.so.1 => /lib64/libutil.so.1 (0x0032a840)
libsvml.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libsvml.so (0x2b3969504000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0032a4c0)
libintlc.so.5 => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libintlc.so.5 (0x2b3969c77000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00329d60)
libc.so.6 => /lib64/libc.so.6 (0x00329ca0)
/lib64/ld-linux-x86-64.so.2 (0x00329c20)

Standard error gave

which: no mpiexec in (/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin)
[cl2n005:05142] *** Process received signal ***
[cl2n005:05142] Signal: Segmentation fault (11)
[cl2n005:05142] Signal code: Address not mapped (1)
[cl2n005:05142] Failing at address: 0x10
[cl2n005:05142] [ 0] /lib64/libpthread.so.0 [0x373180ebe0]
[cl2n005:05142] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2aff9aad5113]
[cl2n005:05142] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2aff9aad78a9]
[cl2n005:05142] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2aff9aad7596]
[cl2n005:05142] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_grow+0x89) [0x2aff9aa0fa59]
[cl2n005:05142] [ 5] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_init_ex+0x9c) [0x2aff9aa0fd8c]
[cl2n005:05142] [ 6] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
Re: [OMPI users] seg fault with intel compiler
Try running:

  which mpirun
  ssh cl2n022 which mpirun
  ssh cl2n010 which mpirun

and

  ldd your_mpi_executable
  ssh cl2n022 ldd your_mpi_executable
  ssh cl2n010 ldd your_mpi_executable

Compare the results and ensure that you're finding the same mpirun on all nodes, and the same libmpi.so on all nodes. There may well be another Open MPI installed in some non-default location of which you're unaware.

On May 31, 2012, at 8:21 PM, Edmund Sumbar wrote:

> Thanks for the tip Jeff,
>
> I wish it was that simple. Unfortunately, this is the only version
> installed. When I added --prefix to the mpiexec command line, I still got
> a seg fault, but without the backtrace. Oh well, I'll keep trying
> (compiler upgrade etc).
>
> [cl2n022:03057] *** Process received signal ***
> [cl2n022:03057] Signal: Segmentation fault (11)
> [cl2n022:03057] Signal code: Address not mapped (1)
> [cl2n022:03057] Failing at address: 0x10
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n010:16470] *** Process received signal ***
> [cl2n010:16470] Signal: Segmentation fault (11)
> [cl2n010:16470] Signal code: Address not mapped (1)
> [cl2n010:16470] Failing at address: 0x10
> --
> mpiexec noticed that process rank 32 with PID 3057 on node cl2n022
> exited on signal 11 (Segmentation fault).
> --
>
> On Thu, May 31, 2012 at 2:54 PM, Jeff Squyres wrote:
> > This type of error usually means that you are inadvertently mixing
> > versions of Open MPI (e.g., version A.B.C on one node and D.E.F on
> > another node).
>
> -- 
> Edmund Sumbar
> University of Alberta
> +1 780 492 9360

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] seg fault with intel compiler
Thanks for the tip Jeff,

I wish it was that simple. Unfortunately, this is the only version installed. When I added --prefix to the mpiexec command line, I still got a seg fault, but without the backtrace. Oh well, I'll keep trying (compiler upgrade etc).

[cl2n022:03057] *** Process received signal ***
[cl2n022:03057] Signal: Segmentation fault (11)
[cl2n022:03057] Signal code: Address not mapped (1)
[cl2n022:03057] Failing at address: 0x10
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n010:16470] *** Process received signal ***
[cl2n010:16470] Signal: Segmentation fault (11)
[cl2n010:16470] Signal code: Address not mapped (1)
[cl2n010:16470] Failing at address: 0x10
--
mpiexec noticed that process rank 32 with PID 3057 on node cl2n022 exited on signal 11 (Segmentation fault).
--

On Thu, May 31, 2012 at 2:54 PM, Jeff Squyres wrote:
> This type of error usually means that you are inadvertently mixing
> versions of Open MPI (e.g., version A.B.C on one node and D.E.F on
> another node).

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360
Re: [OMPI users] seg fault with intel compiler
This type of error usually means that you are inadvertently mixing versions of Open MPI (e.g., version A.B.C on one node and D.E.F on another node). Ensure that your paths are set up consistently and that you're getting both the same OMPI tools in your $path and the same libmpi.so in your $LD_LIBRARY_PATH.

On May 31, 2012, at 3:43 PM, Edmund Sumbar wrote:

> Hi,
>
> I feel like a dope. I can't seem to successfully run the following simple
> test program (from the Intel MPI distro) as a Torque batch job on a
> CentOS 5.7 cluster with Open MPI 1.6 compiled using Intel compilers
> 12.1.0.233.
>
> If I comment out MPI_Get_processor_name(), it works.
>
> #include "mpi.h"
> #include <stdio.h>
> #include <string.h>
>
> int
> main (int argc, char *argv[])
> {
>   int i, rank, size, namelen;
>   char name[MPI_MAX_PROCESSOR_NAME];
>   MPI_Status stat;
>
>   MPI_Init (&argc, &argv);
>
>   MPI_Comm_size (MPI_COMM_WORLD, &size);
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>   MPI_Get_processor_name (name, &namelen);
>
>   if (rank == 0) {
>
>     printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
>
>     for (i = 1; i < size; i++) {
>       MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
>       MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
>       MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
>       MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
>       printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
>     }
>
>   } else {
>
>     MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>     MPI_Send (&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>     MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>     MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
>
>   }
>
>   MPI_Finalize ();
>
>   return (0);
> }
>
> The result I get is
>
> [cl2n007:19441] *** Process received signal ***
> [cl2n007:19441] Signal: Segmentation fault (11)
> [cl2n007:19441] Signal code: Address not mapped (1)
> [cl2n007:19441] Failing at address: 0x10
> [cl2n007:19441] [ 0] /lib64/libpthread.so.0 [0x306980ebe0]
> [cl2n007:19441] [ 1]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) > [0x2af078563113] > [cl2n007:19441] [ 2] > /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) > [0x2af0785658a9] > [cl2n007:19441] [ 3] > /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 > [0x2af078565596] > [cl2n007:19441] [ 4] > /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_class_initialize+0xaa) > [0x2af078582faa] > [cl2n007:19441] [ 5] > /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so > [0x2af07c3e1909] > [cl2n007:19441] [ 6] /lib64/libpthread.so.0 [0x306980677d] > [cl2n007:19441] [ 7] /lib64/libc.so.6(clone+0x6d) [0x3068cd325d] > [cl2n007:19441] *** End of error message *** > [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file util/nidmap.c at line 776 > [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file ess_tm_module.c at line 310 > [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file base/odls_base_default_fns.c at line[cl2n007:19434] > [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in > file util/nidmap.c at line 776 > 2342 > [cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file ess_tm_module.c at line 310 > [cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file base/odls_base_default_fns.c at line 2342 > [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file util/nidmap.c at line 776 > [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file ess_tm_module.c at line 310 > [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end > of buffer in file base/odls_base_default_fns.c at 
line 2342
>
> ...more of the same...
>
> $ ompi_info
>                  Package: Open MPI r...@jasper.westgrid.ca Distribution
>                 Open MPI: 1.6
>    Open MPI SVN revision: r26429
>    Open MPI release date: May 10, 2012
>                 Open RTE: 1.6
>    Open RTE SVN revision: r26429
>    Open RTE release date: May 10, 2012
>                     OPAL: 1.6
>        OPAL SVN revision: r26429
>        OPAL release date: May 10, 2012
>                  MPI API: 2.1
>             Ident string: 1.6
>                   Prefix: /lustre/jasper/software/openmpi/openmpi-1.6-intel
>  Configured architecture: x86_64-unknown-linux-gnu
>           Configure host: jasper.westgrid.ca
>            Configured by: root
>            Configured on: Wed May 30 13:56:39 MDT 2012
>           Configure host: jasper.westgrid.ca
>                 Built by: root
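The ompi_info listing above is truncated before the component list, but that list is where Gus's later question about Torque support gets answered. A hedged sketch, runnable only where this Open MPI install is on the PATH; the grep patterns assume Open MPI 1.x component naming, and the exact output wording may differ.

```shell
# Torque (tm) integration shows up as tm components in the MCA list;
# no matching lines suggests the build has no Torque support.
ompi_info | grep ' tm '

# Likewise, InfiniBand support appears as the openib BTL component:
ompi_info | grep openib
```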