Re: [OMPI users] seg fault with intel compiler

2012-06-05 Thread Edmund Sumbar
First of all, thanks to everyone who took the trouble to offer suggestions.

The solution seems to be to upgrade the Intel compilers. However, I'm not
the cluster admin, so other crucial changes may have been implemented. For
example, I know that ssh was reconfigured over the weekend (but that
shouldn't impact OMPI in a Torque environment).

In any case, I went from version 12.1.0.233 (Build 20110811) to 12.1.4.319
(Build 20120410), and rebuilt Open MPI 1.6. After that, all tests worked
for any number of tasks.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360


Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Gus Correa

On 06/01/2012 05:06 PM, Edmund Sumbar wrote:

Thanks for the tips Gus. I'll definitely try some of these, particularly
the nodes:ppn syntax, and report back.



You can check for torque support with

mpicc --showme

It should show among other things -ltorque [if it
has Torque support] and -lrdmacm -libverbs [if it
has OpenIB/InfiniBand support].
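For scripting that check, here is a hedged sketch. The link line below is
made up for illustration; on a real system you would capture it with
`mpicc --showme:link` instead.

```shell
#!/bin/sh
# has_support LINKLINE LIBNAME -> succeeds if -lLIBNAME appears in the link line
has_support() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -qx -- "-l$2"
}

# Illustrative link line; on a real install use: link=$(mpicc --showme:link)
link="-L/opt/openmpi/lib -lmpi -ltorque -lrdmacm -libverbs"
has_support "$link" torque  && echo "Torque support: yes"
has_support "$link" ibverbs && echo "OpenIB support: yes"
```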

If Torque is not installed in a standard location
[such as /usr or /usr/local],
which is often the case, you may need
to point configure to the Torque library with:

--with-tm=/path/to/torque

Likewise for InfiniBand/OpenIB if you have it:

--with-openib=/path/to/openib

[I am citing these options from memory.
Do a './configure --help' to check the right syntax, please.]

Making a log file of your configure run may be helpful, to
diagnose problems.
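As a sketch only: the paths and compiler names below are illustrative, not
your site's, and the option spellings should be verified against
'./configure --help'.

```shell
# Illustrative configure invocation with Torque and OpenIB support,
# logging the output for later diagnosis.
./configure --prefix=/opt/openmpi-1.6 \
    --with-tm=/usr/local/torque \
    --with-openib=/usr \
    CC=icc CXX=icpc F77=ifort FC=ifort 2>&1 | tee configure.log
```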

Finally, if I remember right, there was some problem
reported in the list regarding Intel compilers 12.1.
[I built 1.4.5 with Intel 11 and it works fine.]
However, that problem may have been superseded in
the latest OpenMPI 1.6.0.
[The release notes will tell, or perhaps Jeff.]

I hope this helps,
Gus Correa


Right now, I'm upgrading the Intel Compilers
and rebuilding Open MPI.


On Fri, Jun 1, 2012 at 2:39 PM, Gus Correa wrote:

The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome,
and may not be understood by the scheduler [It doesn't
work correctly with Maui, which is what we have here.  I read
people saying it works with pbs_sched and with Moab,
but that's hearsay.]
This issue comes back very often in the Torque mailing
list.

Have you tried instead this alternate syntax?

'-l nodes=2:ppn=24'

[I am assuming here that your
nodes have 24 cores, i.e. 24 'ppn', each]

Then in the script:
mpiexec -np 48 ./your_program


Also, in your PBS script you could print
the contents of PBS_NODEFILE.

cat $PBS_NODEFILE


A simple troubleshooting test is to launch 'hostname'
with mpirun

mpirun -np 48 hostname
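Putting those pieces together, a minimal test job might look like the
sketch below (the resource numbers assume the 24-core nodes mentioned
above; the job name is made up).

```shell
#!/bin/bash
#PBS -N ompi-test
#PBS -l nodes=2:ppn=24
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR

# Show which nodes Torque actually allocated
cat $PBS_NODEFILE

# Simplest possible launch test: no MPI code involved
mpirun -np 48 hostname
```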

Finally, are you sure that the OpenMPI you are using was
compiled with Torque support?
If not, I wonder if clauses like '-bynode' would work at all.
Jeff may correct me if I am wrong, but if your
OpenMPI lacks Torque support,
you may need to pass to mpirun
the $PBS_NODEFILE as your hostfile.
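In that case the launch line would look something like this sketch
('./your_program' stands in for the real executable):

```shell
# Fall back to an explicit hostfile when OpenMPI was built without tm support
mpirun -np 48 --hostfile $PBS_NODEFILE ./your_program
```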




--
Edmund Sumbar
University of Alberta
+1 780 492 9360



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Edmund Sumbar
Thanks for the tips Gus. I'll definitely try some of these, particularly
the nodes:ppn syntax, and report back.

Right now, I'm upgrading the Intel Compilers and rebuilding Open MPI.


On Fri, Jun 1, 2012 at 2:39 PM, Gus Correa  wrote:

> The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome,
> and may not be understood by the scheduler [It doesn't
> work correctly with Maui, which is what we have here.  I read
> people saying it works with pbs_sched and with Moab,
> but that's hearsay.]
> This issue comes back very often in the Torque mailing
> list.
>
> Have you tried instead this alternate syntax?
>
> '-l nodes=2:ppn=24'
>
> [I am assuming here that your
> nodes have 24 cores, i.e. 24 'ppn', each]
>
> Then in the script:
> mpiexec -np 48 ./your_program
>
>
> Also, in your PBS script you could print
> the contents of PBS_NODEFILE.
>
> cat $PBS_NODEFILE
>
>
> A simple troubleshooting test is to launch 'hostname'
> with mpirun
>
> mpirun -np 48 hostname
>
> Finally, are you sure that the OpenMPI you are using was
> compiled with Torque support?
> If not, I wonder if clauses like '-bynode' would work at all.
> Jeff may correct me if I am wrong, but if your
> OpenMPI lacks Torque support,
> you may need to pass to mpirun
> the $PBS_NODEFILE as your hostfile.
>



-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360


Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Gus Correa

Hi Edmund

The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome,
and may not be understood by the scheduler [It doesn't
work correctly with Maui, which is what we have here.  I read
people saying it works with pbs_sched and with Moab,
but that's hearsay.]
This issue comes back very often in the Torque mailing
list.

Have you tried instead this alternate syntax?

'-l nodes=2:ppn=24'

[I am assuming here that your
nodes have 24 cores, i.e. 24 'ppn', each]

Then in the script:
mpiexec -np 48 ./your_program


Also, in your PBS script you could print
the contents of PBS_NODEFILE.

cat $PBS_NODEFILE


A simple troubleshooting test is to launch 'hostname'
with mpirun

mpirun -np 48 hostname

Finally, are you sure that the OpenMPI you are using was
compiled with Torque support?
If not, I wonder if clauses like '-bynode' would work at all.
Jeff may correct me if I am wrong, but if your
OpenMPI lacks Torque support,
you may need to pass to mpirun
the $PBS_NODEFILE as your hostfile.

I hope this helps,
Gus Correa


On 06/01/2012 11:26 AM, Edmund Sumbar wrote:

On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres wrote:

It's been a long time since I've run under PBS, so I don't
remember if your script's environment is copied out to the remote
nodes where your application actually runs.

Can you verify that PATH and LD_LIBRARY_PATH are the same on all
nodes in your PBS allocation after you module load?


I compiled the following program and invoked it with "mpiexec -bynode
./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[])
{
   int i, rank, size, namelen;
   MPI_Status stat;

   MPI_Init (&argc, &argv);

   MPI_Comm_size (MPI_COMM_WORLD, &size);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);

   printf("rank: %d: ld_library_path: %s\n", rank,
getenv("LD_LIBRARY_PATH"));

   MPI_Finalize ();

   return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

rank: 3: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...

When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
[0x2af788a98113]
[cl2n004:11617] [ 2]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
[0x2af788a9a8a9]
[cl2n004:11617] [ 3]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
[0x2af788a9a596]
[cl2n004:11617] [ 4]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
[0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004
exited on signal 11 (Segmentation fault).
--

It seems that failures happen for arbitrary reasons. When I added a line
in the PBS script to print out the node allocation, the procs=24 case
failed, but then it worked a few seconds later, with the same list of

Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Edmund Sumbar
On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres  wrote:

> It's been a long time since I've run under PBS, so I don't remember if
> your script's environment is copied out to the remote nodes where your
> application actually runs.
>
> Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in
> your PBS allocation after you module load?
>

I compiled the following program and invoked it with "mpiexec -bynode
./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[])
{
  int i, rank, size, namelen;
  MPI_Status stat;

  MPI_Init (&argc, &argv);

  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  printf("rank: %d: ld_library_path: %s\n", rank,
getenv("LD_LIBRARY_PATH"));

  MPI_Finalize ();

  return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

rank: 3: ld_library_path:
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...

When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
[0x2af788a98113]
[cl2n004:11617] [ 2]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
[0x2af788a9a8a9]
[cl2n004:11617] [ 3]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
[0x2af788a9a596]
[cl2n004:11617] [ 4]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
[0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited
on signal 11 (Segmentation fault).
--

It seems that failures happen for arbitrary reasons. When I added a line in
the PBS script to print out the node allocation, the procs=24 case failed,
but then it worked a few seconds later, with the same list of allocated
nodes. So there's definitely something amiss with the cluster, although I
wouldn't know where to start investigating. Perhaps there is a
pre-installed OMPI somewhere that's interfering, but I'm doubtful.

By the way, thanks for all the support.

-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360


Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Edmund Sumbar
On Fri, Jun 1, 2012 at 5:00 AM, Jeff Squyres  wrote:

> Try running:
>
> which mpirun
> ssh cl2n022 which mpirun
> ssh cl2n010 which mpirun
>
> and
>
> ldd your_mpi_executable
> ssh cl2n022 ldd your_mpi_executable
> ssh cl2n010 ldd your_mpi_executable
>
> Compare the results and ensure that you're finding the same mpirun on all
> nodes, and the same libmpi.so on all nodes.  There may well be another Open
> MPI installed in some non-default location of which you're unaware.
>

I'll try that Jeff (results given below). However, I suspect there must be
something goofy about this (brand new) cluster itself because among the
countless jobs that failed, I got one job that ran without error, and all I
ever did was to rearrange the echo and which commands. We've also observed
some peculiar behaviour on this cluster using Intel MPI that seemed to be
associated with the number of tasks requested. And after more
experimentation, the Open MPI version of the program also seems to be
sensitive to the number of tasks (e.g., works with 48, fails with 64).

Thanks for the feedback Jeff, but I think the ball is firmly in my court.



I ran the following PBS script with "qsub -l procs=128 job.pbs".
Environment variables are set using the Environment Modules packages.

echo $HOSTNAME
which mpiexec
module load library/openmpi/1.6-intel
which mpiexec
echo $PATH
echo $LD_LIBRARY_PATH
ldd test-ompi16
mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel
./test-ompi16

Standard output gave

cl2n011

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin/mpiexec

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64:/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin

/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

linux-vdso.so.1 =>  (0x7fffb5358000)
libmpi.so.1 =>
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
(0x2b3968d1d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00329ce0)
libimf.so =>
/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libimf.so
(0x2b3969137000)
libm.so.6 => /lib64/libm.so.6 (0x00329d20)
librt.so.1 => /lib64/librt.so.1 (0x00329da0)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0032a640)
libutil.so.1 => /lib64/libutil.so.1 (0x0032a840)
libsvml.so =>
/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libsvml.so
(0x2b3969504000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0032a4c0)
libintlc.so.5 =>
/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libintlc.so.5
(0x2b3969c77000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00329d60)
libc.so.6 => /lib64/libc.so.6 (0x00329ca0)
/lib64/ld-linux-x86-64.so.2 (0x00329c20)


Standard error gave

which: no mpiexec in
(/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin)

[cl2n005:05142] *** Process received signal ***
[cl2n005:05142] Signal: Segmentation fault (11)
[cl2n005:05142] Signal code: Address not mapped (1)
[cl2n005:05142] Failing at address: 0x10
[cl2n005:05142] [ 0] /lib64/libpthread.so.0 [0x373180ebe0]
[cl2n005:05142] [ 1]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
[0x2aff9aad5113]
[cl2n005:05142] [ 2]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
[0x2aff9aad78a9]
[cl2n005:05142] [ 3]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
[0x2aff9aad7596]
[cl2n005:05142] [ 4]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_grow+0x89)
[0x2aff9aa0fa59]
[cl2n005:05142] [ 5]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_init_ex+0x9c)
[0x2aff9aa0fd8c]
[cl2n005:05142] [ 6]
/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so

Re: [OMPI users] seg fault with intel compiler

2012-06-01 Thread Jeff Squyres
Try running:

which mpirun
ssh cl2n022 which mpirun
ssh cl2n010 which mpirun

and

ldd your_mpi_executable
ssh cl2n022 ldd your_mpi_executable
ssh cl2n010 ldd your_mpi_executable

Compare the results and ensure that you're finding the same mpirun on all 
nodes, and the same libmpi.so on all nodes.  There may well be another Open MPI 
installed in some non-default location of which you're unaware.
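Those per-node checks can be wrapped in a small loop, sketched below. The
node names are the ones from this thread; in practice you would substitute
the hosts listed in your $PBS_NODEFILE, and './your_mpi_executable' stands
in for the real binary.

```shell
# Compare the MPI installation each node actually resolves
for node in cl2n022 cl2n010; do
  echo "== $node =="
  ssh $node which mpirun
  ssh $node ldd ./your_mpi_executable | grep libmpi
done
```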


On May 31, 2012, at 8:21 PM, Edmund Sumbar wrote:

> Thanks for the tip Jeff,
> 
> I wish it was that simple. Unfortunately, this is the only version installed. 
> When I added --prefix to the mpiexec command line, I still got a seg fault, 
> but without the backtrace. Oh well, I'll keep trying (compiler upgrade etc).
> 
> [cl2n022:03057] *** Process received signal ***
> [cl2n022:03057] Signal: Segmentation fault (11)
> [cl2n022:03057] Signal code: Address not mapped (1)
> [cl2n022:03057] Failing at address: 0x10
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n010:16470] *** Process received signal ***
> [cl2n010:16470] Signal: Segmentation fault (11)
> [cl2n010:16470] Signal code: Address not mapped (1)
> [cl2n010:16470] Failing at address: 0x10
> --
> mpiexec noticed that process rank 32 with PID 3057 on node cl2n022 exited on 
> signal 11 (Segmentation fault).
> --
> 
> 
> On Thu, May 31, 2012 at 2:54 PM, Jeff Squyres  wrote:
> This type of error usually means that you are inadvertently mixing versions 
> of Open MPI (e.g., version A.B.C on one node and D.E.F on another node).
> 
> 
> 
> -- 
> Edmund Sumbar
> University of Alberta
> +1 780 492 9360
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] seg fault with intel compiler

2012-05-31 Thread Edmund Sumbar
Thanks for the tip Jeff,

I wish it was that simple. Unfortunately, this is the only version
installed. When I added --prefix to the mpiexec command line, I still got a
seg fault, but without the backtrace. Oh well, I'll keep trying (compiler
upgrade etc).

[cl2n022:03057] *** Process received signal ***
[cl2n022:03057] Signal: Segmentation fault (11)
[cl2n022:03057] Signal code: Address not mapped (1)
[cl2n022:03057] Failing at address: 0x10
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n010:16470] *** Process received signal ***
[cl2n010:16470] Signal: Segmentation fault (11)
[cl2n010:16470] Signal code: Address not mapped (1)
[cl2n010:16470] Failing at address: 0x10
--
mpiexec noticed that process rank 32 with PID 3057 on node cl2n022 exited
on signal 11 (Segmentation fault).
--


On Thu, May 31, 2012 at 2:54 PM, Jeff Squyres  wrote:

> This type of error usually means that you are inadvertently mixing
> versions of Open MPI (e.g., version A.B.C on one node and D.E.F on another
> node).




-- 
Edmund Sumbar
University of Alberta
+1 780 492 9360


Re: [OMPI users] seg fault with intel compiler

2012-05-31 Thread Jeff Squyres
This type of error usually means that you are inadvertently mixing versions of 
Open MPI (e.g., version A.B.C on one node and D.E.F on another node).

Ensure that your paths are setup consistently and that you're getting both the 
same OMPI tools in your $path and the same libmpi.so in your $LD_LIBRARY_PATH.



On May 31, 2012, at 3:43 PM, Edmund Sumbar wrote:

> Hi,
> 
> I feel like a dope. I can't seem to successfully run the following simple 
> test program (from the Intel MPI distro) as a Torque batch job on a CentOS 5.7 
> cluster with Open MPI 1.6 compiled using Intel compilers 12.1.0.233.
> 
> If I comment out MPI_Get_processor_name(), it works.
> 
> #include "mpi.h"
> #include <stdio.h>
> #include <string.h>
> 
> int
> main (int argc, char *argv[])
> {
> int i, rank, size, namelen;
> char name[MPI_MAX_PROCESSOR_NAME];
> MPI_Status stat;
> 
> MPI_Init (&argc, &argv);
> 
> MPI_Comm_size (MPI_COMM_WORLD, &size);
> MPI_Comm_rank (MPI_COMM_WORLD, &rank);
> MPI_Get_processor_name (name, &namelen);
> 
> if (rank == 0) {
> 
> printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
> 
> for (i = 1; i < size; i++) {
> MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
> MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
> MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
> MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
> printf ("Hello world: rank %d of %d running on %s\n", rank, size, 
> name);
> }
> 
> } else {
> 
> MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
> MPI_Send (&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
> MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
> MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
> 
> }
> 
> MPI_Finalize ();
> 
> return (0);
> }
> 
> The result I get is
> 
> [cl2n007:19441] *** Process received signal ***
> [cl2n007:19441] Signal: Segmentation fault (11)
> [cl2n007:19441] Signal code: Address not mapped (1)
> [cl2n007:19441] Failing at address: 0x10
> [cl2n007:19441] [ 0] /lib64/libpthread.so.0 [0x306980ebe0]
> [cl2n007:19441] [ 1] 
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
>  [0x2af078563113]
> [cl2n007:19441] [ 2] 
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
>  [0x2af0785658a9]
> [cl2n007:19441] [ 3] 
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 
> [0x2af078565596]
> [cl2n007:19441] [ 4] 
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_class_initialize+0xaa)
>  [0x2af078582faa]
> [cl2n007:19441] [ 5] 
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
>  [0x2af07c3e1909]
> [cl2n007:19441] [ 6] /lib64/libpthread.so.0 [0x306980677d]
> [cl2n007:19441] [ 7] /lib64/libc.so.6(clone+0x6d) [0x3068cd325d]
> [cl2n007:19441] *** End of error message ***
> [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line[cl2n007:19434] 
> [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in 
> file util/nidmap.c at line 776
>  2342
> [cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file util/nidmap.c at line 776
> [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file ess_tm_module.c at line 310
> [cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end 
> of buffer in file base/odls_base_default_fns.c at line 2342
> 
> ...more of the same...
> 
> 
> $ ompi_info 
>  Package: Open MPI r...@jasper.westgrid.ca Distribution
> Open MPI: 1.6
>Open MPI SVN revision: r26429
>Open MPI release date: May 10, 2012
> Open RTE: 1.6
>Open RTE SVN revision: r26429
>Open RTE release date: May 10, 2012
> OPAL: 1.6
>OPAL SVN revision: r26429
>OPAL release date: May 10, 2012
>  MPI API: 2.1
> Ident string: 1.6
>   Prefix: /lustre/jasper/software/openmpi/openmpi-1.6-intel
>  Configured architecture: x86_64-unknown-linux-gnu
>   Configure host: jasper.westgrid.ca
>Configured by: root
>Configured on: Wed May 30 13:56:39 MDT 2012
>   Configure host: jasper.westgrid.ca
> Built by: root
>