[OMPI users] openmpi-1.4.3 and pgi-11.6 segfault

2011-06-21 Thread Brock Palen
Has anyone else had issues building 1.4.3 with PGI 11.6?  When I do, I get a 
segfault:

./configure --prefix=/home/software/rhel5/openmpi-1.4.3/pgi-11.6 
--mandir=/home/software/rhel5/openmpi-1.4.3/pgi-11.6/man 
--with-tm=/usr/local/torque/ --with-openib --with-psm CC=pgcc CXX=pgCC FC=pgf90 
F77=pgf90


make[7]: Entering directory 
`/tmp/openmpi-1.4.3/ompi/contrib/vt/vt/tools/opari/tool'
source='handler.cc' object='handler.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
handler.o handler.cc
source='ompragma.cc' object='ompragma.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
ompragma.o ompragma.cc
source='ompragma_c.cc' object='ompragma_c.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
ompragma_c.o ompragma_c.cc
source='ompragma_f.cc' object='ompragma_f.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
ompragma_f.o ompragma_f.cc
source='ompregion.cc' object='ompregion.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
ompregion.o ompregion.cc
source='opari.cc' object='opari.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
opari.o opari.cc
source='process_c.cc' object='process_c.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
process_c.o process_c.cc
source='process_f.cc' object='process_f.o' libtool=no \
DEPDIR=.deps depmode=none /bin/sh ../../../depcomp \
pgCC -DHAVE_CONFIG_H -I. -I../../..   -D_REENTRANT  -O -DNDEBUG   -c -o 
process_f.o process_f.cc
pgCC-Fatal-/afs/engin.umich.edu/caen/rhel_5/pgi-11.6/linux86-64/11.6/bin/pgcpp1 
TERMINATED by signal 11
Arguments to 
/afs/engin.umich.edu/caen/rhel_5/pgi-11.6/linux86-64/11.6/bin/pgcpp1
/afs/engin.umich.edu/caen/rhel_5/pgi-11.6/linux86-64/11.6/bin/pgcpp1 --llalign 
-Dunix -D__unix -D__unix__ -Dlinux -D__linux -D__linux__ -D__NO_MATH_INLINES 
-D__x86_64__ -D__LONG_MAX__=9223372036854775807L '-D__SIZE_TYPE__=unsigned long 
int' '-D__PTRDIFF_TYPE__=long int' -D__THROW= -D__extension__= -D__amd64__ 
-D__SSE__ -D__MMX__ -D__SSE2__ -D__SSE3__ -D__SSSE3__ -D__PGI -I. -I../../.. 
-DHAVE_CONFIG_H -D_REENTRANT -DNDEBUG 
-I/afs/engin.umich.edu/caen/rhel_5/pgi-11.6/linux86-64/11.6/include/CC 
-I/afs/engin.umich.edu/caen/rhel_5/pgi-11.6/linux86-64/11.6/include 
-I/usr/local/include -I/usr/lib/gcc/x86_64-redhat-linux/4.1.2/include 
-I/usr/lib/gcc/x86_64-redhat-linux/4.1.2/include -I/usr/include --zc_eh 
--gnu_version=40102 -D__pgnu_vsn=40102 -q -o /tmp/pgCCD_kbx1IQBbml.il 
process_f.cc
make[7]: *** [process_f.o] Error 127
make[7]: Leaving directory 
`/tmp/openmpi-1.4.3/ompi/contrib/vt/vt/tools/opari/tool'
make[6]: *** [all-recursive] Error 1
make[6]: Leaving directory `/tmp/openmpi-1.4.3/ompi/contrib/vt/vt/tools/opari'
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory `/tmp/openmpi-1.4.3/ompi/contrib/vt/vt/tools'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/tmp/openmpi-1.4.3/ompi/contrib/vt/vt'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/tmp/openmpi-1.4.3/ompi/contrib/vt/vt'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/tmp/openmpi-1.4.3/ompi/contrib/vt'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/openmpi-1.4.3/ompi'
make: *** [all-recursive] Error 1
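
One possible workaround, assuming the segfault is confined to the bundled 
VampirTrace/OPARI contrib code that pgcpp1 is compiling above, is to leave that 
contrib package out of the build entirely, e.g.:

./configure --prefix=/home/software/rhel5/openmpi-1.4.3/pgi-11.6 
--mandir=/home/software/rhel5/openmpi-1.4.3/pgi-11.6/man 
--with-tm=/usr/local/torque/ --with-openib --with-psm CC=pgcc CXX=pgCC FC=pgf90 
F77=pgf90 --enable-contrib-no-build=vt

VampirTrace is only the bundled tracing tool, so skipping it does not affect the 
MPI library itself.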


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985






Re: [OMPI users] Problems on large clusters

2011-06-21 Thread Addepalli, Srirangam V
Hello Thorsten,
What type of IB interface do you have (QLogic)? I often run into a similar 
issue when running 256-core jobs. It mostly happens for me when I hit a node 
with IB issues, nothing related to Open MPI. If you are using QLogic PSM, try 
using a ping-pong example to check availability of all nodes.

Rangam
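
For illustration, a minimal MPI ping-pong check of the kind described above 
might look like the following sketch (a generic sketch, not the specific 
QLogic/PSM example referred to). Run it with one task per node, e.g. 
mpiexec -npernode 1 ./pingpong; a node with a bad IB port typically shows up 
as a hang or an error at the corresponding rank.

/* Minimal ping-pong connectivity check: rank 0 exchanges a small
 * message with every other rank in turn. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer;
    char buf[8] = "ping";
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Rank 0 pings every other rank and waits for the echo. */
        for (peer = 1; peer < size; peer++) {
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, peer, 1, MPI_COMM_WORLD, &status);
            printf("rank 0 <-> rank %d ok\n", peer);
        }
    } else {
        /* Every other rank just echoes the message back to rank 0. */
        MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}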




Re: [OMPI users] Problems on large clusters

2011-06-21 Thread Gilbert Grosdidier

Hello Thorsten,

 Could you please be a little bit more specific about the cluster itself?


 G.



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







[OMPI users] Problems on large clusters

2011-06-21 Thread Thorsten Schuett
Hi, 

I am running openmpi 1.5.3 on an IB cluster and I have problems starting jobs 
on larger node counts. With small numbers of tasks, it usually works, but now 
the startup has failed three times in a row using 255 nodes. I am using 255 nodes 
with one MPI task per node, and the mpiexec command looks as follows:

mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out

After ten minutes, I pulled a stack trace on all nodes and killed the job, 
because there was no progress. In the following, you will find the stack trace 
generated with "gdb thread apply all bt". The backtrace looks basically the same 
on all nodes. It seems to hang in MPI_Init.
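
One way to get more detail on where the startup is stuck (a generic suggestion, 
not derived from this particular trace) is to rerun with the openib BTL's verbose 
output turned up, e.g.:

mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 --mca btl_base_verbose 100 ./a.out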

Any help is appreciated,

Thorsten

Thread 3 (Thread 46914544122176 (LWP 28979)):
#0  0x2b6ee912d9a2 in select () from /lib64/libc.so.6
#1  0x2b6eeabd928d in service_thread_start (context=<value optimized out>) 
at btl_openib_fd.c:427
#2  0x2b6ee835e143 in start_thread () from /lib64/libpthread.so.0
#3  0x2b6ee9133b8d in clone () from /lib64/libc.so.6
#4  0x in ?? ()

Thread 2 (Thread 46916594338112 (LWP 28980)):
#0  0x2b6ee912b8b6 in poll () from /lib64/libc.so.6
#1  0x2b6eeabd7b8a in btl_openib_async_thread (async=<value optimized out>) at btl_openib_async.c:419
#2  0x2b6ee835e143 in start_thread () from /lib64/libpthread.so.0
#3  0x2b6ee9133b8d in clone () from /lib64/libc.so.6
#4  0x in ?? ()

Thread 1 (Thread 47755361533088 (LWP 28978)):
#0  0x2b6ee9133fa8 in epoll_wait () from /lib64/libc.so.6
#1  0x2b6ee87745db in epoll_dispatch (base=0xb79050, arg=0xb558c0, 
tv=<value optimized out>) at epoll.c:215
#2  0x2b6ee8773309 in opal_event_base_loop (base=0xb79050, flags=<value optimized out>) at event.c:838
#3  0x2b6ee875ee92 in opal_progress () at runtime/opal_progress.c:189
#4  0x39f1 in ?? ()
#5  0x2b6ee87979c9 in std::ios_base::Init::~Init () at 
../../.././libstdc++-v3/src/ios_init.cc:123
#6  0x7fffc32c8cc8 in ?? ()
#7  0x2b6ee9d20955 in orte_grpcomm_bad_get_proc_attr (proc=<value optimized out>, attribute_name=0x2b6ee88e5780 " \020322351n+", 
val=0x2b6ee875ee92, size=0x7fffc32c8cd0) at grpcomm_bad_module.c:500
#8  0x2b6ee86dd511 in ompi_modex_recv_key_value (key=<value optimized out>, source_proc=<value optimized out>, value=0xbb3a00, dtype=14 '\016') at 
runtime/ompi_module_exchange.c:125
#9  0x2b6ee86d7ea1 in ompi_proc_set_arch () at proc/proc.c:154
#10 0x2b6ee86db1b0 in ompi_mpi_init (argc=15, argv=0x7fffc32c92f8, 
requested=<value optimized out>, provided=0x7fffc32c917c) at 
runtime/ompi_mpi_init.c:699
#11 0x7fffc32c8e88 in ?? ()
#12 0x2b6ee77f8348 in ?? ()
#13 0x7fffc32c8e60 in ?? ()
#14 0x7fffc32c8e20 in ?? ()
#15 0x09efa994 in ?? ()
#16 0x in ?? ()


Re: [OMPI users] Building OpenMPI v. 1.4.3 in VS2008

2011-06-21 Thread Shiqing Fan


Hi Alan,

I was able to test it again on a machine that has VS2008 installed, and 
everything worked just fine for me. I looked into the generated config file 
(build_dir/opal/include/opal_config.h); the CMake build system didn't find 
stdint.h, but it still compiled.


So it was probably some other issue on your platform. It would be very 
helpful for me in figuring out the problem if you could provide more 
information, e.g. the configure log, compilation error messages, and so on.



Regards,
Shiqing

On 2011-06-10 8:34 PM, Alan Nichols wrote:


Hi Shiqing,

OK, I'll give this a try... however, after some Google searching in the 
aftermath of my previous attempt to build on VS2008, I realized that the 
file I'm missing on that platform is shipped with VS2010.


So I suspect that building on VS2010 will go smoothly, as you said. My 
problem is that my current effort is part of a much larger project 
that is being built on VS2008. On the one hand, I don't want to 
shift that larger code base from VS2008 to VS2010 (and fight the 
numerous problems that always follow an upheaval of that sort); on the 
other hand, I'm dubious about building my parallel support 
library on VS2010 and the rest of the code on VS2008.


Is there a way to do what I really want to do, which is build the 
openmpi source on VS2008?


Alan Nichols

AWR - STAAR

11520 N. Port Washington Rd.

Mequon, WI 53092

P: 1.262.240.0291 x 103

F: 1.262.240.0294

E: anich...@awrcorp.com 

http://www.awrcorp.com 

*From:* Shiqing Fan [mailto:f...@hlrs.de]
*Sent:* Thursday, June 09, 2011 6:43 PM
*To:* Open MPI Users
*Cc:* Alan Nichols
*Subject:* Re: [OMPI users] Building OpenMPI v. 1.4.3 in VS2008


Hi Alan,

It looks like a problem of using the wrong generator in the CMake GUI. I 
tested a freshly downloaded 1.4.3 again on my Win7 machine with VS2010, 
and everything worked well.


Please check that:
1.  a proper CMake generator is used.
2.  the CMAKE_BUILD_TYPE in the CMake GUI and the build type in VS are 
both Release.
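
For reference, assuming a command-line invocation rather than the GUI (in the 
GUI the same name is picked from the generator drop-down), the generator is 
selected along these lines:

cmake -G "Visual Studio 10" -DCMAKE_BUILD_TYPE=Release <path-to-openmpi-source>

with "Visual Studio 9 2008" as the generator name when targeting VS2008.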


If the error still happens, please send me the file name and line 
number where the error is triggered when compiling.


Regards,
Shiqing

On 2011-06-07 5:37 PM, Alan Nichols wrote:

Hello,

I'm currently trying to build OpenMPI v. 1.4.3 from source, in 
VS2008.  Platform is Win7, SP1 installed.  (I realize that this is 
possibly not an ideal approach, as v. 1.5.3 has installers for Windows 
binaries.  However, for compatibility with other programs I need to use 
v. 1.4.3 if at all possible; also, as I have many other libraries 
built under VS2008, I need to use the VS2008 compiler if at all possible.)


Following the README.WINDOWS file I found, I used CMake to build a 
Windows .sln file.  I accepted the default CMake settings, with the 
exception that I only created a Release build of OpenMPI.  Upon my 
first attempt to build the solution, I got an error about a missing 
file stdint.h.  I was able to fix this by including the stdint.h from 
VS2010.  However, I now get new errors referencing


__attribute__((__always_inline__))

__asm__ __volatile__("": : :"memory")

These look to me like Linux-specific problems -- is it even possible 
to do what I'm attempting, or are the code bases and compiler 
fundamentally at odds here?  If it is possible, can you explain where 
my error lies?


Thanks for your help,

Alan Nichols
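
For what it's worth, the two constructs quoted above are GCC extensions (a 
function attribute and inline assembly) rather than anything Linux-specific 
as such; code that also has to compile under MSVC typically hides them behind 
preprocessor guards. A minimal sketch of that technique (illustrative names, 
not the actual Open MPI source):

/* Guard GCC-only constructs so the same header builds with MSVC. */
#if defined(_MSC_VER)
#  include <intrin.h>
#  define MY_ALWAYS_INLINE      __forceinline
#  define my_compiler_barrier() _ReadWriteBarrier()   /* MSVC compiler barrier */
#else
#  define MY_ALWAYS_INLINE      inline __attribute__((__always_inline__))
#  define my_compiler_barrier() __asm__ __volatile__("" : : : "memory")
#endif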

  
  






--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de