On 05/19/2011 07:37 PM, Jeff Squyres wrote:
Other users have seen something similar but we have never been able
to reproduce it.  Is this only when using IB?

Actually no, when I use --mca btl tcp,sm,self it hangs in the same place.

If you use "mpirun --mca btl_openib_cpc_if_include rdmacm", does the
problem go away?

No, that doesn't help with the hang I'm seeing. Though it sounds like I'm hitting a different issue than Salvatore, fwiw.

-Marcus


On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only 
when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1, the collective 
hangs go away. I don't know what, if anything, the higher optimization levels 
buy you when compiling openmpi, so I'm not sure whether that's an acceptable 
workaround.
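For what it's worth, a sketch of how that workaround could be applied when rebuilding (the install prefix and compiler names here are placeholders for illustration, not the exact configure line I used):

```shell
# Rebuild openmpi 1.4.3 with the Intel compilers, forcing -O1 in all
# compiler flag sets to avoid the suspected optimizer-induced hang.
# Prefix and feature flags are site-specific placeholders -- adjust as needed.
./configure CC=icc CXX=icpc F77=ifort FC=ifort \
    CFLAGS="-O1" CXXFLAGS="-O1" FFLAGS="-O1" FCFLAGS="-O1" \
    --prefix=/opt/openmpi-1.4.3-intel-O1 \
    --with-openib --with-slurm
make -j8 && make install
```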

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 
5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a 
single iteration of Barrier to reproduce the hang, and it happens 100% of the 
time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the participating 
ranks have this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from 
[instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()
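In case it helps anyone reproduce this: backtraces like the two above can be grabbed from the hung ranks by attaching gdb to each process on the node (the PID placeholder below is whatever your resource manager or ps reports for the IMB-MPI1 rank; this is just the generic approach, not a claim about how the traces above were originally collected):

```shell
# Attach non-interactively to a hung rank, print its stack, and detach.
# Replace <pid> with the process ID of the IMB-MPI1 rank on that node.
gdb -batch -ex "bt" -p <pid>
```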

If I use more nodes I can get it to hang with 1 ppn (one rank per node), so 
that seems to rule out the sm btl (or interactions with it) as a culprit at least.

I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:
Dear all,

we have successfully built several versions of openmpi, from 1.2.8 to 1.4.3,
with Intel Composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on the
application -- IMB, xhpl, or user applications -- and on the number of
requested cores) above which the application hangs (a sort of deadlock).
Builds of openmpi with 'gcc' and 'pgi' do not show the same limits.
Are there any known incompatibilities between openmpi and this version of
the Intel compilers?

The characteristics of our computational infrastructure are:

Intel processors E7330, E5345, E5530 and E5620

CentOS 5.3, CentOS 5.5.

Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy

Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

