Hi,
The debugger traces were captured while the different tasks were blocked.
Before the MPI_COMM_DUP, the color MPI_UNDEFINED is assigned to the
master process, and MPI_COMM_SPLIT constructs a new communicator that
does not contain the master.
The master process does not call the MPI_COMM_DUP routine, so it is not
blocked at that instruction but further on in the program, at a barrier
call. The master's behaviour is normal: it waits for the slaves, which
are all blocked in MPI_COMM_DUP.
Here is the MUMPS portion of code (in the zmumps_part1.F file) where the
slaves call MPI_COMM_DUP; id%PAR and MASTER are both initialized to 0
beforehand:
      CALL MPI_COMM_SIZE( id%COMM, id%NPROCS, IERR )
      IF ( id%PAR .eq. 0 ) THEN
         IF ( id%MYID .eq. MASTER ) THEN
            color = MPI_UNDEFINED
         ELSE
            color = 0
         END IF
         CALL MPI_COMM_SPLIT( id%COMM, color, 0,
     &                        id%COMM_NODES, IERR )
         id%NSLAVES = id%NPROCS - 1
      ELSE
         CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
         id%NSLAVES = id%NPROCS
      END IF
      IF ( id%PAR .ne. 0 .or. id%MYID .NE. MASTER ) THEN
         CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
      END IF
------
In our case (id%PAR = 0), only the second MPI_COMM_DUP call is executed,
and only by the slaves.
The MUMPS library and our program are compiled with Intel Fortran 12,
and I have also tested the -O1 option, with no more success.
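To check whether the hang comes from this communicator pattern itself
rather than from MUMPS, here is a minimal standalone sketch of the same
pattern (illustrative only, not taken from MUMPS or from our
application): rank 0 gets MPI_UNDEFINED in the split, the other ranks
duplicate the resulting slaves-only communicator, and all ranks then
meet at a barrier on MPI_COMM_WORLD.

   program comm_dup_repro
      ! Sketch of the MUMPS communicator pattern for PAR = 0: the master
      ! (rank 0) is excluded from the split, the slaves duplicate the
      ! slaves-only communicator, and every rank then synchronizes on
      ! the parent communicator.
      implicit none
      include 'mpif.h'
      integer :: myid, nprocs, color, ierr
      integer :: comm_nodes, comm_load

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      if (myid .eq. 0) then
         color = MPI_UNDEFINED    ! master: excluded, gets MPI_COMM_NULL
      else
         color = 0                ! slaves: members of the new communicator
      end if
      call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, 0, comm_nodes, ierr)

      if (myid .ne. 0) then
         ! Collective only over the slaves-only communicator.
         call MPI_COMM_DUP(comm_nodes, comm_load, ierr)
         call MPI_COMM_FREE(comm_load, ierr)
         call MPI_COMM_FREE(comm_nodes, ierr)
      end if

      ! All ranks, master included, meet here on the parent communicator.
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)

      if (myid .eq. 0) print *, 'pattern completed'
      call MPI_FINALIZE(ierr)
   end program comm_dup_repro

If this sketch also blocks in MPI_COMM_DUP with OpenMPI 1.4.1 on the
same nodes, the problem would be independent of MUMPS and of our
application.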
Françoise.
George Bosilca wrote:
On May 10, 2011, at 08:10, Tim Prince wrote:
On 5/10/2011 6:43 AM, francoise.r...@obs.ujf-grenoble.fr wrote:
Hi,
I compile a parallel program with OpenMPI 1.4.1 (built with the Intel
compilers 12 from the composerxe package). This program is linked
against the MUMPS library 4.9.2, compiled with the same compilers and
linked with the Intel MKL. The OS is Linux Debian.
There is no error while compiling or running the job, but the program
freezes inside a call to the "zmumps" routine, when the slave processes
call the MPI_COMM_DUP routine.
The program is executed on 2 nodes of 12 cores each (Westmere
processors) with the following command:
mpirun -np 24 --machinefile $OAR_NODE_FILE -mca plm_rsh_agent "oarsh"
--mca btl self,openib -x LD_LIBRARY_PATH ./prog
We have 12 processes running on each node. We submit the job with the
OAR batch scheduler (the $OAR_NODE_FILE variable and the "oarsh" command
are specific to this scheduler and usually work well with OpenMPI).
Via gdb, on the slaves, we can see that they are blocked in MPI_COMM_DUP:
Francoise,
Based on your traces the workers and the master are not doing the same MPI
call. The workers are blocked in an MPI_Comm_dup in sub_pbdirect_init.f90:44,
while the master is blocked in an MPI_Barrier in sub_pbdirect_init.f90:62. Can
you verify that the slaves and the master are calling the MPI_Barrier and the
MPI_Comm_dup in the same logical order?
george.
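To make that ordering explicit, the call sequence in sub_pbdirect_init
is roughly the following on every rank (a sketch only; the actual source
and the communicator used by the barrier are not shown in this thread):

   call zmumps(id)                  ! line 44: slaves enter MPI_COMM_DUP on the
                                    ! slaves-only communicator, the master returns
   call MPI_BARRIER(id%COMM, ierr)  ! line 62: presumably on the full communicator;
                                    ! the master waits here for the slaves

The gdb traces below correspond to these two calls.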
(gdb) where
#0 0x00002b32c1533113 in poll () from /lib/libc.so.6
#1 0x0000000000adf52c in poll_dispatch ()
#2 0x0000000000adcea3 in opal_event_loop ()
#3 0x0000000000ad69f9 in opal_progress ()
#4 0x0000000000a34b4e in mca_pml_ob1_recv ()
#5 0x00000000009b0768 in
ompi_coll_tuned_allreduce_intra_recursivedoubling ()
#6 0x00000000009ac829 in ompi_coll_tuned_allreduce_intra_dec_fixed ()
#7 0x000000000097e271 in ompi_comm_allreduce_intra ()
#8 0x000000000097dd06 in ompi_comm_nextcid ()
#9 0x000000000097be01 in ompi_comm_dup ()
#10 0x00000000009a0785 in PMPI_Comm_dup ()
#11 0x000000000097931d in pmpi_comm_dup__ ()
#12 0x0000000000644251 in zmumps (id=...) at zmumps_part1.F:144
#13 0x00000000004c0d03 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:44
#14 0x0000000000628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048
The master waits further on:
(gdb) where
#0 0x00002b9dc9f3e113 in poll () from /lib/libc.so.6
#1 0x0000000000adf52c in poll_dispatch ()
#2 0x0000000000adcea3 in opal_event_loop ()
#3 0x0000000000ad69f9 in opal_progress ()
#4 0x000000000098f294 in ompi_request_default_wait_all ()
#5 0x0000000000a06e56 in ompi_coll_tuned_sendrecv_actual ()
#6 0x00000000009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
#7 0x00000000009ac926 in ompi_coll_tuned_barrier_intra_dec_fixed ()
#8 0x00000000009a0b20 in PMPI_Barrier ()
#9 0x0000000000978c93 in pmpi_barrier__ ()
#10 0x00000000004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:62
#11 0x0000000000628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048
Remark:
The same code compiles and runs well with the Intel MPI library from the
same Intel package, on the same nodes.
Did you try compiling with equivalent options in each compiler? For example,
(supposing you had gcc 4.6)
gcc -O3 -funroll-loops --param max-unroll-times=2 -march=corei7
would be equivalent (as closely as I know) to
icc -fp-model source -msse4.2 -ansi-alias
As you should be aware, default settings in icc are more closely equivalent to
gcc -O3 -ffast-math -fno-cx-limited-range -funroll-loops --param
max-unroll-times=2 -fno-strict-aliasing
The options I suggest as an upper limit are probably more aggressive than most
people have used successfully with OpenMPI.
As to run-time MPI options, Intel MPI has affinity with Westmere awareness
turned on by default. I suppose testing without affinity settings,
particularly when banging against all hyperthreads, is a more severe test of
your application. Don't you get better results at 1 rank per core?
--
Tim Prince
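Regarding the affinity point above: one way to make the per-core binding
explicit with Open MPI 1.4, assuming the --bind-to-core and
--report-bindings options are available in this build, would for example
be:

mpirun -np 24 --machinefile $OAR_NODE_FILE -mca plm_rsh_agent "oarsh"
--mca btl self,openib --bind-to-core --report-bindings -x LD_LIBRARY_PATH ./prog

With 12 ranks per 12-core node this binds one rank per core and reports
the resulting placement, which makes it easy to compare runs with and
without explicit affinity.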
"To preserve the freedom of the human mind then and freedom of the press, every
spirit should be ready to devote itself to martyrdom; for as long as we may think as we
will, and speak as we think, the condition of man will proceed in improvement."
-- Thomas Jefferson, 1799