Ralph,
I can't get "opal_paffinity_alone" to work (see below). There is, however, an
"mpi_paffinity_alone" parameter that I tried, without any improvement.
Setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement, so OpenMPI is now down to 326 seconds (from the
previous 376). Still a lot more than ScaliMPI with 214 seconds.
Looking at the profile data, my gut feeling is that the performance
suffers due to the frequent calls to MPI_IPROBE. I will look at this and
count the number of calls, but it could easily be 10 times more calls to
MPI_IPROBE than to MPI_BSEND.
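Something along these lines should give me the counts - a sketch only, using
the standard PMPI profiling interface (assuming the library provides the
Fortran PMPI entry points); the module and counter names below are made up
and not part of the application:

! Sketch: intercept MPI_IPROBE through the PMPI profiling layer and
! count the calls. Module and variable names are illustrative only;
! an analogous wrapper would be written for MPI_BSEND.
module call_counters
  integer :: n_iprobe = 0
end module call_counters

subroutine MPI_IPROBE(source, tag, comm, flag, status, ierr)
  use call_counters
  implicit none
  include 'mpif.h'
  integer :: source, tag, comm, ierr
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag
  n_iprobe = n_iprobe + 1                                  ! bump the counter
  call PMPI_IPROBE(source, tag, comm, flag, status, ierr)  ! do the real probe
end subroutine MPI_IPROBE

This can be linked in front of the application without touching every call
site, and the counter printed just before MPI_FINALIZE.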
/Torgny
n70 462% ompi_info --param all all | grep opal
   MCA opal: parameter "opal_signal" (current value: "6,7,8,11", data source: default value)
   MCA opal: parameter "opal_set_max_sys_limits" (current value: "0", data source: default value)
   MCA opal: parameter "opal_event_include" (current value: "poll", data source: default value)
n70 463% ompi_info --param all all | grep paffinity
   MCA mpi: parameter "mpi_paffinity_alone" (current value: "0", data source: default value)
   MCA paffinity: parameter "paffinity_base_verbose" (current value: "0", data source: default value)
                  Verbosity level of the paffinity framework
   MCA paffinity: parameter "paffinity" (current value: <none>, data source: default value)
                  Default selection set of components for the paffinity framework (<none> means use all components that can be found)
   MCA paffinity: parameter "paffinity_linux_priority" (current value: "10", data source: default value)
                  Priority of the linux paffinity component
   MCA paffinity: information "paffinity_linux_plpa_version" (value: "1.2rc2", data source: default value)
Ralph Castain wrote:
Okay, one problem is fairly clear. As Terry indicated, you have to
tell us to bind or else you lose a lot of performance. Set -mca
opal_paffinity_alone 1 on your cmd line and it should make a
significant difference.
On Wed, Aug 5, 2009 at 8:10 AM, Torgny Faxen <fa...@nsc.liu.se> wrote:
Ralph,
I am running through a locally provided wrapper, but it translates to:
/software/mpi/openmpi/1.3b2/i101017/bin/mpirun -np 144 -npernode 8
-mca mpi_show_mca_params env,file
/nobackup/rossby11/faxen/RCO_scobi/src_161.openmpi/rco2.24pe
a) Upgrade: this will take some time, since it has to go through
the administrator; this is a production cluster.
b) -mca ... see output below.
c) I used exactly the same optimization flags for all three
versions (ScaliMPI, OpenMPI and IntelMPI), and this is Fortran so I
am using mpif90 :-)
Regards / Torgny
[n70:30299] ess=env (environment)
[n70:30299] orte_ess_jobid=482607105 (environment)
[n70:30299] orte_ess_vpid=0 (environment)
[n70:30299] mpi_yield_when_idle=0 (environment)
[n70:30299] mpi_show_mca_params=env,file (environment)
Ralph Castain wrote:
Could you send us the mpirun cmd line? I wonder if you are
missing some options that could help. Also, you might:
(a) upgrade to 1.3.3 - it looks like you are using some kind
of pre-release version
(b) add -mca mpi_show_mca_params env,file - this will cause
rank=0 to output what mca params it sees, and where they came from
(c) check that you built a non-debug version, and remembered
to compile your application with a -O3 flag - i.e., "mpicc -O3
...". Remember, OMPI does not automatically add optimization
flags to mpicc!
Thanks
Ralph
On Wed, Aug 5, 2009 at 7:15 AM, Torgny Faxen <fa...@nsc.liu.se> wrote:
Pasha,
no collectives are being used.
A simple grep in the code reveals the following MPI functions
being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE
where MPI_IPROBE is the clear winner in terms of number of
calls.
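For reference, the buffered-send side follows the standard
attach-a-buffer-then-MPI_BSEND recipe, roughly like this (a sketch only;
the message size, tag and variable names are made up, not taken from our
code):

! Sketch only: how MPI_BUFFER_ATTACH / MPI_BSEND are typically combined.
! Sizes and names are illustrative, not from the actual application.
program bsend_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000
  double precision :: work(n), recvd(n)
  character, allocatable :: sendbuf(:)
  integer :: packsize, bufsize, detached, rank, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  work = 1.0d0
  ! Reserve room for one outstanding MPI_BSEND plus bookkeeping overhead.
  call MPI_PACK_SIZE(n, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, packsize, ierr)
  bufsize = packsize + MPI_BSEND_OVERHEAD
  allocate(sendbuf(bufsize))
  call MPI_BUFFER_ATTACH(sendbuf, bufsize, ierr)
  ! Buffered send (here to self, so one rank is enough), matched by MPI_RECV.
  call MPI_BSEND(work, n, MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, ierr)
  call MPI_RECV(recvd, n, MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, status, ierr)
  call MPI_BUFFER_DETACH(sendbuf, detached, ierr)
  call MPI_FINALIZE(ierr)
end program bsend_sketch

(Built with mpif90 and run under mpirun like the real code.)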
/Torgny
Pavel Shamis (Pasha) wrote:
Do you know if the application uses some collective
operations?
Thanks
Pasha
Torgny Faxen wrote:
Hello,
we are seeing a large difference in performance for some
applications depending on which MPI is being used. Attached are
performance numbers and oprofile output (first 30 lines) from one
out of 14 nodes, from one application run using OpenMPI, IntelMPI
and Scali MPI respectively.
Scali MPI is faster than the other two MPIs by a factor of 1.6 and 1.75:
ScaliMPI: walltime for the whole application is 214 seconds
OpenMPI: walltime for the whole application is 376 seconds
Intel MPI: walltime for the whole application is 346 seconds
The application is running with the main send/receive commands being:
MPI_Bsend
MPI_Iprobe followed by MPI_Recv (in case there is a message)
Quite often MPI_Iprobe is being called just to check whether a certain
message is pending, roughly as in the sketch below.
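In rough pseudo-Fortran the receive side looks like this (a sketch only;
the tag, buffer size and datatype are illustrative, not from the actual
code):

! Sketch of the Iprobe-then-Recv polling described above.
subroutine poll_for_message(tag)
  implicit none
  include 'mpif.h'
  integer, intent(in) :: tag
  logical :: flag
  integer :: count, ierr
  integer :: status(MPI_STATUS_SIZE)
  double precision :: buf(100000)

  ! Non-blocking check: is a matching message pending?
  call MPI_IPROBE(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, flag, status, ierr)
  if (flag) then
     ! Yes: query its size and receive it from whoever sent it.
     call MPI_GET_COUNT(status, MPI_DOUBLE_PRECISION, count, ierr)
     call MPI_RECV(buf, count, MPI_DOUBLE_PRECISION, status(MPI_SOURCE), &
                   status(MPI_TAG), MPI_COMM_WORLD, status, ierr)
  end if
  ! If flag is .false. the routine returns immediately; this is the
  ! "just check whether a message is pending" case that happens so often.
end subroutine poll_for_message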
Any ideas on tuning tips, performance analysis or code modifications
to improve the OpenMPI performance? A lot of time is being spent in
"mca_btl_sm_component_progress", "btl_openib_component_progress" and
other internal routines.
The code is running on a cluster with 140 HP ProLiant DL160 G5 compute
servers, Infiniband interconnect, Intel Xeon E5462 processors. The
profiled application is using 144 cores on 18 nodes over Infiniband.
Regards / Torgny
=====================================================================================================================
OpenMPI 1.3b2
=====================================================================================================================
Walltime: 376 seconds
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
668288    22.2113  mca_btl_sm.so          rco2.24pe    mca_btl_sm_component_progress
441828    14.6846  rco2.24pe              rco2.24pe    step_
335929    11.1650  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
301446    10.0189  mca_btl_openib.so      rco2.24pe    btl_openib_component_progress
161033     5.3521  libopen-pal.so.0.0.0   rco2.24pe    opal_progress
157024     5.2189  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
 99526     3.3079  no-vmlinux             no-vmlinux   (no symbols)
 93887     3.1204  mca_btl_sm.so          rco2.24pe    opal_using_threads
 69979     2.3258  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_iprobe
 58895     1.9574  mca_bml_r2.so          rco2.24pe    mca_bml_r2_progress
 55095     1.8311  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_wild
 49286     1.6381  rco2.24pe              rco2.24pe    tracer_
 41946     1.3941  libintlc.so.5          rco2.24pe    __intel_new_memcpy
 40730     1.3537  rco2.24pe              rco2.24pe    scobi_
 36586     1.2160  rco2.24pe              rco2.24pe    state_
 20986     0.6975  rco2.24pe              rco2.24pe    diag_
 19321     0.6422  libmpi.so.0.0.0        rco2.24pe    PMPI_Unpack
 18552     0.6166  libmpi.so.0.0.0        rco2.24pe    PMPI_Iprobe
 17323     0.5757  rco2.24pe              rco2.24pe    clinic_
 16194     0.5382  rco2.24pe              rco2.24pe    k_epsi_
 15330     0.5095  libmpi.so.0.0.0        rco2.24pe    PMPI_Comm_f2c
 13778     0.4579  libmpi_f77.so.0.0.0    rco2.24pe    mpi_iprobe_f
 13241     0.4401  rco2.24pe              rco2.24pe    s_recv_
 12386     0.4117  rco2.24pe              rco2.24pe    growth_
 11699     0.3888  rco2.24pe              rco2.24pe    testnrecv_
 11268     0.3745  libmpi.so.0.0.0        rco2.24pe    mca_pml_base_recv_request_construct
 10971     0.3646  libmpi.so.0.0.0        rco2.24pe    ompi_convertor_unpack
 10034     0.3335  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_specific
 10003     0.3325  libimf.so              rco2.24pe    exp.L
  9375     0.3116  rco2.24pe              rco2.24pe    subbasin_
  8912     0.2962  libmpi_f77.so.0.0.0    rco2.24pe    mpi_unpack_f
=====================================================================================================================
Intel MPI, version 3.2.0.011
=====================================================================================================================
Walltime: 346 seconds
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
486712    17.7537  rco2                   rco2         step_
431941    15.7558  no-vmlinux             no-vmlinux   (no symbols)
212425     7.7486  libmpi.so.3.2          rco2         MPIDI_CH3U_Recvq_FU
188975     6.8932  libmpi.so.3.2          rco2         MPIDI_CH3I_RDSSM_Progress
172855     6.3052  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress
121472     4.4309  libmpi.so.3.2          rco2         MPIDI_CH3I_SHM_read_progress
 64492     2.3525  libc-2.5.so            rco2         sched_yield
 52372     1.9104  rco2                   rco2         tracer_
 48621     1.7735  libmpi.so.3.2          rco2         .plt
 45475     1.6588  libmpiif.so.3.2        rco2         pmpi_iprobe__
 44082     1.6080  libmpi.so.3.2          rco2         MPID_Iprobe
 42788     1.5608  libmpi.so.3.2          rco2         MPIDI_CH3_Stop_recv
 42754     1.5595  libpthread-2.5.so      rco2         pthread_mutex_lock
 42190     1.5390  libmpi.so.3.2          rco2         PMPI_Iprobe
 41577     1.5166  rco2                   rco2         scobi_
 40356     1.4721  libmpi.so.3.2          rco2         MPIDI_CH3_Start_recv
 38582     1.4073  libdaplcma.so.1.0.2    rco2         (no symbols)
 37545     1.3695  rco2                   rco2         state_
 35597     1.2985  libc-2.5.so            rco2         free
 34019     1.2409  libc-2.5.so            rco2         malloc
 31841     1.1615  rco2                   rco2         s_recv_
 30955     1.1291  libmpi.so.3.2          rco2         __I_MPI___intel_new_memcpy
 27876     1.0168  libc-2.5.so            rco2         _int_malloc
 26963     0.9835  rco2                   rco2         testnrecv_
 23005     0.8391  libpthread-2.5.so      rco2         __pthread_mutex_unlock_usercnt
 22290     0.8131  libmpi.so.3.2          rco2         MPID_Segment_manipulate
 22086     0.8056  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress_expected
 19146     0.6984  rco2                   rco2         diag_
 18250     0.6657  rco2                   rco2         clinic_
=====================================================================================================================
Scali MPI, version 3.13.10-59413
=====================================================================================================================
Walltime:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
484267    30.0664  rco2.24pe              rco2.24pe    step_
111949     6.9505  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
 73930     4.5900  libmpi.so              rco2.24pe    scafun_rq_handle_body
 57846     3.5914  libmpi.so              rco2.24pe    invert_decode_header
 55836     3.4667  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
 53703     3.3342  rco2.24pe              rco2.24pe    tracer_
 40934     2.5414  rco2.24pe              rco2.24pe    scobi_
 40244     2.4986  libmpi.so              rco2.24pe    scafun_request_probe_handler
 37399     2.3220  rco2.24pe              rco2.24pe    state_
 30455     1.8908  libmpi.so              rco2.24pe    invert_matchandprobe
 29707     1.8444  no-vmlinux             no-vmlinux   (no symbols)
 29147     1.8096  libmpi.so              rco2.24pe    FMPI_scafun_Iprobe
 27969     1.7365  libmpi.so              rco2.24pe    decode_8_u_64
 27475     1.7058  libmpi.so              rco2.24pe    scafun_rq_anysrc_fair_one
 25966     1.6121  libmpi.so              rco2.24pe    scafun_uxq_probe
 24380     1.5137  libc-2.5.so            rco2.24pe    memcpy
 22615     1.4041  libmpi.so              rco2.24pe    .plt
 21172     1.3145  rco2.24pe              rco2.24pe    diag_
 20716     1.2862  libc-2.5.so            rco2.24pe    memset
 18565     1.1526  libmpi.so              rco2.24pe    openib_wrapper_poll_cq
 18192     1.1295  rco2.24pe              rco2.24pe    clinic_
 17135     1.0638  libmpi.so              rco2.24pe    PMPI_Iprobe
 16685     1.0359  rco2.24pe              rco2.24pe    k_epsi_
 16236     1.0080  libmpi.so              rco2.24pe    PMPI_Unpack
 15563     0.9662  libmpi.so              rco2.24pe    scafun_r_rq_append
 14829     0.9207  libmpi.so              rco2.24pe    scafun_rq_test_finished
 13349     0.8288  rco2.24pe              rco2.24pe    s_recv_
 12490     0.7755  libmpi.so              rco2.24pe    flop_matchandprobe
 12427     0.7715  libibverbs.so.1.0.0    rco2.24pe    (no symbols)
 12272     0.7619  libmpi.so              rco2.24pe    scafun_rq_handle
 12146     0.7541  rco2.24pe              rco2.24pe    growth_
 10175     0.6317  libmpi.so              rco2.24pe    wrp2p_test_finished
  9888     0.6139  libimf.so              rco2.24pe    exp.L
  9179     0.5699  rco2.24pe              rco2.24pe    subbasin_
  9082     0.5639  rco2.24pe              rco2.24pe    testnrecv_
  8901     0.5526  libmpi.so              rco2.24pe    openib_wrapper_purge_requests
  7425     0.4610  rco2.24pe              rco2.24pe    scobimain_
  7378     0.4581  rco2.24pe              rco2.24pe    scobi_interface_
  6530     0.4054  rco2.24pe              rco2.24pe    setvbc_
  6471     0.4018  libfmpi.so             rco2.24pe    pmpi_iprobe
  6341     0.3937  rco2.24pe              rco2.24pe    snap_
--
---------------------------------------------------------
Torgny Faxén
National Supercomputer Center
Linköping University
S-581 83 Linköping
Sweden
Email:fa...@nsc.liu.se
Telephone: +46 13 285798 (office) +46 13 282535 (fax)
http://www.nsc.liu.se
---------------------------------------------------------