Could you send us the mpirun cmd line? I wonder if you are missing some options that could help. Also, you might:

(a) upgrade to 1.3.3 - it looks like you are using some kind of pre-release version

(b) add -mca mpi_show_mca_params env,file - this will cause rank=0 to output what mca params it sees, and where they came from

(c) check that you built a non-debug version, and remembered to compile your application with a -O3 flag - i.e., "mpicc -O3 ...". Remember, OMPI does not automatically add optimization flags to mpicc!
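For example, the launch and build lines might look something like this (illustrative only - I don't know your actual hostfile or launch options; the binary name is taken from your profile output):

    mpirun -np 144 --hostfile myhosts -mca mpi_show_mca_params env,file ./rco2.24pe

and for (c), make sure every object file is built with optimization, e.g. "mpicc -O3 ..." (or "mpif90 -O3 ..." for the Fortran sources).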
Thanks
Ralph


On Wed, Aug 5, 2009 at 7:15 AM, Torgny Faxen <fa...@nsc.liu.se> wrote:

> Pasha,
> no collectives are being used.
>
> A simple grep in the code reveals the following MPI functions being used:
> MPI_Init
> MPI_wtime
> MPI_COMM_RANK
> MPI_COMM_SIZE
> MPI_BUFFER_ATTACH
> MPI_BSEND
> MPI_PACK
> MPI_UNPACK
> MPI_PROBE
> MPI_GET_COUNT
> MPI_RECV
> MPI_IPROBE
> MPI_FINALIZE
>
> where MPI_IPROBE is the clear winner in terms of number of calls.
>
> /Torgny
>
>
> Pavel Shamis (Pasha) wrote:
>
>> Do you know if the application uses some collective operations?
>>
>> Thanks,
>> Pasha
>>
>> Torgny Faxen wrote:
>>
>>> Hello,
>>> we are seeing a large difference in performance for some applications
>>> depending on what MPI is being used.
>>>
>>> Attached are performance numbers and oprofile output (first 30 lines)
>>> from one out of 14 nodes from one application run using OpenMPI, Intel MPI
>>> and Scali MPI respectively.
>>>
>>> Scali MPI is faster than the other two MPIs by factors of 1.6 and 1.75:
>>>
>>> ScaliMPI: walltime for the whole application is 214 seconds
>>> OpenMPI: walltime for the whole application is 376 seconds
>>> Intel MPI: walltime for the whole application is 346 seconds.
>>>
>>> The application is running with the main send/receive calls being
>>> MPI_Bsend, and MPI_Iprobe followed by MPI_Recv (in case there is a
>>> message). Quite often MPI_Iprobe is called just to check whether a
>>> certain message is pending.
>>>
>>> Any ideas on tuning tips, performance analysis, or code modifications to
>>> improve the OpenMPI performance? A lot of time is being spent in
>>> "mca_btl_sm_component_progress", "btl_openib_component_progress" and other
>>> internal routines.
>>>
>>> The code is running on a cluster with 140 HP ProLiant DL160 G5 compute
>>> servers. Infiniband interconnect. Intel Xeon E5462 processors. The
>>> profiled application is using 144 cores on 18 nodes over Infiniband.
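From the description above, the receive side is essentially a polling loop like the following (a rough C sketch, not the actual code - the application is Fortran and also uses MPI_Pack/MPI_Unpack, so names, sizes and tags here are made up purely for illustration):

    /* Bsend + Iprobe/Recv polling pattern, as described above (sketch only). */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024   /* assumed message length, for illustration */

    /* Poll once for any pending message and receive it if one is there.
     * Returns 1 if a message was received, 0 otherwise. "buf" is assumed
     * large enough for any incoming message in this sketch. */
    static int poll_and_receive(double *buf)
    {
        int flag = 0, count;
        MPI_Status status;

        /* This is the call the application spins on; the time it drives in
         * the MPI progress engine is what dominates the OpenMPI profile below. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        return flag;
    }

    int main(int argc, char **argv)
    {
        double msg[N], work[N];
        int rank, size, i, got = 0;
        int bsend_size = N * sizeof(double) + MPI_BSEND_OVERHEAD;
        void *bsend_buf = malloc(bsend_size);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < N; i++)
            msg[i] = (double)rank;

        /* Buffered sends go through a user-attached buffer. */
        MPI_Buffer_attach(bsend_buf, bsend_size);
        MPI_Bsend(msg, N, MPI_DOUBLE, (rank + 1) % size, 0, MPI_COMM_WORLD);

        /* The real code interleaves computation with this kind of polling;
         * here we simply spin until the one expected message has arrived. */
        while (!got)
            got = poll_and_receive(work);

        MPI_Buffer_detach(&bsend_buf, &bsend_size);
        MPI_Finalize();
        free(bsend_buf);
        return 0;
    }

If most MPI_Iprobe calls return with no message, each one still pays for a pass through the progress engine, which is consistent with the time shown under mca_btl_sm_component_progress and btl_openib_component_progress in the profile below.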
>>>
>>> Regards / Torgny
>>>
>>> ===========================================================================
>>> OpenMPI 1.3b2
>>> ===========================================================================
>>>
>>> Walltime: 376 seconds
>>>
>>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>>> Profiling through timer interrupt
>>> samples  %        image name            app name     symbol name
>>> 668288   22.2113  mca_btl_sm.so         rco2.24pe    mca_btl_sm_component_progress
>>> 441828   14.6846  rco2.24pe             rco2.24pe    step_
>>> 335929   11.1650  libmlx4-rdmav2.so     rco2.24pe    (no symbols)
>>> 301446   10.0189  mca_btl_openib.so     rco2.24pe    btl_openib_component_progress
>>> 161033    5.3521  libopen-pal.so.0.0.0  rco2.24pe    opal_progress
>>> 157024    5.2189  libpthread-2.5.so     rco2.24pe    pthread_spin_lock
>>> 99526     3.3079  no-vmlinux            no-vmlinux   (no symbols)
>>> 93887     3.1204  mca_btl_sm.so         rco2.24pe    opal_using_threads
>>> 69979     2.3258  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_iprobe
>>> 58895     1.9574  mca_bml_r2.so         rco2.24pe    mca_bml_r2_progress
>>> 55095     1.8311  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_recv_request_match_wild
>>> 49286     1.6381  rco2.24pe             rco2.24pe    tracer_
>>> 41946     1.3941  libintlc.so.5         rco2.24pe    __intel_new_memcpy
>>> 40730     1.3537  rco2.24pe             rco2.24pe    scobi_
>>> 36586     1.2160  rco2.24pe             rco2.24pe    state_
>>> 20986     0.6975  rco2.24pe             rco2.24pe    diag_
>>> 19321     0.6422  libmpi.so.0.0.0       rco2.24pe    PMPI_Unpack
>>> 18552     0.6166  libmpi.so.0.0.0       rco2.24pe    PMPI_Iprobe
>>> 17323     0.5757  rco2.24pe             rco2.24pe    clinic_
>>> 16194     0.5382  rco2.24pe             rco2.24pe    k_epsi_
>>> 15330     0.5095  libmpi.so.0.0.0       rco2.24pe    PMPI_Comm_f2c
>>> 13778     0.4579  libmpi_f77.so.0.0.0   rco2.24pe    mpi_iprobe_f
>>> 13241     0.4401  rco2.24pe             rco2.24pe    s_recv_
>>> 12386     0.4117  rco2.24pe             rco2.24pe    growth_
>>> 11699     0.3888  rco2.24pe             rco2.24pe    testnrecv_
>>> 11268     0.3745  libmpi.so.0.0.0       rco2.24pe    mca_pml_base_recv_request_construct
>>> 10971     0.3646  libmpi.so.0.0.0       rco2.24pe    ompi_convertor_unpack
>>> 10034     0.3335  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_recv_request_match_specific
>>> 10003     0.3325  libimf.so             rco2.24pe    exp.L
>>> 9375      0.3116  rco2.24pe             rco2.24pe    subbasin_
>>> 8912      0.2962  libmpi_f77.so.0.0.0   rco2.24pe    mpi_unpack_f
>>>
>>> ===========================================================================
>>> Intel MPI, version 3.2.0.011
>>> ===========================================================================
>>>
>>> Walltime: 346 seconds
>>>
>>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>>> Profiling through timer interrupt
>>> samples  %        image name            app name     symbol name
>>> 486712   17.7537  rco2                  rco2         step_
>>> 431941   15.7558  no-vmlinux            no-vmlinux   (no symbols)
>>> 212425    7.7486  libmpi.so.3.2         rco2         MPIDI_CH3U_Recvq_FU
>>> 188975    6.8932  libmpi.so.3.2         rco2         MPIDI_CH3I_RDSSM_Progress
>>> 172855    6.3052  libmpi.so.3.2         rco2         MPIDI_CH3I_read_progress
>>> 121472    4.4309  libmpi.so.3.2         rco2         MPIDI_CH3I_SHM_read_progress
>>> 64492     2.3525  libc-2.5.so           rco2         sched_yield
>>> 52372     1.9104  rco2                  rco2         tracer_
>>> 48621     1.7735  libmpi.so.3.2         rco2         .plt
>>> 45475     1.6588  libmpiif.so.3.2       rco2         pmpi_iprobe__
>>> 44082     1.6080  libmpi.so.3.2         rco2         MPID_Iprobe
>>> 42788     1.5608  libmpi.so.3.2         rco2         MPIDI_CH3_Stop_recv
>>> 42754     1.5595  libpthread-2.5.so     rco2         pthread_mutex_lock
>>> 42190     1.5390  libmpi.so.3.2         rco2         PMPI_Iprobe
>>> 41577     1.5166  rco2                  rco2         scobi_
>>> 40356     1.4721  libmpi.so.3.2         rco2         MPIDI_CH3_Start_recv
>>> 38582     1.4073  libdaplcma.so.1.0.2   rco2         (no symbols)
>>> 37545     1.3695  rco2                  rco2         state_
>>> 35597     1.2985  libc-2.5.so           rco2         free
>>> 34019     1.2409  libc-2.5.so           rco2         malloc
>>> 31841     1.1615  rco2                  rco2         s_recv_
>>> 30955     1.1291  libmpi.so.3.2         rco2         __I_MPI___intel_new_memcpy
>>> 27876     1.0168  libc-2.5.so           rco2         _int_malloc
>>> 26963     0.9835  rco2                  rco2         testnrecv_
>>> 23005     0.8391  libpthread-2.5.so     rco2         __pthread_mutex_unlock_usercnt
>>> 22290     0.8131  libmpi.so.3.2         rco2         MPID_Segment_manipulate
>>> 22086     0.8056  libmpi.so.3.2         rco2         MPIDI_CH3I_read_progress_expected
>>> 19146     0.6984  rco2                  rco2         diag_
>>> 18250     0.6657  rco2                  rco2         clinic_
>>>
>>> ===========================================================================
>>> Scali MPI, version 3.13.10-59413
>>> ===========================================================================
>>>
>>> Walltime:
>>>
>>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>>> Profiling through timer interrupt
>>> samples  %        image name            app name     symbol name
>>> 484267   30.0664  rco2.24pe             rco2.24pe    step_
>>> 111949    6.9505  libmlx4-rdmav2.so     rco2.24pe    (no symbols)
>>> 73930     4.5900  libmpi.so             rco2.24pe    scafun_rq_handle_body
>>> 57846     3.5914  libmpi.so             rco2.24pe    invert_decode_header
>>> 55836     3.4667  libpthread-2.5.so     rco2.24pe    pthread_spin_lock
>>> 53703     3.3342  rco2.24pe             rco2.24pe    tracer_
>>> 40934     2.5414  rco2.24pe             rco2.24pe    scobi_
>>> 40244     2.4986  libmpi.so             rco2.24pe    scafun_request_probe_handler
>>> 37399     2.3220  rco2.24pe             rco2.24pe    state_
>>> 30455     1.8908  libmpi.so             rco2.24pe    invert_matchandprobe
>>> 29707     1.8444  no-vmlinux            no-vmlinux   (no symbols)
>>> 29147     1.8096  libmpi.so             rco2.24pe    FMPI_scafun_Iprobe
>>> 27969     1.7365  libmpi.so             rco2.24pe    decode_8_u_64
>>> 27475     1.7058  libmpi.so             rco2.24pe    scafun_rq_anysrc_fair_one
>>> 25966     1.6121  libmpi.so             rco2.24pe    scafun_uxq_probe
>>> 24380     1.5137  libc-2.5.so           rco2.24pe    memcpy
>>> 22615     1.4041  libmpi.so             rco2.24pe    .plt
>>> 21172     1.3145  rco2.24pe             rco2.24pe    diag_
>>> 20716     1.2862  libc-2.5.so           rco2.24pe    memset
>>> 18565     1.1526  libmpi.so             rco2.24pe    openib_wrapper_poll_cq
>>> 18192     1.1295  rco2.24pe             rco2.24pe    clinic_
>>> 17135     1.0638  libmpi.so             rco2.24pe    PMPI_Iprobe
>>> 16685     1.0359  rco2.24pe             rco2.24pe    k_epsi_
>>> 16236     1.0080  libmpi.so             rco2.24pe    PMPI_Unpack
>>> 15563     0.9662  libmpi.so             rco2.24pe    scafun_r_rq_append
>>> 14829     0.9207  libmpi.so             rco2.24pe    scafun_rq_test_finished
>>> 13349     0.8288  rco2.24pe             rco2.24pe    s_recv_
>>> 12490     0.7755  libmpi.so             rco2.24pe    flop_matchandprobe
>>> 12427     0.7715  libibverbs.so.1.0.0   rco2.24pe    (no symbols)
>>> 12272     0.7619  libmpi.so             rco2.24pe    scafun_rq_handle
>>> 12146     0.7541  rco2.24pe             rco2.24pe    growth_
>>> 10175     0.6317  libmpi.so             rco2.24pe    wrp2p_test_finished
>>> 9888      0.6139  libimf.so             rco2.24pe    exp.L
>>> 9179      0.5699  rco2.24pe             rco2.24pe    subbasin_
>>> 9082      0.5639  rco2.24pe             rco2.24pe    testnrecv_
>>> 8901      0.5526  libmpi.so             rco2.24pe    openib_wrapper_purge_requests
>>> 7425      0.4610  rco2.24pe             rco2.24pe    scobimain_
>>> 7378      0.4581  rco2.24pe             rco2.24pe    scobi_interface_
>>> 6530      0.4054  rco2.24pe             rco2.24pe    setvbc_
>>> 6471      0.4018  libfmpi.so            rco2.24pe    pmpi_iprobe
>>> 6341      0.3937  rco2.24pe             rco2.24pe    snap_
>
> --
> ---------------------------------------------------------
> Torgny Faxén
> National Supercomputer Center
> Linköping University
> S-581 83 Linköping
> Sweden
>
> Email: fa...@nsc.liu.se
> Telephone: +46 13 285798 (office), +46 13 282535 (fax)
> http://www.nsc.liu.se
> ---------------------------------------------------------