Ralph,
I can't get "opal_paffinity_alone" to work (see below). There is, however, an
"mpi_paffinity_alone" parameter that I tried, without any improvement.
Setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement, so OpenMPI is now down to 326 seconds (from the
previous 376). Still a lot more than ScaliMPI with 214 seconds.
Looking at the profile data, my gut feeling is that the performance
suffers due to the frequent calls to MPI_IPROBE. I will look at this and
count the number of calls, but it could easily be 10 times more calls to
MPI_IPROBE than to MPI_BSEND.
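Something along these lines should give me the counts - a sketch only, using
the standard PMPI profiling interface (assuming the library provides the
Fortran PMPI entry points); the module and counter names below are made up
and not part of the application:

! Sketch: intercept MPI_IPROBE through the PMPI profiling layer and
! count the calls. Module and variable names are illustrative only;
! an analogous wrapper would be written for MPI_BSEND.
module call_counters
  integer :: n_iprobe = 0
end module call_counters

subroutine MPI_IPROBE(source, tag, comm, flag, status, ierr)
  use call_counters
  implicit none
  include 'mpif.h'
  integer :: source, tag, comm, ierr
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag
  n_iprobe = n_iprobe + 1                                  ! bump the counter
  call PMPI_IPROBE(source, tag, comm, flag, status, ierr)  ! do the real probe
end subroutine MPI_IPROBE

This can be linked in front of the application without touching every call
site, and the counter printed just before MPI_FINALIZE.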
/Torgny
n70 462% ompi_info --param all all | grep opal
   MCA opal: parameter "opal_signal" (current value: "6,7,8,11", data source: default value)
   MCA opal: parameter "opal_set_max_sys_limits" (current value: "0", data source: default value)
   MCA opal: parameter "opal_event_include" (current value: "poll", data source: default value)
n70 463% ompi_info --param all all | grep paffinity
   MCA mpi: parameter "mpi_paffinity_alone" (current value: "0", data source: default value)
   MCA paffinity: parameter "paffinity_base_verbose" (current value: "0", data source: default value)
                  Verbosity level of the paffinity framework
   MCA paffinity: parameter "paffinity" (current value: <none>, data source: default value)
                  Default selection set of components for the paffinity framework (<none> means use all components that can be found)
   MCA paffinity: parameter "paffinity_linux_priority" (current value: "10", data source: default value)
                  Priority of the linux paffinity component
   MCA paffinity: information "paffinity_linux_plpa_version" (value: "1.2rc2", data source: default value)
Ralph Castain wrote:
Okay, one problem is fairly clear. As Terry indicated, you have to
tell us to bind or else you lose a lot of performance. Set -mca
opal_paffinity_alone 1 on your cmd line and it should make a
significant difference.
On Wed, Aug 5, 2009 at 8:10 AM, Torgny Faxen <fa...@nsc.liu.se> wrote:
Ralph,
I am running through a locally provided wrapper, but it translates to:
/software/mpi/openmpi/1.3b2/i101017/bin/mpirun -np 144 -npernode 8
-mca mpi_show_mca_params env,file
/nobackup/rossby11/faxen/RCO_scobi/src_161.openmpi/rco2.24pe
a) Upgrade: this will take some time, since it has to go through
the administrator; this is a production cluster.
b) -mca ... see output below.
c) I used exactly the same optimization flags for all three
versions (ScaliMPI, OpenMPI and IntelMPI), and this is Fortran so I
am using mpif90 :-)
Regards / Torgny
[n70:30299] ess=env (environment)
[n70:30299] orte_ess_jobid=482607105 (environment)
[n70:30299] orte_ess_vpid=0 (environment)
[n70:30299] mpi_yield_when_idle=0 (environment)
[n70:30299] mpi_show_mca_params=env,file (environment)
Ralph Castain wrote:
Could you send us the mpirun cmd line? I wonder if you are
missing some options that could help. Also, you might:
(a) upgrade to 1.3.3 - it looks like you are using some kind
of pre-release version
(b) add -mca mpi_show_mca_params env,file - this will cause
rank=0 to output what mca params it sees, and where they came from
(c) check that you built a non-debug version, and remembered
to compile your application with a -O3 flag - i.e., "mpicc -O3
...". Remember, OMPI does not automatically add optimization
flags to mpicc!
Thanks
Ralph
On Wed, Aug 5, 2009 at 7:15 AM, Torgny Faxen <fa...@nsc.liu.se> wrote:
Pasha,
no collectives are being used.
A simple grep in the code reveals the following MPI functions
being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE
where MPI_IPROBE is the clear winner in terms of number of
calls.
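For reference, the buffered-send side follows the standard
attach-a-buffer-then-MPI_BSEND recipe, roughly like this (a sketch only;
the message size, tag and variable names are made up, not taken from our
code):

! Sketch only: how MPI_BUFFER_ATTACH / MPI_BSEND are typically combined.
! Sizes and names are illustrative, not from the actual application.
program bsend_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1000
  double precision :: work(n), recvd(n)
  character, allocatable :: sendbuf(:)
  integer :: packsize, bufsize, detached, rank, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  work = 1.0d0
  ! Reserve room for one outstanding MPI_BSEND plus bookkeeping overhead.
  call MPI_PACK_SIZE(n, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, packsize, ierr)
  bufsize = packsize + MPI_BSEND_OVERHEAD
  allocate(sendbuf(bufsize))
  call MPI_BUFFER_ATTACH(sendbuf, bufsize, ierr)
  ! Buffered send (here to self, so one rank is enough), matched by MPI_RECV.
  call MPI_BSEND(work, n, MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, ierr)
  call MPI_RECV(recvd, n, MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, status, ierr)
  call MPI_BUFFER_DETACH(sendbuf, detached, ierr)
  call MPI_FINALIZE(ierr)
end program bsend_sketch

(Built with mpif90 and run under mpirun like the real code.)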
/Torgny
Pavel Shamis (Pasha) wrote:
Do you know if the application uses some collective
operations?
Thanks
Pasha
Torgny Faxen wrote:
Hello,
we are seeing a large difference in performance for some
applications depending on which MPI is being used. Attached are
performance numbers and oprofile output (first 30 lines) from one
out of 14 nodes, from one application run using OpenMPI, IntelMPI
and Scali MPI respectively.
Scali MPI is faster than the other two MPIs by a factor of 1.6 and 1.75:
ScaliMPI: walltime for the whole application is 214 seconds
OpenMPI: walltime for the whole application is 376 seconds
Intel MPI: walltime for the whole application is 346 seconds
The application is running with the main send/receive commands being:
MPI_Bsend
MPI_Iprobe followed by MPI_Recv (in case there is a message)
Quite often MPI_Iprobe is being called just to check whether a certain
message is pending, roughly as in the sketch below.
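In rough pseudo-Fortran the receive side looks like this (a sketch only;
the tag, buffer size and datatype are illustrative, not from the actual
code):

! Sketch of the Iprobe-then-Recv polling described above.
subroutine poll_for_message(tag)
  implicit none
  include 'mpif.h'
  integer, intent(in) :: tag
  logical :: flag
  integer :: count, ierr
  integer :: status(MPI_STATUS_SIZE)
  double precision :: buf(100000)

  ! Non-blocking check: is a matching message pending?
  call MPI_IPROBE(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, flag, status, ierr)
  if (flag) then
     ! Yes: query its size and receive it from whoever sent it.
     call MPI_GET_COUNT(status, MPI_DOUBLE_PRECISION, count, ierr)
     call MPI_RECV(buf, count, MPI_DOUBLE_PRECISION, status(MPI_SOURCE), &
                   status(MPI_TAG), MPI_COMM_WORLD, status, ierr)
  end if
  ! If flag is .false. the routine returns immediately; this is the
  ! "just check whether a message is pending" case that happens so often.
end subroutine poll_for_message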
Any ideas on tuning tips, performance analysis or code modifications
to improve the OpenMPI performance? A lot of time is being spent in
"mca_btl_sm_component_progress", "btl_openib_component_progress" and
other internal routines.
The code is running on a cluster with 140 HP ProLiant DL160 G5 compute
servers, Infiniband interconnect, Intel Xeon E5462 processors. The
profiled application is using 144 cores on 18 nodes over Infiniband.
Regards / Torgny
=====================================================================================================================
OpenMPI 1.3b2
=====================================================================================================================
Walltime: 376 seconds
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
668288    22.2113  mca_btl_sm.so          rco2.24pe    mca_btl_sm_component_progress
441828    14.6846  rco2.24pe              rco2.24pe    step_
335929    11.1650  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
301446    10.0189  mca_btl_openib.so      rco2.24pe    btl_openib_component_progress
161033     5.3521  libopen-pal.so.0.0.0   rco2.24pe    opal_progress
157024     5.2189  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
 99526     3.3079  no-vmlinux             no-vmlinux   (no symbols)
 93887     3.1204  mca_btl_sm.so          rco2.24pe    opal_using_threads
 69979     2.3258  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_iprobe
 58895     1.9574  mca_bml_r2.so          rco2.24pe    mca_bml_r2_progress
 55095     1.8311  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_wild
 49286     1.6381  rco2.24pe              rco2.24pe    tracer_
 41946     1.3941  libintlc.so.5          rco2.24pe    __intel_new_memcpy
 40730     1.3537  rco2.24pe              rco2.24pe    scobi_
 36586     1.2160  rco2.24pe              rco2.24pe    state_
 20986     0.6975  rco2.24pe              rco2.24pe    diag_
 19321     0.6422  libmpi.so.0.0.0        rco2.24pe    PMPI_Unpack
 18552     0.6166  libmpi.so.0.0.0        rco2.24pe    PMPI_Iprobe
 17323     0.5757  rco2.24pe              rco2.24pe    clinic_
 16194     0.5382  rco2.24pe              rco2.24pe    k_epsi_
 15330     0.5095  libmpi.so.0.0.0        rco2.24pe    PMPI_Comm_f2c
 13778     0.4579  libmpi_f77.so.0.0.0    rco2.24pe    mpi_iprobe_f
 13241     0.4401  rco2.24pe              rco2.24pe    s_recv_
 12386     0.4117  rco2.24pe              rco2.24pe    growth_
 11699     0.3888  rco2.24pe              rco2.24pe    testnrecv_
 11268     0.3745  libmpi.so.0.0.0        rco2.24pe    mca_pml_base_recv_request_construct
 10971     0.3646  libmpi.so.0.0.0        rco2.24pe    ompi_convertor_unpack
 10034     0.3335  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_specific
 10003     0.3325  libimf.so              rco2.24pe    exp.L
  9375     0.3116  rco2.24pe              rco2.24pe    subbasin_
  8912     0.2962  libmpi_f77.so.0.0.0    rco2.24pe    mpi_unpack_f
=====================================================================================================================
Intel MPI, version 3.2.0.011
=====================================================================================================================
Walltime: 346 seconds
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
486712    17.7537  rco2                   rco2         step_
431941    15.7558  no-vmlinux             no-vmlinux   (no symbols)
212425     7.7486  libmpi.so.3.2          rco2         MPIDI_CH3U_Recvq_FU
188975     6.8932  libmpi.so.3.2          rco2         MPIDI_CH3I_RDSSM_Progress
172855     6.3052  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress
121472     4.4309  libmpi.so.3.2          rco2         MPIDI_CH3I_SHM_read_progress
 64492     2.3525  libc-2.5.so            rco2         sched_yield
 52372     1.9104  rco2                   rco2         tracer_
 48621     1.7735  libmpi.so.3.2          rco2         .plt
 45475     1.6588  libmpiif.so.3.2        rco2         pmpi_iprobe__
 44082     1.6080  libmpi.so.3.2          rco2         MPID_Iprobe
 42788     1.5608  libmpi.so.3.2          rco2         MPIDI_CH3_Stop_recv
 42754     1.5595  libpthread-2.5.so      rco2         pthread_mutex_lock
 42190     1.5390  libmpi.so.3.2          rco2         PMPI_Iprobe
 41577     1.5166  rco2                   rco2         scobi_
 40356     1.4721  libmpi.so.3.2          rco2         MPIDI_CH3_Start_recv
 38582     1.4073  libdaplcma.so.1.0.2    rco2         (no symbols)
 37545     1.3695  rco2                   rco2         state_
 35597     1.2985  libc-2.5.so            rco2         free
 34019     1.2409  libc-2.5.so            rco2         malloc
 31841     1.1615  rco2                   rco2         s_recv_
 30955     1.1291  libmpi.so.3.2          rco2         __I_MPI___intel_new_memcpy
 27876     1.0168  libc-2.5.so            rco2         _int_malloc
 26963     0.9835  rco2                   rco2         testnrecv_
 23005     0.8391  libpthread-2.5.so      rco2         __pthread_mutex_unlock_usercnt
 22290     0.8131  libmpi.so.3.2          rco2         MPID_Segment_manipulate
 22086     0.8056  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress_expected
 19146     0.6984  rco2                   rco2         diag_
 18250     0.6657  rco2                   rco2         clinic_
=====================================================================================================================
Scali MPI, version 3.13.10-59413
=====================================================================================================================
Walltime:
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples   %        image name             app name     symbol name
484267    30.0664  rco2.24pe              rco2.24pe    step_
111949     6.9505  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
 73930     4.5900  libmpi.so              rco2.24pe    scafun_rq_handle_body
 57846     3.5914  libmpi.so              rco2.24pe    invert_decode_header
 55836     3.4667  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
 53703     3.3342  rco2.24pe              rco2.24pe    tracer_
 40934     2.5414  rco2.24pe              rco2.24pe    scobi_
 40244     2.4986  libmpi.so              rco2.24pe    scafun_request_probe_handler
 37399     2.3220  rco2.24pe              rco2.24pe    state_
 30455     1.8908  libmpi.so              rco2.24pe    invert_matchandprobe
 29707     1.8444  no-vmlinux             no-vmlinux   (no symbols)
 29147     1.8096  libmpi.so              rco2.24pe    FMPI_scafun_Iprobe
 27969     1.7365  libmpi.so              rco2.24pe    decode_8_u_64
 27475     1.7058  libmpi.so              rco2.24pe    scafun_rq_anysrc_fair_one
 25966     1.6121  libmpi.so              rco2.24pe    scafun_uxq_probe
 24380     1.5137  libc-2.5.so            rco2.24pe    memcpy
 22615     1.4041  libmpi.so              rco2.24pe    .plt
 21172     1.3145  rco2.24pe              rco2.24pe    diag_
 20716     1.2862  libc-2.5.so            rco2.24pe    memset
 18565     1.1526  libmpi.so              rco2.24pe    openib_wrapper_poll_cq
 18192     1.1295  rco2.24pe              rco2.24pe    clinic_
 17135     1.0638  libmpi.so              rco2.24pe    PMPI_Iprobe
 16685     1.0359  rco2.24pe              rco2.24pe    k_epsi_
 16236     1.0080  libmpi.so              rco2.24pe    PMPI_Unpack
 15563     0.9662  libmpi.so              rco2.24pe    scafun_r_rq_append
 14829     0.9207  libmpi.so              rco2.24pe    scafun_rq_test_finished
 13349     0.8288  rco2.24pe              rco2.24pe    s_recv_
 12490     0.7755  libmpi.so              rco2.24pe    flop_matchandprobe
 12427     0.7715  libibverbs.so.1.0.0    rco2.24pe    (no symbols)
 12272     0.7619  libmpi.so              rco2.24pe    scafun_rq_handle
 12146     0.7541  rco2.24pe              rco2.24pe    growth_
 10175     0.6317  libmpi.so              rco2.24pe    wrp2p_test_finished
  9888     0.6139  libimf.so              rco2.24pe    exp.L
  9179     0.5699  rco2.24pe              rco2.24pe    subbasin_
  9082     0.5639  rco2.24pe              rco2.24pe    testnrecv_
  8901     0.5526  libmpi.so              rco2.24pe    openib_wrapper_purge_requests
  7425     0.4610  rco2.24pe              rco2.24pe    scobimain_
  7378     0.4581  rco2.24pe              rco2.24pe    scobi_interface_
  6530     0.4054  rco2.24pe              rco2.24pe    setvbc_
  6471     0.4018  libfmpi.so             rco2.24pe    pmpi_iprobe
  6341     0.3937  rco2.24pe              rco2.24pe    snap_
--
---------------------------------------------------------
Torgny Faxén
National Supercomputer Center
Linköping University
S-581 83 Linköping
Sweden
Email:fa...@nsc.liu.se
Telephone: +46 13 285798 (office) +46 13 282535 (fax)
http://www.nsc.liu.se
---------------------------------------------------------