It's the public source.  The one I'm testing with is the latest internal version.  I'm going to cc Pete Mendygral and Julius Donnert on this, as they may be able to provide you with the version I'm using (it is not ready for public use).

    I must say, this is eerily apropos. I sent a request for Wombat
    just last week, as I was planning to have my group start looking
    at the performance of UCX OSC on IB. We are most interested in
    ensuring UCX OSC MT performs well on Wombat. Is the Bitbucket
    repository you're referencing the source code? Can we build and
    run it?



        I forgot to mention that this OpenMPI 4.0.1 was built against
        UCX 1.5.1 and has not been rebuilt against 1.6.0. When we
        upgraded to 1.6.0 we swapped the UCX version without
        recompiling, and everything seemed to work for OpenMPI (at
        least for normal rank-level MPI; we had to upgrade UCX to
        get MPI_THREAD_MULTIPLE to work at all).

        Sure.  The code I'm using is the latest version of Wombat
        ( ,
        I'm using an unreleased updated version as I know the devs). 
        I'm using OMP_NUM_THREADS=12 and the command line is:

        mpirun -np 16 --hostfile hosts ./wombat

        Where the host file lists 4 machines, so 4 ranks per machine
        and 12 threads per rank.  Each node has 48 Intel Cascade Lake
        cores. I've also tried using the Slurm scheduler version
        which is:

        srun -n 16 -c 12 --mpi=pmix ./wombat

        Which also hangs.  It works if I constrain the run to one or
        two nodes, but anything larger hangs.  As for network hardware:

        [root@holy7c02101 ~]# ibstat
        CA 'mlx5_0'
                CA type: MT4119
                Number of ports: 1
                Firmware version: 16.25.6000
                Hardware version: 0
                Node GUID: 0xb8599f0300158f20
                System image GUID: 0xb8599f0300158f20
                Port 1:
                        State: Active
                        Physical state: LinkUp
                        Rate: 100
                        Base lid: 808
                        LMC: 1
                        SM lid: 584
                        Capability mask: 0x2651e848
                        Port GUID: 0xb8599f0300158f20
                        Link layer: InfiniBand

        [root@holy7c02101 ~]# lspci | grep Mellanox
        58:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
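
        For reference, the launch geometry described above (16 ranks
        over 4 nodes, 12 threads per rank, 48 cores per node) can be
        sanity-checked with a few lines of arithmetic.  This is only
        a minimal sketch: the function and variable names are
        illustrative, and it assumes one thread per core.

```python
# Illustrative check of the rank/thread layout; names are hypothetical.
def layout(ranks, nodes, threads_per_rank, cores_per_node):
    """Return ranks per node, cores used per node, and an oversubscription flag."""
    if ranks % nodes != 0:
        raise ValueError("ranks do not divide evenly across nodes")
    ranks_per_node = ranks // nodes
    cores_used = ranks_per_node * threads_per_rank  # assumes one thread per core
    return {
        "ranks_per_node": ranks_per_node,
        "cores_used_per_node": cores_used,
        "oversubscribed": cores_used > cores_per_node,
    }

# Values taken from the message: -np 16, 4 hosts, 12 threads, 48 cores per node.
result = layout(ranks=16, nodes=4, threads_per_rank=12, cores_per_node=48)
print(result)
```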

        As for the IB RDMA kernel stack, we are using the default
        drivers that come with CentOS 7.6.1810, which is rdma-core
        17.2-3.

        I will note that I successfully ran an old version of Wombat
        on all 30,000 cores of this system using OpenMPI 3.1.3 and
        regular IB Verbs with no problem earlier this week, though
        that was pure MPI ranks with no threads.  Nonetheless the
        fabric itself is healthy and in good shape.  It seems to be
        this edge case using the latest OpenMPI with UCX and threads
        that is causing the hangs.  To be sure, the latest version
        of Wombat (and, I believe, the public version as well) uses
        many of the state-of-the-art MPI RMA direct calls, so it's
        definitely pushing the envelope in ways our typical user base
        here will not.  Still, it would be good to iron out this kink
        so that if users do hit it we have a solution.  As noted, UCX is
        very new to us and thus it is entirely possible that we are
        missing something in its interaction with OpenMPI.  Our MPI
        is compiled thusly:

        I will note that this was built against the default version
        of UCX that comes with EPEL (1.5.1).  We only built 1.6.0
        because the version provided by EPEL was not built with MT
        enabled, which seems strange to me as I don't see any reason
        not to build with MT enabled.  Anyway, that's the deeper
        context.

        Can you provide a repro and command line, please? Also, what
        network hardware are you using?


            I have a code using MPI_THREAD_MULTIPLE along with MPI-RMA
            that I'm running with OpenMPI 4.0.1.  Since 4.0.1 requires
            UCX, I have it installed with MT on (1.6.0 build).  The
            thing is that the code keeps stalling out when I go above
            a couple of nodes.  UCX is new to our environment, as
            previously we have just used the regular IB Verbs with no
            problem.  My guess is that there is either some option in
            OpenMPI I am missing or some variable in UCX I am not
            setting.  Any insight on what could be causing the stalls?

