fairly frequently, but not every time, when trying to run xhpl on a new
machine i'm bumping into the error below.  it happens with a single node
or with multiple nodes:

node1 selected pml ob1, but peer on node1 selected pml ucx

if i rerun the exact same command a few minutes later, it works fine.
the machine is new and i'm the only one using it, so there are no user
conflicts.

the software stack is

slurm 21.8.2.1
ompi 4.1.1
pmix 3.2.3
ucx 1.9.0

the hardware is HPE w/ mellanox edr cards (but i doubt that matters)

any thoughts?
