On 26/10/17 22:42, Chris Samuel wrote:
> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4
> and 5 cards using RoCE (the mlx5 driver). The node just locks up hard
> with no OOPS or other diagnostics and has to be power cycled.
On Friday, 27 October 2017 8:58:02 AM AEDT Lance Wilson wrote:
> We are running CX4 cards and have had some issues as well. Which version/s
> of openmpi are they running?
This is with OMPI 1.10.x, 2.0.2 and 3.0.0.
Unfortunately only OMPI 3.0.0 seems compatible with their Slurm install.
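As a quick way to compare builds, ompi_info can show whether a given Open-MPI install was built with Slurm support; a rough sketch of the check (component names vary between 1.10.x, 2.0.x and 3.0.0):

```shell
# List the Slurm-related components an Open-MPI build ships with
# (e.g. plm/ras slurm); exact output depends on the OMPI version.
ompi_info | grep -i slurm
```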
On Friday, 27 October 2017 3:30:29 AM AEDT Ryan Novosielski wrote:
> Where is this driver from? OS, or OFED, or?
OFED 4.1, sorry.
> We use primarily MVAPICH2 but I would be curious to try to duplicate this on
> our mlx5 equipment.
>
> What model cards do you have?
These are MT27710 and MT27800.
Hi Chris,
We are running CX4 cards and have had some issues as well. Which version/s
of openmpi are they running?
If you follow the instructions from Mellanox and run with yalla and mxm,
that works(ish) on openmpi 1.10.3, including setting the appropriate
environment variables or config file.
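For anyone trying to reproduce that setup, the Mellanox-style yalla/MXM launch looks roughly like the sketch below; the device/port (mlx5_0:1), the transport list and the binary name are assumptions to be adjusted for the local hardware:

```shell
# Sketch of a yalla/MXM launch on Open-MPI 1.10.x per the Mellanox docs;
# mlx5_0:1, the MXM_TLS list and ./mpi_app are placeholders, not verified here.
mpirun -np 16 \
       -mca pml yalla \
       -x MXM_RDMA_PORTS=mlx5_0:1 \
       -x MXM_TLS=self,shm,rc \
       ./mpi_app
```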
Where is this driver from? OS, or OFED, or?
We use primarily MVAPICH2 but I would be curious to try to duplicate this on
our mlx5 equipment.
What model cards do you have?
--
Ryan Novosielski
Rutgers, the State University of New Jersey
Hi folks,
I'm helping another group out and we've found that running an Open-MPI
program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5
cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS
or other diagnostics and has to be power cycled.
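To make the "even just a singleton" case concrete: launching the binary directly, with no mpirun at all, is enough to trigger the lockup. A sketch, where ./mpi_hello stands in for any trivial MPI_Init/MPI_Finalize program built against the affected Open-MPI:

```shell
# Singleton launch: run the MPI binary directly, no mpirun involved.
./mpi_hello

# An explicit single-rank launch exercises the same code paths.
mpirun -np 1 ./mpi_hello
```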