Re: [Beowulf] Killing nodes with Open-MPI?

2017-11-05 Thread Christopher Samuel
On 26/10/17 22:42, Chris Samuel wrote:
> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5
> cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS
> or other

Re: [Beowulf] Killing nodes with Open-MPI?

2017-10-26 Thread Chris Samuel
On Friday, 27 October 2017 8:58:02 AM AEDT Lance Wilson wrote:
> We are running CX4 cards and have had some issues as well. Which version/s
> of openmpi are they running?

This is with OMPI 1.10.x, 2.0.2 and 3.0.0. Unfortunately only OMPI 3.0.0 seems compatible with their Slurm install

Re: [Beowulf] Killing nodes with Open-MPI?

2017-10-26 Thread Chris Samuel
On Friday, 27 October 2017 3:30:29 AM AEDT Ryan Novosielski wrote:
> Where is this driver from? OS, or OFED, or?

OFED 4.1, sorry.

> We use primarily MVAPICH2 but I would be curious to try to duplicate this on
> our mlx5 equipment.
>
> What model cards do you have?

These are MT27710 and MT27800

Re: [Beowulf] Killing nodes with Open-MPI?

2017-10-26 Thread Lance Wilson via Beowulf
Hi Chris,

We are running CX4 cards and have had some issues as well. Which version/s of openmpi are they running?

If you follow the instructions from Mellanox and run with yalla and mxm, that works(ish) on openmpi 1.10.3, including setting the appropriate environment variables or config file. If
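[For context: the yalla/mxm selection Lance describes is normally done with MCA parameters. A minimal sketch of what such a config file might look like; the component names match OMPI 1.10-era Mellanox setups, but the exact file location and values on their cluster are assumptions.]

```
# ~/.openmpi/mca-params.conf -- illustrative sketch only
# Select the MXM-backed point-to-point layer (needs Mellanox MXM installed)
pml = yalla
mtl = mxm

# Equivalent one-off command line:
#   mpirun --mca pml yalla --mca mtl mxm ./my_mpi_program
```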

Re: [Beowulf] Killing nodes with Open-MPI?

2017-10-26 Thread Ryan Novosielski
Where is this driver from? OS, or OFED, or?

We use primarily MVAPICH2 but I would be curious to try to duplicate this on our mlx5 equipment.

What model cards do you have?

-- Ryan

[Beowulf] Killing nodes with Open-MPI?

2017-10-26 Thread Chris Samuel
Hi folks,

I'm helping another group out and we've found that running an Open-MPI program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5 cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS or other diagnostics and has to be power cycled.