Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Patrick, Do you have any PSM2_* or HFI_* environment variables defined in your runtime environment that could be affecting things? -Original Message- From: users On Behalf Of Heinz, Michael William via users Sent: Wednesday, January 27, 2021 3:37 PM To: Open MPI Users Cc: Heinz,
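
One way to answer the question above is to list every variable whose name starts with PSM2_ or HFI_. The following is a minimal sketch in Python; the helper name `find_fabric_env` is hypothetical and only the two prefixes come from the message.

```python
import os

# Prefixes named in the message; any matching variable might affect PSM2/HFI behavior.
PREFIXES = ("PSM2_", "HFI_")


def find_fabric_env(environ=None):
    """Return a dict of variables whose names start with PSM2_ or HFI_, sorted by name."""
    if environ is None:
        environ = os.environ
    return {name: value
            for name, value in sorted(environ.items())
            if name.startswith(PREFIXES)}


if __name__ == "__main__":
    found = find_fabric_env()
    if not found:
        print("No PSM2_* or HFI_* variables are set.")
    for name, value in found.items():
        print(f"{name}={value}")
```

Launching the script the same way as the failing job, rather than from a login shell, is probably more informative, since the ranks may inherit a different environment.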

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Heinz, Michael William via users
Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or by Cornelis Networks - but I should point out you can download the latest official source for PSM2 and the drivers from GitHub. -Original Message- From: users On Behalf Of Michael Di Domenico via users Sent:

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
if you have OPA cards, for openmpi you only need --with-ofi, you don't need psm/psm2/verbs/ucx. but this assumes you're running a rhel based distro and have installed the OPA fabric suite of software from Intel/CornelisNetworks. which is what i have. perhaps there's something really odd in
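
The build described above can be sketched as a configure command line. This is only one reading of the message: it assumes the components that are "not needed" are switched off explicitly with the usual `--without-X` configure convention, and the helper `configure_args` is hypothetical. Check `./configure --help` for the Open MPI version at hand before relying on any flag.

```python
def configure_args(enable=("ofi",), disable=("psm", "psm2", "verbs", "ucx")):
    """Build an Open MPI configure command from component names.

    The default component lists come from the message; the --with-/--without-
    spelling is an assumption based on the usual configure convention.
    """
    args = ["./configure"]
    args += [f"--with-{name}" for name in enable]
    args += [f"--without-{name}" for name in disable]
    return args


if __name__ == "__main__":
    # Print the command so it can be reviewed before it is run by hand.
    print(" ".join(configure_args()))
```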

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Patrick Begou via users
Hi Howard and Michael, first, many thanks for testing with my short application. Yes, when the test code runs fine it just shows the max RSS size of the rank 0 process. When it runs wrong it prints a message about each invalid value found. As I said, I have also deployed OpenMPI on various clusters (in

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Pritchard Jr., Howard via users
Hi Folks, I'm also having problems reproducing this on one of our OPA clusters: libpsm2-11.2.78-1.el7.x86_64 libpsm2-devel-11.2.78-1.el7.x86_64 cluster runs RHEL 7.8 hca_id: hfi1_0 transport: InfiniBand (0) fw_ver: 1.27.0

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-27 Thread Michael Di Domenico via users
for whatever it's worth, running the test program on my OPA cluster seems to work. well, it keeps spitting out [INFO MEMORY] lines, not sure if it's supposed to stop at some point. i'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs} On Tue, Jan 26, 2021 at 3:44 PM