Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

2021-02-08 Thread Patrick Begou via users
Hi,

Just to update this discussion and also answer some questions here:

1. What version of IFS are you running?
ii  ifs-kernel-updates-dev                  1:3.10.0-1062-2123-1ifs+deb9  amd64  Development headers for Intel HFI1 driver interface
ii  kmod-ifs-kernel-updates-4.9.0-13-amd64  1:3.10.0-1062-2123-1ifs+deb9  amd64  Updated kernel modules for Omni-Path
ii  kmod-ifs-kernel-updates-4.9.0-6-amd64   1:3.10.0-514-724-2ifs+deb9    amd64  Updated kernel modules for Omni-Path
rc  kmod-ifs-kernel-updates-4.9.0-8-amd64   1:3.10.0-957-1793-1ifs+deb9   amd64  Updated kernel modules for Omni-Path

2. Are you using CUDA cards by any chance?
No, these nodes do not have any GPU and the code is MPI only (no hybrid 
implementation)

Over the last few days:
- I have compiled libpsm2 from the GitHub sources, but it seems to be
at the same level of development as the installed one, and it does not
solve the problem.
- Another user tried to deploy OpenMPI "automagically" with the Spack tool,
but the problem shows up there too.
- The problem also exists with the OpenMPI 4.0.3 provided by the O.S.
- I tried to run a test with MPICH (installed in the O.S.), but it is not
compatible with the local batch scheduler and the install is not
functional.
- I've downgraded my simulation code back to point-to-point
communications (12% slower) as a workaround, so the PhD students on this
supercomputer can keep working while a solution is found.
- I've opened an issue on https://github.com/cornelisnetworks/opa-psm2
describing the problem and providing the test case. Thanks to Michael,
who is looking at this.
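
On the earlier question about PSM2_* or HFI_* environment variables (quoted below), a quick way to check is something like the following. This is only a minimal sketch, assuming a POSIX shell with `env` and `grep`; the helper name `list_fabric_env` is made up for illustration and is not part of PSM2 or OpenMPI.

```shell
# Hypothetical helper (not from PSM2 or OpenMPI): print every environment
# variable whose name starts with PSM2_ or HFI_.
list_fabric_env() {
    # env prints NAME=value pairs, one per line; keep only the matching names.
    # If grep finds nothing, print a short notice instead.
    env | grep -E '^(PSM2|HFI)_' || echo "no PSM2_* or HFI_* variables set"
}

list_fabric_env
```

It is most useful when run from inside the batch job script itself, since the scheduler may set or strip variables compared to an interactive login.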

Patrick

On 28/01/2021 at 17:52, Heinz, Michael William via users wrote:
> Patrick,
>
> A few more questions for you:
>
> 1. What version of IFS are you running?
> 2. Are you using CUDA cards by any chance? If so, what version of CUDA?
>
> -----Original Message-----
> From: Heinz, Michael William 
> Sent: Wednesday, January 27, 2021 3:45 PM
> To: Open MPI Users 
> Subject: RE: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Patrick,
>
> Do you have any PSM2_* or HFI_* environment variables defined in your run 
> time environment that could be affecting things?
>
>
> -----Original Message-----
> From: users  On Behalf Of Heinz, Michael 
> William via users
> Sent: Wednesday, January 27, 2021 3:37 PM
> To: Open MPI Users 
> Cc: Heinz, Michael William 
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> Unfortunately, OPA/PSM support for Debian isn't handled by Intel directly or 
> by Cornelis Networks - but I should point out you can download the latest 
> official source for PSM2 and the drivers from Github.
>
> -----Original Message-----
> From: users  On Behalf Of Michael Di 
> Domenico via users
> Sent: Wednesday, January 27, 2021 3:32 PM
> To: Open MPI Users 
> Cc: Michael Di Domenico 
> Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
>
> if you have OPA cards, for openmpi you only need --with-ofi, you don't need 
> psm/psm2/verbs/ucx.  but this assumes you're running a rhel based distro and 
> have installed the OPA fabric suite of software from Intel/CornelisNetworks.  
> which is what i have.  perhaps there's something really odd in debian or 
> there's an incompatibility with the older ofed drivers perhaps included with 
> debian.  unfortunately i don't have access to a debian, so i can't be much 
> more help
>
> if i had to guess totally pulling junk from the air, there's probably 
> something incompatible with PSM and OPA when running specifically on debian 
> (likely due to library versioning).  i don't know how common that is, so it's 
> not clear how fleshed out and tested it is
>
>
>
>
> On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users 
>  wrote:
>> Hi Howard and Michael
>>
>> first, many thanks for testing with my short application. Yes, when the 
>> test code runs fine it just shows the max RSS size of the rank 0 process.
>> When it runs wrong it prints a message about each invalid value found.
>>
>> As I said, I have also deployed OpenMPI on various clusters (in the DELL 
>> data center at Austin) when I was testing some architectures some 
>> months ago, and I had no problem on either AMD/Mellanox_IB or 
>> Intel/Omni-path. The goal was running my tests with the same software stacks and 
>> being sure I could deploy my software stack on the selected solution.
>> But like your clusters (and my small local clusters), they were all 
>> running RedHat (or similar Linux flavors) and a modern GNU compiler (9 or 
>> 10).
>> The university's cluster I have access to is running Debian stretch and 
>> provides GCC6 as the default compiler.
>>
>> I cannot ask for a different OS, but I can deploy a local gcc10 and 
>> rebuild OpenMPI.  UCX is not available on this cluster; should I 
>> deploy a local UCX too?
>>
>> Libpsm2 seems good:
>> dahu103 : dpkg -l |grep psm
>> ii  libfabric-psm  1.10.0-2-1ifs+deb9  amd64 

Re: [OMPI users] OMPI 4.1 in Cygwin packages?

2021-02-08 Thread Martín Morales via users
Hi Marco,

Apologies for my delay. I tried 4.1.0 and it worked!!
Thank you very much for your assistance. Kind regards,

Martín

From: Marco Atzeri
Sent: Saturday, February 6, 2021 08:54
To: Martín Morales; Open MPI Users
Subject: Re: [OMPI users] OMPI 4.1 in Cygwin packages?

Martin,

what is the IP address of the machine you cannot connect to?

All those VMware interfaces look suspicious, anyway.


In the meantime I uploaded 4.1.0-1 for X86_64;
you can try it to see if it solves the issue.

The i686 version is still in the build phase.


On 05.02.2021 20:46, Martín Morales wrote:
> Hi Marco,
>
> The output is pasted below.
>
> Thank you. Regards,
>
> Martín

>
> internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}
> flags: AF_INET6 up running multicast
> address:   fe80::e5c6:c83:8653:3cd8%14
> friendly_name: VMware Network Adapter VMnet1
>
> internal_name:  {A6301D34-A586-4439-B7A7-69FA905CA167}
> flags: AF_INET  up broadcast running multicast
> address:   192.168.148.1
> friendly_name: VMware Network Adapter VMnet1
>