Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-16 Thread Cabral, Matias A
Hey Eduardo,

Using up-to-date libraries is advisable, especially given that 1.4.0 is a
couple of years old. 1.6.2 is the latest on the 1.6.x line; 1.7.0 was released
last week, but I have not played with it yet.
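
For reference, a quick way to check which libfabric release is actually
installed on a node (these are standard libfabric and Debian commands, but
package names can differ per distribution, so treat this only as a sketch):

  fi_info --version        # reports the libfabric library version in use
  dpkg -l | grep libfabric # shows the Debian package version (e.g. libfabric1)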

Thanks
_MAC

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Wednesday, January 16, 2019 9:29 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Matias,



thanks so much for your support!



Actually, running this simple example with --mca mtl_ofi_tag_mode ofi_tag_1
turns out to be a good choice! I mean, the following execution does not return
the MPI_Send error any more:



mpirun -np 2 --mca mtl_ofi_tag_mode ofi_tag_1 ./a



Are you suggesting that upgrading libfabric to 1.6.0 might save the day?



Regards,

Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Wednesday, January 16, 2019 00:54
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,


> When you say that "The OFI MTL got some new features during 2018 that went
> into v4.0.0 but are not backported to older OMPI versions.", this agrees with
> the behaviour that I witness: using Open MPI 3.1.3 I don't get this error.
> Could this be related?

Yes. I suspect this may be related to the inclusion of support for
FI_REMOTE_CQ_DATA, which was added to extend the scalability of the OFI MTL.
This went into 4.x, but is not in 3.1.x. In addition, there is a bug in the
PSM2 OFI provider that reports support for FI_REMOTE_CQ_DATA when it does not
actually support the API that OMPI uses (this was fixed in libfabric 1.6.0).
A quick way to test this would be adding '-mca mtl_ofi_tag_mode ofi_tag_1' to
your command line. This forces OMPI not to use FI_REMOTE_CQ_DATA.
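
If that workaround does help, the same MCA parameter can also be set outside
the command line. A minimal sketch, assuming the standard OMPI_MCA_*
environment variables and the per-user mca-params.conf file:

  # per shell, via an environment variable
  export OMPI_MCA_mtl_ofi_tag_mode=ofi_tag_1
  mpirun -np 2 ./a

  # or persistently, by adding this line to $HOME/.openmpi/mca-params.conf
  mtl_ofi_tag_mode = ofi_tag_1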

Thanks,

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Tuesday, January 15, 2019 2:31 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Matias,



Thank you so much for your feedback!



It's really embarrassing, but running



mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 ./a



still doesn't get the job done. I'm still getting the same MPI_Send error:



Hello World from proccess 1 out of 2
Hello World from proccess 0 out of 2
[gafront4:18272] *** An error occurred in MPI_Send
[gafront4:18272] *** reported by process [2565799937,0]
[gafront4:18272] *** on communicator MPI_COMM_WORLD
[gafront4:18272] *** MPI_ERR_OTHER: known error not in list
[gafront4:18272] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[gafront4:18272] ***and potentially your MPI job)



I'm using libfabric-1.4.0 as shipped with Debian Stretch, with a minor
modification to use PSM2. It can be found here:



https://github.com/scibian/libfabric/commits/scibian/opa10.7/stretch



When you say that "The OFI MTL got some new features during 2018 that went into
v4.0.0 but are not backported to older OMPI versions.", this agrees with the
behaviour that I witness: using Open MPI 3.1.3 I don't get this error. Could
this be related?



Regards,

Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Wednesday, January 16, 2019 00:54
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

BTW, just to be explicit about using the psm2 OFI provider:

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 
./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Cabral, 
Matias A
Sent: Friday, January 11, 2019 3:22 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I'm running some quick tests and I see it 
working:



/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo r

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-16 Thread ROTHE Eduardo - externe
Hi Matias,


thanks so much for your support!


Actually, running this simple example with --mca mtl_ofi_tag_mode ofi_tag_1
turns out to be a good choice! I mean, the following execution does not return
the MPI_Send error any more:


mpirun -np 2 --mca mtl_ofi_tag_mode ofi_tag_1 ./a


Are you suggesting that upgrading libfabric to 1.6.0 might save the day?


Regards,

Eduardo



From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Wednesday, January 16, 2019 00:54
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,


> When you say that "The OFI MTL got some new features during 2018 that went
> into v4.0.0 but are not backported to older OMPI versions.", this agrees with
> the behaviour that I witness: using Open MPI 3.1.3 I don't get this error.
> Could this be related?

Yes. I suspect this may be related to the inclusion of support for
FI_REMOTE_CQ_DATA, which was added to extend the scalability of the OFI MTL.
This went into 4.x, but is not in 3.1.x. In addition, there is a bug in the
PSM2 OFI provider that reports support for FI_REMOTE_CQ_DATA when it does not
actually support the API that OMPI uses (this was fixed in libfabric 1.6.0).
A quick way to test this would be adding '-mca mtl_ofi_tag_mode ofi_tag_1' to
your command line. This forces OMPI not to use FI_REMOTE_CQ_DATA.

Thanks,

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Tuesday, January 15, 2019 2:31 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Matias,



Thank you so much for your feedback!



It's really embarrassing, but running



mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 ./a



still doesn't get the job done. I'm still getting the same MPI_Send error:



Hello World from proccess 1 out of 2
Hello World from proccess 0 out of 2
[gafront4:18272] *** An error occurred in MPI_Send
[gafront4:18272] *** reported by process [2565799937,0]
[gafront4:18272] *** on communicator MPI_COMM_WORLD
[gafront4:18272] *** MPI_ERR_OTHER: known error not in list
[gafront4:18272] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[gafront4:18272] ***and potentially your MPI job)



I'm using libfabric-1.4.0 as shipped with Debian Stretch, with a minor
modification to use PSM2. It can be found here:



https://github.com/scibian/libfabric/commits/scibian/opa10.7/stretch



When you say that "The OFI MTL got some new features during 2018 that went into
v4.0.0 but are not backported to older OMPI versions.", this agrees with the
behaviour that I witness: using Open MPI 3.1.3 I don't get this error. Could
this be related?



Regards,

Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Saturday, January 12, 2019 00:32
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

BTW, just to be explicit about using the psm2 OFI provider:

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 
./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Cabral, 
Matias A
Sent: Friday, January 11, 2019 3:22 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I’m running some quick tests and I see it 
working:



/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo revision: v4.0.0
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.0.0rc5
      OPAL repo revision: v4.0.0
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.0.0rc5
                  Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: sperf-41.sc.intel.com
           Configured by: macabral
           Configured on: Fri Jan 11 17:42:06 EST 2019
          Configure host: sperf-41.sc.intel.com
  Configure command line: '--with-ofi' '--with-verbs=no'

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-15 Thread Cabral, Matias A
Hi Eduardo,


> When you say that "The OFI MTL got some new features during 2018 that went
> into v4.0.0 but are not backported to older OMPI versions.", this agrees with
> the behaviour that I witness: using Open MPI 3.1.3 I don't get this error.
> Could this be related?

Yes. I suspect this may be related to the inclusion of support for
FI_REMOTE_CQ_DATA, which was added to extend the scalability of the OFI MTL.
This went into 4.x, but is not in 3.1.x. In addition, there is a bug in the
PSM2 OFI provider that reports support for FI_REMOTE_CQ_DATA when it does not
actually support the API that OMPI uses (this was fixed in libfabric 1.6.0).
A quick way to test this would be adding '-mca mtl_ofi_tag_mode ofi_tag_1' to
your command line. This forces OMPI not to use FI_REMOTE_CQ_DATA.

Thanks,

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Tuesday, January 15, 2019 2:31 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Matias,



Thank you so much for your feedback!



It's really embarrassing, but running



mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 ./a



still doesn't get the job done. I'm still getting the same MPI_Send error:



Hello World from proccess 1 out of 2
Hello World from proccess 0 out of 2
[gafront4:18272] *** An error occurred in MPI_Send
[gafront4:18272] *** reported by process [2565799937,0]
[gafront4:18272] *** on communicator MPI_COMM_WORLD
[gafront4:18272] *** MPI_ERR_OTHER: known error not in list
[gafront4:18272] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[gafront4:18272] ***and potentially your MPI job)



I'm using libfabric-1.4.0 as shipped with Debian Stretch, with a minor
modification to use PSM2. It can be found here:



https://github.com/scibian/libfabric/commits/scibian/opa10.7/stretch



When you say that "The OFI MTL got some new features during 2018 that went into
v4.0.0 but are not backported to older OMPI versions.", this agrees with the
behaviour that I witness: using Open MPI 3.1.3 I don't get this error. Could
this be related?



Regards,

Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Saturday, January 12, 2019 00:32
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

BTW, just to be explicit about using the psm2 OFI provider:

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 
./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Cabral, 
Matias A
Sent: Friday, January 11, 2019 3:22 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I'm running some quick tests and I see it 
working:



/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo revision: v4.0.0
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.0.0rc5
      OPAL repo revision: v4.0.0
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.0.0rc5
                  Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: sperf-41.sc.intel.com
           Configured by: macabral
           Configured on: Fri Jan 11 17:42:06 EST 2019
          Configure host: sperf-41.sc.intel.com
  Configure command line: '--with-ofi' '--with-verbs=no'
                          '--prefix=/tmp/ompi-4.0.0'

/tmp> rpm -qi libfabric
Name: libfabric
Version : 1.6.0
Release : 80
Architecture: x86_64
Install Date: Wed 19 Dec 2018 05:45:41 PM EST
Group   : System Environment/Libraries
Size: 10131964
License : GPLv2 or BSD
Signature   : (none)
Source RPM  : libfabric-1.6.0-80.src.rpm
Build Date  : Wed 22 Aug 2018 11:08:29 PM EDT
Build Host  : ph-bld-node-27.ph.intel.com
Relocations : (not relocatable)
URL : http://www.github.com/ofiwg/libfabric
Summary : User-space RDMA Fabric Interfaces
Description 

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-15 Thread ROTHE Eduardo - externe
Hi Matias,


Thank you so much for your feedback!


It's really embarrassing, but running


mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 ./a


still doesn't get the job done. I'm still getting the same MPI_Send error:


Hello World from proccess 1 out of 2
Hello World from proccess 0 out of 2
[gafront4:18272] *** An error occurred in MPI_Send
[gafront4:18272] *** reported by process [2565799937,0]
[gafront4:18272] *** on communicator MPI_COMM_WORLD
[gafront4:18272] *** MPI_ERR_OTHER: known error not in list
[gafront4:18272] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[gafront4:18272] ***and potentially your MPI job)


I'm using libfabric-1.4.0 as shipped with Debian Stretch, with a minor
modification to use PSM2. It can be found here:


https://github.com/scibian/libfabric/commits/scibian/opa10.7/stretch


When you say that "The OFI MTL got some new features during 2018 that went into
v4.0.0 but are not backported to older OMPI versions.", this agrees with the
behaviour that I witness: using Open MPI 3.1.3 I don't get this error. Could
this be related?


Regards,

Eduardo



From: users <users-boun...@lists.open-mpi.org> on behalf of matias.a.cab...@intel.com
Sent: Saturday, January 12, 2019 00:32
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

BTW, just to be explicit about using the psm2 OFI provider:

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 
./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Cabral, 
Matias A
Sent: Friday, January 11, 2019 3:22 PM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I’m running some quick tests and I see it 
working:



/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo revision: v4.0.0
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.0.0rc5
      OPAL repo revision: v4.0.0
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.0.0rc5
                  Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: sperf-41.sc.intel.com
           Configured by: macabral
           Configured on: Fri Jan 11 17:42:06 EST 2019
          Configure host: sperf-41.sc.intel.com
  Configure command line: '--with-ofi' '--with-verbs=no'
                          '--prefix=/tmp/ompi-4.0.0'
….
/tmp> rpm -qi libfabric
Name: libfabric
Version : 1.6.0
Release : 80
Architecture: x86_64
Install Date: Wed 19 Dec 2018 05:45:41 PM EST
Group   : System Environment/Libraries
Size: 10131964
License : GPLv2 or BSD
Signature   : (none)
Source RPM  : libfabric-1.6.0-80.src.rpm
Build Date  : Wed 22 Aug 2018 11:08:29 PM EDT
Build Host  : ph-bld-node-27.ph.intel.com
Relocations : (not relocatable)
URL : http://www.github.com/ofiwg/libfabric
Summary : User-space RDMA Fabric Interfaces
Description :
libfabric provides a user-space API to access high-performance fabric
services, such as RDMA.

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm ./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Thursday, January 10, 2019 10:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Gilles, thank you so much once again!

I have success using the psm2 mtl directly. Indeed, I do not need to use the
cm pml (I guess this might be because the cm pml gets automatically selected
when I enforce the psm2 mtl?). So the following two commands both execute
successfully with Open MPI 4.0.0:

  > mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out
  > mpirun --mca mtl psm2 -np 2 ./a.out

The error persists using libfabric. The following command returns the MPI_Send 
error:

  > mpirun --mca pml cm --mca mtl ofi -np 2 ./a.out

It seems the problem sits between libfabric and Open MPI 4.0.0 (remember, I 
don't see the same behaviour with O

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-11 Thread Cabral, Matias A
BTW, just to be explicit about using the psm2 OFI provider:

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm -mca mtl_ofi_provider_include psm2 
./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Cabral, 
Matias A
Sent: Friday, January 11, 2019 3:22 PM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I'm running some quick tests and I see it 
working:



/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo revision: v4.0.0
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.0.0rc5
      OPAL repo revision: v4.0.0
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.0.0rc5
                  Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: sperf-41.sc.intel.com
           Configured by: macabral
           Configured on: Fri Jan 11 17:42:06 EST 2019
          Configure host: sperf-41.sc.intel.com
  Configure command line: '--with-ofi' '--with-verbs=no'
                          '--prefix=/tmp/ompi-4.0.0'

/tmp> rpm -qi libfabric
Name: libfabric
Version : 1.6.0
Release : 80
Architecture: x86_64
Install Date: Wed 19 Dec 2018 05:45:41 PM EST
Group   : System Environment/Libraries
Size: 10131964
License : GPLv2 or BSD
Signature   : (none)
Source RPM  : libfabric-1.6.0-80.src.rpm
Build Date  : Wed 22 Aug 2018 11:08:29 PM EDT
Build Host  : ph-bld-node-27.ph.intel.com
Relocations : (not relocatable)
URL : http://www.github.com/ofiwg/libfabric
Summary : User-space RDMA Fabric Interfaces
Description :
libfabric provides a user-space API to access high-performance fabric
services, such as RDMA.

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm ./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Thursday, January 10, 2019 10:02 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Gilles, thank you so much once again!

I have success using the psm2 mtl directly. Indeed, I do not need to use the
cm pml (I guess this might be because the cm pml gets automatically selected
when I enforce the psm2 mtl?). So the following two commands both execute
successfully with Open MPI 4.0.0:

  > mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out
  > mpirun --mca mtl psm2 -np 2 ./a.out

The error persists using libfabric. The following command returns the MPI_Send 
error:

  > mpirun --mca pml cm --mca mtl ofi -np 2 ./a.out

It seems the problem sits between libfabric and Open MPI 4.0.0 (remember, I
don't see the same behaviour with Open MPI 3.1.3). So I guess if I want to use
libfabric I will have to dig a bit more into the interface between this
library and Open MPI 4.0.0. Any pointers on where to start here would be
(very!) appreciated.

If you have any diagram or document that would help me understand the
framework/module architecture and why some modules are automatically selected
(as in the case above), I would be even more pleased!

Regards,
Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of gilles.gouaillar...@gmail.com
Sent: Thursday, January 10, 2019 13:51
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Eduardo,

You have two options to use OmniPath

- "directly" via the psm2 mtl
mpirun -mca pml cm -mca mtl psm2 ...

- "indirectly" via libfabric
mpirun -mca pml cm -mca mtl ofi ...

I do invite you to try both. By explicitly requesting the mtl you will avoid 
potential conflicts.

libfabric is used in production by Cisco and AWS (both major contributors to
both Open MPI and libfabric), so this is clearly not something to stay away
from. That being said, bugs always happen, and they could be related to Open
MPI, libfabric and/or OmniPath (and FWIW, Intel is a major contributor to

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-11 Thread Cabral, Matias A
Hi Eduardo,

The OFI MTL got some new features during 2018 that went into v4.0.0 but are not 
backported to older OMPI versions.

What version of libfabric are you using and where are you installing it from?  
I will try to reproduce your error. I'm running some quick tests and I see it 
working:


/tmp >ompi_info
                 Package: Open MPI macab...@sperf-41.sc.intel.com Distribution
                Open MPI: 4.0.0rc5
  Open MPI repo revision: v4.0.0
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.0.0rc5
  Open RTE repo revision: v4.0.0
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.0.0rc5
      OPAL repo revision: v4.0.0
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.0.0rc5
                  Prefix: /nfs/sc/disks/fabric_work/macabral/tmp/ompi-4.0.0
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: sperf-41.sc.intel.com
           Configured by: macabral
           Configured on: Fri Jan 11 17:42:06 EST 2019
          Configure host: sperf-41.sc.intel.com
  Configure command line: '--with-ofi' '--with-verbs=no'
                          '--prefix=/tmp/ompi-4.0.0'

/tmp> rpm -qi libfabric
Name: libfabric
Version : 1.6.0
Release : 80
Architecture: x86_64
Install Date: Wed 19 Dec 2018 05:45:41 PM EST
Group   : System Environment/Libraries
Size: 10131964
License : GPLv2 or BSD
Signature   : (none)
Source RPM  : libfabric-1.6.0-80.src.rpm
Build Date  : Wed 22 Aug 2018 11:08:29 PM EDT
Build Host  : ph-bld-node-27.ph.intel.com
Relocations : (not relocatable)
URL : http://www.github.com/ofiwg/libfabric
Summary : User-space RDMA Fabric Interfaces
Description :
libfabric provides a user-space API to access high-performance fabric
services, such as RDMA.

/tmp> mpirun -np 2 -mca mtl ofi -mca pml cm ./a
Hello World from proccess 0 out of 2
This is process 0 reporting::
Hello World from proccess 1 out of 2
Process 1 received number 10 from process 0


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of ROTHE 
Eduardo - externe
Sent: Thursday, January 10, 2019 10:02 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send


Hi Gilles, thank you so much once again!

I have success using the psm2 mtl directly. Indeed, I do not need to use the
cm pml (I guess this might be because the cm pml gets automatically selected
when I enforce the psm2 mtl?). So the following two commands both execute
successfully with Open MPI 4.0.0:

  > mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out
  > mpirun --mca mtl psm2 -np 2 ./a.out

The error persists using libfabric. The following command returns the MPI_Send 
error:

  > mpirun --mca pml cm --mca mtl ofi -np 2 ./a.out

It seems the problem sits between libfabric and Open MPI 4.0.0 (remember, I
don't see the same behaviour with Open MPI 3.1.3). So I guess if I want to use
libfabric I will have to dig a bit more into the interface between this
library and Open MPI 4.0.0. Any pointers on where to start here would be
(very!) appreciated.

If you have any diagram or document that would help me understand the
framework/module architecture and why some modules are automatically selected
(as in the case above), I would be even more pleased!

Regards,
Eduardo




From: users <users-boun...@lists.open-mpi.org> on behalf of gilles.gouaillar...@gmail.com
Sent: Thursday, January 10, 2019 13:51
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Eduardo,

You have two options to use OmniPath

- "directly" via the psm2 mtl
mpirun -mca pml cm -mca mtl psm2 ...

- "indirectly" via libfabric
mpirun -mca pml cm -mca mtl ofi ...

I do invite you to try both. By explicitly requesting the mtl you will avoid 
potential conflicts.

libfabric is used in production by Cisco and AWS (both major contributors to
both Open MPI and libfabric), so this is clearly not something to stay away
from. That being said, bugs always happen, and they could be related to Open
MPI, libfabric and/or OmniPath (and FWIW, Intel is a major contributor to
libfabric too)

Cheers,

Gilles

On Thursday, January 10, 2019, ROTHE Eduardo - externe
<eduardo-externe.ro...@edf.fr> wrote:

Hi Gilles, thank you so much for your support!

For now I'm just testing the software, so it's running on a single node.

Your suggestion was very precise. In fact, choosing the ob1 component leads to
a successful execution! The tcp component had no effect.

mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out > Success
mpirun --mca pml ob1 -np 2 ./a.out > Success

But... our cluster is equipped with Intel Omni-Path interconnects and we

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-10 Thread ROTHE Eduardo - externe
Hi Gilles, thank you so much once again!

I have success using the psm2 mtl directly. Indeed, I do not need to use the
cm pml (I guess this might be because the cm pml gets automatically selected
when I enforce the psm2 mtl?). So the following two commands both execute
successfully with Open MPI 4.0.0:

  > mpirun --mca pml cm --mca mtl psm2 -np 2 ./a.out
  > mpirun --mca mtl psm2 -np 2 ./a.out

The error persists using libfabric. The following command returns the MPI_Send 
error:

  > mpirun --mca pml cm --mca mtl ofi -np 2 ./a.out

It seems the problem sits between libfabric and Open MPI 4.0.0 (remember, I
don't see the same behaviour with Open MPI 3.1.3). So I guess if I want to use
libfabric I will have to dig a bit more into the interface between this
library and Open MPI 4.0.0. Any pointers on where to start here would be
(very!) appreciated.

If you have any diagram or document that would help me understand the
framework/module architecture and why some modules are automatically selected
(as in the case above), I would be even more pleased!
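
For what it's worth, one way to see which pml and mtl components are being
considered and selected at run time is to raise the standard MCA verbosity
parameters; ompi_info can also list the OFI MTL parameters (including
mtl_ofi_tag_mode and mtl_ofi_provider_include). A sketch using standard
Open MPI tooling only:

  mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 -np 2 ./a.out
  ompi_info --param mtl ofi --level 9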

Regards,
Eduardo



From: users <users-boun...@lists.open-mpi.org> on behalf of gilles.gouaillar...@gmail.com
Sent: Thursday, January 10, 2019 13:51
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Eduardo,

You have two options to use OmniPath

- "directly" via the psm2 mtl
mpirun --mca pml cm --mca mtl psm2 ...

- "indirectly" via libfabric
mpirun --mca pml cm --mca mtl ofi ...

I do invite you to try both. By explicitly requesting the mtl you will avoid 
potential conflicts.

libfabric is used in production by Cisco and AWS (both major contributors to
both Open MPI and libfabric), so this is clearly not something to stay away
from. That being said, bugs always happen, and they could be related to Open
MPI, libfabric and/or OmniPath (and FWIW, Intel is a major contributor to
libfabric too)

Cheers,

Gilles

On Thursday, January 10, 2019, ROTHE Eduardo - externe
<eduardo-externe.ro...@edf.fr> wrote:

Hi Gilles, thank you so much for your support!

For now I'm just testing the software, so it's running on a single node.

Your suggestion was very precise. In fact, choosing the ob1 component leads to
a successful execution! The tcp component had no effect.

mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out > Success
mpirun --mca pml ob1 -np 2 ./a.out > Success

But... our cluster is equipped with Intel Omni-Path interconnects, and we are
aiming to use psm2 through the ofi component in order to take full advantage
of this technology.

I believe your suggestion is showing that the problem is right here. But 
unfortunately I cannot see further.

Meanwhile, I've also compiled Open MPI 3.1.3 and I have a successful run with
the same options and the same environment (no MPI_Send error). Could Open MPI
4.0.0 bring a different behaviour in this area? Possibly regarding the ofi
component?

Do you have any idea that I could put in practice to narrow the problem further?

Regards,
Eduardo

ps: I've recompiled Open MPI 4.0.0 using --with-hwloc=external, but with no 
different results (the same MPI_Send error);

ps2: Yes, the configure line thing is really fishy, the original line was 
--prefix=/opt/openmpi/4.0.0 --with-pmix=/usr/lib/x86_64-linux-gnu/pmix 
--with-libevent=external --with-slurm --enable-mpi-cxx --with-ofi 
--with-verbs=no --disable-silent-rules --with-hwloc=/usr 
--enable-mpirun-prefix-by-default --with-devel-headers



From: users <users-boun...@lists.open-mpi.org> on behalf of gilles.gouaillar...@gmail.com
Sent: Wednesday, January 9, 2019 15:16
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Eduardo,

The first part of the configure command line is for an install in /usr, but
then there is '--prefix=/opt/openmpi/4.0.0' and this is very fishy.
You should also use '--with-hwloc=external'.

How many nodes are you running on and which interconnect are you using ?
What if you
mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out

Cheers,

Gilles

On Wednesday, January 9, 2019, ROTHE Eduardo - externe
<eduardo-externe.ro...@edf.fr> wrote:
Hi.

I'm testing Open MPI 4.0.0 and I'm struggling with a weird behaviour. In a very
simple example (very frustrating), I'm having the following error returned by
MPI_Send:

  [gafront4:25692] *** An error occurred in MPI_Send
  [gafront4:25692] *** reported by process [3152019457,0]
  [gafront4:25692] *** on communicator MPI_COMM_WORLD
  [gafront4:25692] *** MPI_ERR_OTHER: known error not in list
  [gafront4:25692] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
  [gafront4:25692] ***and potentially your MPI job)

On the same machine I have two other installations of Open MPI (2.0.2 and 2.1

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-10 Thread Peter Kjellström
On Thu, 10 Jan 2019 21:51:03 +0900
Gilles Gouaillardet  wrote:

> Eduardo,
> 
> You have two options to use OmniPath
> 
> - "directly" via the psm2 mtl
> mpirun --mca pml cm --mca mtl psm2 ...
> 
> - "indirectly" via libfabric
> mpirun --mca pml cm --mca mtl ofi ...
> 
> I do invite you to try both. By explicitly requesting the mtl you will
> avoid potential conflicts.
> 
> libfabric is used in production by Cisco and AWS (both major
> contributors to both Open MPI and libfabric) so this is clearly not
> something to stay away from.

Both I and a second person investigated 4.0.0rc on Omni-Path (see the devel
list thread "Re: [OMPI devel] Announcing Open MPI v4.0.0rc1").

At first both psm2 and ofi seemed broken, but it turned out psm2 only had
problems because ofi got in the way. And ofi was not that easily excluded,
since it also has a btl component.

Essentially I got it working by deleting all mca files matching *ofi*.
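
A less destructive alternative, assuming the standard MCA '^' exclusion
syntax, would be to deselect the ofi components at run time instead of
deleting files, e.g.:

  mpirun --mca pml cm --mca mtl psm2 --mca btl ^ofi -np 2 ./a.out

or, globally, by adding 'mtl = ^ofi' and 'btl = ^ofi' to
$HOME/.openmpi/mca-params.conf.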

YMMV,
 Peter K

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-10 Thread Gilles Gouaillardet
Eduardo,

You have two options to use OmniPath

- "directly" via the psm2 mtl
mpirun --mca pml cm --mca mtl psm2 ...

- "indirectly" via libfabric
mpirun --mca pml cm --mca mtl ofi ...

I do invite you to try both. By explicitly requesting the mtl you will
avoid potential conflicts.
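
As a quick sanity check before picking one, it is worth confirming that both
MTL components were actually built into your installation; a minimal check
using standard ompi_info output filtering:

  ompi_info | grep "MCA mtl"

If ofi or psm2 does not show up in that list, the corresponding mpirun line
above cannot work.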

libfabric is used in production by Cisco and AWS (both major contributors to
both Open MPI and libfabric), so this is clearly not something to stay away
from. That being said, bugs always happen, and they could be related to Open
MPI, libfabric and/or OmniPath (and FWIW, Intel is a major contributor to
libfabric too)

Cheers,

Gilles

On Thursday, January 10, 2019, ROTHE Eduardo - externe <
eduardo-externe.ro...@edf.fr> wrote:

> Hi Gilles, thank you so much for your support!
>
> For now I'm just testing the software, so it's running on a single node.
>
> Your suggestion was very precise. In fact, choosing the ob1 component
> leads to a successful execution! The tcp component had no effect.
>
> mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out > Success
> mpirun --mca pml ob1 -np 2 ./a.out > Success
>
> But... our cluster is equipped with Intel Omni-Path interconnects, and we
> are aiming to use psm2 through the ofi component in order to take full
> advantage of this technology.
>
> I believe your suggestion is showing that the problem is right here. But
> unfortunately I cannot see further.
>
> Meanwhile, I've also compiled Open MPI 3.1.3 and I have a successful run
> with the same options and the same environment (no MPI_Send error). Could
> Open MPI 4.0.0 bring a different behaviour in this area? Possibly
> regarding the ofi component?
>
> Do you have any idea that I could put in practice to narrow the problem
> further?
>
> Regards,
> Eduardo
>
> ps: I've recompiled Open MPI 4.0.0 using --with-hwloc=external, but with
> no different results (the same MPI_Send error);
>
> ps2: Yes, the configure line thing is really fishy, the original line was 
> --prefix=/opt/openmpi/4.0.0
> --with-pmix=/usr/lib/x86_64-linux-gnu/pmix --with-libevent=external
> --with-slurm --enable-mpi-cxx --with-ofi --with-verbs=no
> --disable-silent-rules --with-hwloc=/usr --enable-mpirun-prefix-by-default
> --with-devel-headers
>
>
> --
> *From:* users <users-boun...@lists.open-mpi.org> on behalf of
> gilles.gouaillar...@gmail.com
> *Sent:* Wednesday, January 9, 2019 15:16
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send
>
> Eduardo,
>
> The first part of the configure command line is for an install in /usr,
> but then there is '--prefix=/opt/openmpi/4.0.0' and this is very fishy.
> You should also use '--with-hwloc=external'.
>
> How many nodes are you running on and which interconnect are you using ?
> What if you
> mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out
>
> Cheers,
>
> Gilles
>
> On Wednesday, January 9, 2019, ROTHE Eduardo - externe <
> eduardo-externe.ro...@edf.fr> wrote:
>
>> Hi.
>>
>> I'm testing Open MPI 4.0.0 and I'm struggling with a weird behaviour. In
>> a very simple example (very frustrating), I'm having the following error
>> returned by MPI_Send:
>>
>>   [gafront4:25692] *** An error occurred in MPI_Send
>>   [gafront4:25692] *** reported by process [3152019457,0]
>>   [gafront4:25692] *** on communicator MPI_COMM_WORLD
>>   [gafront4:25692] *** MPI_ERR_OTHER: known error not in list
>>   [gafront4:25692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>>   [gafront4:25692] ***and potentially your MPI job)
>>
>> On the same machine I have two other installations of Open MPI (2.0.2 and
>> 2.1.2), and they both run this dummy program successfully:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv) {
>>     int process;
>>     int population;
>>
>>     MPI_Init(NULL, NULL);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &process);
>>     MPI_Comm_size(MPI_COMM_WORLD, &population);
>>     printf("Hello World from proccess %d out of %d\n", process,
>>            population);
>>
>>     int send_number = 10;
>>     int recv_number;
>>
>>     if (process == 0) {
>>         MPI_Send(&send_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>         printf("This is process 0 reporting::\n");
>>     } else if (process == 1) {
>>         MPI_Recv(&recv_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>>                  MPI_STATUS_IGNORE);
>>         printf("Process 1 received number %d from process 0\n",
>>                recv_number);
>>     }
>>

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-10 Thread Peter Kjellström
On Thu, 10 Jan 2019 11:20:12 +
ROTHE Eduardo - externe  wrote:

> Hi Gilles, thank you so much for your support!
> 
> For now I'm just testing the software, so it's running on a single
> node.
> 
> Your suggestion was very precise. In fact, choosing the ob1 component
> leads to a successful execution! The tcp component had no effect.
> 
> mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out > Success
> mpirun --mca pml ob1 -np 2 ./a.out > Success
> 
> But... our cluster is equipped with Intel Omni-Path interconnects, and
> we are aiming to use psm2 through the ofi component in order to take full
> advantage of this technology.

OFI support in Open MPI has been something to stay away from in my
experience. You should just use the psm2 mtl instead.
/Peter K

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-10 Thread ROTHE Eduardo - externe
Hi Gilles, thank you so much for your support!

For now I'm just testing the software, so it's running on a single node.

Your suggestion was very precise. In fact, choosing the ob1 component leads to
a successful execution! The tcp component had no effect.

mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out > Success
mpirun --mca pml ob1 -np 2 ./a.out > Success

But... our cluster is equipped with Intel Omni-Path interconnects, and we are
aiming to use psm2 through the ofi component in order to take full advantage
of this technology.

I believe your suggestion is showing that the problem is right here. But 
unfortunately I cannot see further.

Meanwhile, I've also compiled Open MPI 3.1.3 and I have a successful run with
the same options and the same environment (no MPI_Send error). Could Open MPI
4.0.0 bring a different behaviour in this area? Possibly regarding the ofi
component?

Do you have any idea that I could put in practice to narrow the problem further?

Regards,
Eduardo

ps: I've recompiled Open MPI 4.0.0 using --with-hwloc=external, but with no 
different results (the same MPI_Send error);

ps2: Yes, the configure line thing is really fishy, the original line was 
--prefix=/opt/openmpi/4.0.0 --with-pmix=/usr/lib/x86_64-linux-gnu/pmix 
--with-libevent=external --with-slurm --enable-mpi-cxx --with-ofi 
--with-verbs=no --disable-silent-rules --with-hwloc=/usr 
--enable-mpirun-prefix-by-default --with-devel-headers



From: users <users-boun...@lists.open-mpi.org> on behalf of gilles.gouaillar...@gmail.com
Sent: Wednesday, January 9, 2019 15:16
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

Eduardo,

The first part of the configure command line is for an install in /usr, but
then there is '--prefix=/opt/openmpi/4.0.0' and this is very fishy.
You should also use '--with-hwloc=external'.

How many nodes are you running on and which interconnect are you using ?
What if you
mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out

Cheers,

Gilles

On Wednesday, January 9, 2019, ROTHE Eduardo - externe
<eduardo-externe.ro...@edf.fr> wrote:
Hi.

I'm testing Open MPI 4.0.0 and I'm struggling with a weird behaviour. In a very
simple example (very frustrating), I'm having the following error returned by
MPI_Send:

  [gafront4:25692] *** An error occurred in MPI_Send
  [gafront4:25692] *** reported by process [3152019457,0]
  [gafront4:25692] *** on communicator MPI_COMM_WORLD
  [gafront4:25692] *** MPI_ERR_OTHER: known error not in list
  [gafront4:25692] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
  [gafront4:25692] ***and potentially your MPI job)

On the same machine I have two other installations of Open MPI (2.0.2 and
2.1.2), and they both run this dummy program successfully:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int process;
    int population;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &process);
    MPI_Comm_size(MPI_COMM_WORLD, &population);
    printf("Hello World from proccess %d out of %d\n", process, population);

    int send_number = 10;
    int recv_number;

    if (process == 0) {
        MPI_Send(&send_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("This is process 0 reporting::\n");
    } else if (process == 1) {
        MPI_Recv(&recv_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n",
               recv_number);
    }

    MPI_Finalize();
    return 0;
}

I'm really embarrassed to come back to you with this problem. I've been around
it for days now and can't find any good solution. Can you please take a look?
I've enabled FI_LOG_LEVEL=Debug to see if I can trap any information that could
be of use, but unfortunately with no success. I've also googled a lot, but I
don't see what this error message might be pointing at, especially having two
other working versions on the same machine. The thing is that I see no reason
why this code shouldn't run.

The following is the configure command line, as given by ompi_info.

 Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
                         '--includedir=${prefix}/include'
                         '--mandir=${prefix}/share/man'
                         '--infodir=${prefix}/share/info'
                         '--sysconfdir=/etc' '--localstatedir=/var'
                         '--disable-silent-rules'
                         '--libdir=${prefix}/lib/x86_64-linux-gnu'
                         '--libexecdir=${prefix}/lib/x86_64-linux-gnu'
                         '--disable-maintainer-mode'
                         '--disable-dependency-tracking'
                         '--prefix=/opt/openmpi/4.0.0'

Re: [OMPI users] Open MPI 4.0.0 - error with MPI_Send

2019-01-09 Thread Gilles Gouaillardet
Eduardo,

The first part of the configure command line is for an install in /usr, but
then there is '--prefix=/opt/openmpi/4.0.0' and this is very fishy.
You should also use '--with-hwloc=external'.
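
For what it's worth, a minimal configure line along those lines might look like
the sketch below (only a sketch: the exact paths, and whether you want the
system PMIx/libevent, depend on your setup; '--with-psm2' is assumed to be the
usual Open MPI configure option for Omni-Path):

  ./configure --prefix=/opt/openmpi/4.0.0 \
              --with-ofi --with-psm2 --with-verbs=no \
              --with-hwloc=external --with-libevent=external --with-slurm

so that the prefix appears exactly once and hwloc comes from the system.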

How many nodes are you running on and which interconnect are you using ?
What if you
mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./a.out

Cheers,

Gilles

On Wednesday, January 9, 2019, ROTHE Eduardo - externe <
eduardo-externe.ro...@edf.fr> wrote:

> Hi.
>
> I'm testing Open MPI 4.0.0 and I'm struggling with a weird behaviour. In a
> very simple example (very frustrating), I'm having the following error
> returned by MPI_Send:
>
>   [gafront4:25692] *** An error occurred in MPI_Send
>   [gafront4:25692] *** reported by process [3152019457,0]
>   [gafront4:25692] *** on communicator MPI_COMM_WORLD
>   [gafront4:25692] *** MPI_ERR_OTHER: known error not in list
>   [gafront4:25692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
>   [gafront4:25692] ***and potentially your MPI job)
>
> On the same machine I have two other installations of Open MPI (2.0.2 and
> 2.1.2), and they both run this dummy program successfully:
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv) {
>     int process;
>     int population;
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_rank(MPI_COMM_WORLD, &process);
>     MPI_Comm_size(MPI_COMM_WORLD, &population);
>     printf("Hello World from proccess %d out of %d\n", process,
>            population);
>
>     int send_number = 10;
>     int recv_number;
>
>     if (process == 0) {
>         MPI_Send(&send_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>         printf("This is process 0 reporting::\n");
>     } else if (process == 1) {
>         MPI_Recv(&recv_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         printf("Process 1 received number %d from process 0\n",
>                recv_number);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> I'm really embarrassed to come back to you with this problem. I've been
> around it for days now and can't find any good solution. Can you please
> take a look? I've enabled FI_LOG_LEVEL=Debug to see if I can trap any
> information that could be of use, but unfortunately with no success. I've
> also googled a lot, but I don't see what this error message might be
> pointing at, especially having two other working versions on the same
> machine. The thing is that I see no reason why this code shouldn't run.
>
> The following is the configure command line, as given by ompi_info.
>
>
>  Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
>                          '--includedir=${prefix}/include'
>                          '--mandir=${prefix}/share/man'
>                          '--infodir=${prefix}/share/info'
>                          '--sysconfdir=/etc' '--localstatedir=/var'
>                          '--disable-silent-rules'
>                          '--libdir=${prefix}/lib/x86_64-linux-gnu'
>                          '--libexecdir=${prefix}/lib/x86_64-linux-gnu'
>                          '--disable-maintainer-mode'
>                          '--disable-dependency-tracking'
>                          '--prefix=/opt/openmpi/4.0.0'
>                          '--with-pmix=/usr/lib/x86_64-linux-gnu/pmix'
>                          '--with-libevent=external' '--with-slurm'
>                          '--enable-mpi-cxx' '--with-ofi' '--with-verbs=no'
>                          '--disable-silent-rules' '--with-hwloc=/usr'
>                          '--enable-mpirun-prefix-by-default'
>                          '--with-devel-headers'
>
> Thank you for your time.
> Regards,
> Ed
>
>
>