Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread George Bosilca
Sorry, I missed the 4.0 on the PR (despite it being the first thing in the title).

  George.


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Ralph H Castain
That’s why we are leaving it in master - only removing it from the release branch.

Sent from my iPhone


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread George Bosilca
Why not simply ompi_ignore it? Removing a component to bring it back later
would force us to lose all its history. I would rather add an .ompi_ignore
and give power users an opportunity to continue playing with it.

  George.
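
(For context: an .ompi_ignore file works roughly as follows - a minimal
sketch, and the exact behavior may vary by Open MPI version. autogen.pl
skips any component directory that contains a .ompi_ignore file, while a
.ompi_unignore file listing usernames lets those users build the component
anyway.)

  # Keep btl/ofi in the tree (preserving its history) but drop it from
  # the default build:
  $ touch opal/mca/btl/ofi/.ompi_ignore

  # A power user opts back in by adding their username and re-running
  # autogen before configure/make:
  $ echo "$USER" >> opal/mca/btl/ofi/.ompi_unignore
  $ ./autogen.pl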


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Ralph H Castain
I already suggested the configure option, but it doesn’t solve the problem. I 
wouldn’t be terribly surprised to find that Cray also has an undetected problem 
given the nature of the issue - just a question of the amount of testing, 
variety of environments, etc.

Nobody has to wait for the next major release, though that isn’t so far off 
anyway - there has never been an issue with bringing in a new component during 
a release series.

Let’s just fix this the right way and bring it into 4.1 or 4.2. We may want to 
look at fixing the osc/rdma/ofi bandaid as well while we are at it.

Ralph
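
(The configure option mentioned here is presumably the standard
--enable-mca-no-build mechanism, which excludes named components at
configure time; a minimal sketch:)

  # Build without the OFI BTL while leaving its source in the tree:
  $ ./configure --enable-mca-no-build=btl-ofi ...
  $ make all install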


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Patinyasakdikul, Thananon
I understand and agree with your point. My initial email was just out of
curiosity.

Howard tested this BTL for Cray in the summer as well. So this seems to
affect only OPA hardware.

I just remember that in the summer I had to make some changes in libpsm2
to get this BTL to work for OPA. Maybe this is the problem, as the default
libpsm2 won't work.

So maybe we can fix this in the configure step: detect the libpsm2 version
and don't build if we are not satisfied (a sketch follows below).

Another idea is to not build this BTL by default, so users with Cray
hardware can still use it if they want (just rebuild with the BTL). We
just need to verify that it still works on Cray. This way, OFI
stakeholders do not have to wait until the next major release to get this
in.


Arm
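
(A sketch of the configure-time probe suggested above. It assumes psm2.h
exposes the PSM2_VERNO_MAJOR/PSM2_VERNO_MINOR version macros; the 11.2
cutoff is a placeholder, not a vetted minimum:)

$ cat > conftest.c <<'EOF'
#include <psm2.h>
#if (PSM2_VERNO_MAJOR < 11) || \
    (PSM2_VERNO_MAJOR == 11 && PSM2_VERNO_MINOR < 2)
#error "libpsm2 too old for btl/ofi"
#endif
int main(void) { return 0; }
EOF
$ cc conftest.c -o conftest \
    && echo "libpsm2 looks OK - build btl/ofi" \
    || echo "libpsm2 too old (or missing) - skip btl/ofi"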


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Ralph H Castain
I suspect it is a question of what you tested and in which scenarios. The
problem is that it can bite someone, and there isn’t a clean/obvious
solution that doesn’t require the user to do something - e.g., having to
know that they need to disable a BTL. Matias has proposed an MCA-based
approach (sketched below), but I would much rather we just fix this
correctly. Band-aids have a habit of becoming permanently forgotten - until
someone pulls on one and things unravel.
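
(The MCA-based band-aid amounts to disabling the component at runtime
rather than fixing the conflict. For reference, the standard ways a user
can exclude a BTL - a sketch:)

  # On the mpirun command line:
  $ mpirun --mca btl ^ofi ./a.out
  # Via the environment:
  $ export OMPI_MCA_btl=^ofi
  # Or as a per-user default:
  $ echo "btl = ^ofi" >> ~/.openmpi/mca-params.conf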


Re: [OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Patinyasakdikul, Thananon
In the summer, I tested this BTL along with the MTL and was able to use
both of them interchangeably with no problem. I don't know what changed.
libpsm2?


Arm
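
(For reference, the two OFI paths are selected through different PMLs:
ob1 drives BTLs and cm drives MTLs. A sketch of forcing each path
explicitly:)

  # OFI through the BTL (ob1 PML):
  $ mpirun --mca pml ob1 --mca btl self,vader,ofi ./a.out
  # OFI through the MTL (cm PML):
  $ mpirun --mca pml cm --mca mtl ofi ./a.out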


[OMPI devel] OFI issues on Open MPI v4.0.0rc1

2018-09-20 Thread Ralph H Castain
We have too many discussion threads overlapping on the same email chain - so 
let’s break the discussion on the OFI problem into its own chain.

We have been investigating this locally and found a number of conflicts
where the MTLs and the OFI/BTL step on each other. The correct solution is
to move endpoint creation/reporting into the opal/mca/common area, but
that is going to take some work and will likely impact release schedules.

Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix the 
problem in master, and then consider bringing it back as a package to v4.1 or 
v4.2.

Comments? If we agree, I’ll file a PR to remove it.
Ralph


> Begin forwarded message:
> 
> From: Peter Kjellström 
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
> Date: September 20, 2018 at 5:18:35 AM PDT
> To: "Gabriel, Edgar" 
> Cc: Open MPI Developers 
> Reply-To: Open MPI Developers 
> 
> On Wed, 19 Sep 2018 16:24:53 +
> "Gabriel, Edgar"  wrote:
> 
>> I performed some tests on our Omnipath cluster, and I have a mixed
>> bag of results with 4.0.0rc1
> 
> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
> very similar results.
> 
>> compute-1-1.local.4351PSM2 has not been initialized
>> compute-1-0.local.3826PSM2 has not been initialized
> 
> yup I too see these.
> 
>> mpirun detected that one or more processes exited with non-zero
>> status, thus causing the job to be terminated. The first process to
>> do so was:
>> 
>>  Process name: [[38418,1],1]
>>  Exit code:255
>>  
>> 
> 
> yup.
> 
>> 
>> 2.   The ofi mtl does not work at all on our Omnipath cluster. If
>> I try to force it using ‘mpirun -mca mtl ofi …’ I get the following
>> error message.
> 
> Yes ofi seems broken. But not even disabling it helps me completely (I
> see "mca_btl_ofi.so   [.] mca_btl_ofi_component_progress" in my
> perf top...)
> 
>> 3.   The openib btl component is always getting in the way with
>> annoying warnings. It is not really used, but constantly complains:
> ...
>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>> help-mpi-btl-openib.txt / ib port not selected
> 
> Yup.
> 
> ...
>> So bottom line, if I do
>> 
>> mpirun -mca btl ^openib -mca mtl ^ofi ….
>> 
>> my tests finish correctly, although mpirun will still return an error.
> 
> I get some things to work with this approach (two ranks on two nodes
> for example). But a lot of things crash rather hard:
> 
> $ mpirun -mca btl ^openib -mca mtl
> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
> --
> PSM2 was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
> 
>  Error: Failure in initializing endpoint
> --
> n909.279895hfi_userinit: assign_context command failed: Device or
> resource busy n909.279895psmi_context_open: hfi_userinit: failed,
> trying again (1/3)
> ...
>  PML add procs failed
>  --> Returned "Error" (-1) instead of "Success" (0)
> --
> [n908:298761] *** An error occurred in MPI_Init
> [n908:298761] *** reported by process [4092002305,59]
> [n908:298761] *** on a NULL communicator
> [n908:298761] *** Unknown error
> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>  will now abort, [n908:298761] ***and potentially your MPI job)
> [n907:407748] 255 more processes have sent help message
>  help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
>  parameter "orte_base_help_aggregate" to 0 to see all help / error
>  messages [n907:407748] 127 more processes have sent help message
>  help-mpi-runtime.txt / mpi_init:startup:internal-failure
>  [n907:407748] 56 more processes have sent help message
>  help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> 
> If I disable psm2 too I get it to run (apparently on vader?)
> 
> /Peter K
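
(Spelled out, "disable psm2 too" presumably means excluding the psm2 MTL
as well, letting the job fall back to the vader shared-memory BTL on-node
and tcp between nodes - a sketch:)

$ mpirun -mca btl ^openib -mca mtl ^ofi,psm2 \
    ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1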

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel