[OMPI users] No network interfaces were found for out-of-band communications.

2018-09-11 Thread Greg Russell
I have a single machine with 96 cores.  It runs CentOS 7 and is not connected to 
any network, as it needs to be isolated for security.


I followed the standard install process, and upon attempting to run ./mpirun I 
get the error message


"No network interfaces were found for out-of-band communications. We require at 
least one available network for out-of-band messaging."


I'm a rookie with Open MPI, so I'm guessing some configuration flags might 
fix the whole problem?  Any ideas are very much appreciated.
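
For what it's worth, one workaround I have seen suggested for machines with no 
external network is to point Open MPI's out-of-band and TCP components at the 
loopback interface. This is only a sketch (the loopback device name "lo", the 
"vader" shared-memory BTL, and the program name are my assumptions), and I have 
no idea whether it is the right approach here:

$ mpirun --mca oob_tcp_if_include lo --mca btl_tcp_if_include lo \
         --mca btl self,vader -np 96 ./my_app

The *_if_include parameters restrict which interfaces those components will 
consider; "self,vader" keeps all on-node traffic in shared memory.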


Thank you,

Russell

Re: [OMPI users] *** Error in `orted': double free or corruption (out): 0x00002aaab4001680 ***, in some node combos.

2018-09-11 Thread Jeff Squyres (jsquyres) via users
Thanks for reporting the issue.

First, you can workaround the issue by using:

mpirun --mca oob tcp ...

This uses a different out-of-band plugin (TCP) instead of verbs unreliable 
datagrams.
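
For example, applied to the failing command quoted below, that would be (just 
the flag above combined with the reporter's own command):

    mpirun --mca oob tcp -host nic114,nic151 hostname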

Second, I just filed a fix for our current release branches (v2.1.x, v3.0.x, 
and v3.1.x):

https://github.com/open-mpi/ompi/issues/5672

Could you try it out and let me know if it works for you?

Thanks!


> On Sep 10, 2018, at 5:36 PM, Balazs HAJGATO  wrote:
> 
> Dear list readers,
> 
> I have some problems with OpenMPI 3.1.1. In some node combos, I got the error 
> (libibverbs: GRH is mandatory For RoCE address handle; *** Error in 
> `/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted':
>  double free or corruption (out): 0x2aaab4001680 ***), see details in 
> file 114_151.out.bz2, even with the simplest run, like
> mpirun -host nic114,nic151 hostname
> In the file 114_151.out.bz2, you can see the output if I run the command from 
> nic114. If I run the same command from nic151, it simply spits out the 
> hostnames, without any errors. 
> 
> I also enclosed the ompi_info --all --parsable outputs from nic114 (nic151 is 
> identical, see ompi.nic114.bz2). I do not have the config.log file, although 
> I still have the config output (see confilg.out.bz2). The nodes have 
> identical operating systems (as we use the same image), and OpenMPI is also 
> loaded from a central directory shared amongst the nodes. We have an 
> infiniband network (with IP over IB) and an ethernet network. Intel MPI works 
> without a problem (and I confirmed that the network is IB when I use 
> Intel MPI). It is not clear whether the orted error is a consequence of the 
> libibverbs error, but it is also not clear why OpenMPI wants to use RoCE at all. 
> (ibv_devinfo is also attached; we do have a somewhat creative infiniband 
> topology, based on fat-tree, but changing the topology did not solve the 
> problem). The /tmp directory is writable, and not full. As a matter of fact, 
> I get the same error in case of OpenMPI 2.0.2 and 2.1.1, and I do not get 
> this error in case of OpenMPI 1.10.2 and 1.10.3. Does anyone have any 
> thoughts on this issue?
> 
> Regards,
> 
> Balazs Hajgato


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?

2018-09-11 Thread Jeff Hammond
Are you trying to run UPC++ over MPI in the cloud?

Jeff

On Tue, Sep 11, 2018 at 10:46 AM, Benjamin Brock wrote:

> Thanks for your response.
>
> One question: why would RoCE still require host processing of every
> packet? I thought the point was that some nice server Ethernet NICs can
> handle RDMA requests directly?  Or am I misunderstanding RoCE/how Open
> MPI's RoCE transport works?
>
> Ben
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [OMPI users] RDMA over Ethernet in Open MPI - RoCE on AWS?

2018-09-11 Thread Benjamin Brock
Thanks for your response.

One question: why would RoCE still require host processing of every
packet? I thought the point was that some nice server Ethernet NICs can
handle RDMA requests directly?  Or am I misunderstanding RoCE/how Open
MPI's RoCE transport works?

Ben
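
P.S. For context, my understanding is that using Open MPI's openib BTL over 
RoCE typically requires selecting the RDMA CM connection manager, along the 
lines of the sketch below (the BTL list and program name are guesses on my 
part, and this doesn't itself answer the host-processing question):

    mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm \
           -np 2 ./my_app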

Re: [OMPI users] stdout/stderr question

2018-09-11 Thread Jeff Squyres (jsquyres) via users
Gilles: Can you submit a PR to fix these 2 places?

Thanks!

> On Sep 11, 2018, at 9:10 AM, emre brookes  wrote:
> 
> Gilles Gouaillardet wrote:
>> It seems I got it wrong :-(
> Ah, you've joined the rest of us :)
>> 
>> Can you please give the attached patch a try ?
>> 
> Working with a git clone of 3.1.x, patch applied
> 
> $ /src/ompi-3.1.x/bin/mpicxx test.cpp
> $ /src/ompi-3.1.x/bin/mpirun a.out > stdout
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
> Process name: [[2667,1],2]
> Exit code:255
> --
> $ cat stdout
> hello from 1
> hello from 2
> hello from 3
> hello from 5
> hello from 0
> hello from 4
> $
> 
> Works correctly for this error message.
> 
> Thanks,
> -Emre
> 
>> 
>> FWIW, another option would be to opal_output(orte_help_output, ...) but we 
>> would have to make orte_help_output "public" first.
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> On 9/11/2018 11:14 AM, emre brookes wrote:
>>> Gilles Gouaillardet wrote:
 I investigated this a bit and found that the (latest?) v3 branches have 
 the expected behavior
 
 (e.g. the error message is sent to stderr)
 
 
 Since it is very unlikely Open MPI 2.1 will ever be updated, I can simply 
 encourage you to upgrade to a newer Open MPI version.
 
 The latest fully supported versions are currently 3.1.2 and 3.0.2.
 
 
 
 Cheers,
 
 Gilles
 
 
>>> So you tested 3.1.2 or something newer with this error?
>>> 
 But the originally reported error still goes to stdout:
 
 $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
 $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
 -- 
 mpirun detected that one or more processes exited with non-zero status, 
 thus causing
 the job to be terminated. The first process to do so was:
 
  Process name: [[22380,1],0]
  Exit code:255
 -- 
 $ cat stdout
 hello from 0
 hello from 1
 ---
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code. Per user-direction, the job has been aborted.
 ---
 $
>>> -Emre
>>> 
>>> 
>>> 
 
 On 9/11/2018 2:27 AM, Ralph H Castain wrote:
> I’m not sure why this would be happening. These error outputs go through 
> the “show_help” functionality, and we specifically target it at stderr:
> 
> /* create an output stream for us */
> OBJ_CONSTRUCT(&lds, opal_output_stream_t);
> lds.lds_want_stderr = true;
> orte_help_output = opal_output_open(&lds);
> 
> Jeff: is it possible the opal_output system is ignoring the request and 
> pushing it to stdout??
> Ralph
> 
> 
>> On Sep 5, 2018, at 4:11 AM, emre brookes  wrote:
>> 
>> Thanks Gilles,
>> 
>> My goal is to separate openmpi errors from the stdout of the MPI program 
>> itself so that errors can be identified externally (in particular in an 
>> external framework running MPI jobs from various developers).
>> 
>> My not so "well written MPI program" was doing this:
>>   MPI_Finalize();
>>   exit( errorcode );
>> Which I assume you are telling me was bad practice & will replace with
>>   MPI_Abort( MPI_COMM_WORLD, errorcode );
>>   MPI_Finalize();
>>   exit( errorcode );
>> I was previously a bit put off MPI_Abort due to the vagueness of the 
>> man page:
>>> _Description_
>>> This routine makes a "best attempt" to abort all tasks in the group of 
>>> comm. This function does not require that the invoking environment take 
>>> any action with the error code. However, a UNIX or POSIX environment 
>>> should handle this as a return errorcode from the main program or an 
>>> abort (errorcode).
>> & I didn't really have an MPI issue to "Abort", but had used this for a 
>> user input or parameter issue.
>> Nevertheless, I accept your best practice recommendation.
>> 
>> It was not only the originally reported message, other messages went to 
>> stdout.
>> Initially used the Ubuntu 16 LTS  "$ apt install openmpi-bin 
>> libopenmpi-dev" which got me version (1.10.2),
>> but this morning 

Re: [OMPI users] stdout/stderr question

2018-09-11 Thread emre brookes

Gilles Gouaillardet wrote:

It seems I got it wrong :-(

Ah, you've joined the rest of us :)


Can you please give the attached patch a try ?


Working with a git clone of 3.1.x, patch applied

$ /src/ompi-3.1.x/bin/mpicxx test.cpp
$ /src/ompi-3.1.x/bin/mpirun a.out > stdout
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun detected that one or more processes exited with non-zero status, 
thus causing

the job to be terminated. The first process to do so was:

Process name: [[2667,1],2]
Exit code:255
--
$ cat stdout
hello from 1
hello from 2
hello from 3
hello from 5
hello from 0
hello from 4
$

Works correctly for this error message.

Thanks,
-Emre



FWIW, another option would be to opal_output(orte_help_output, ...) 
but we would have to make orte_help_output "public" first.



Cheers,


Gilles




On 9/11/2018 11:14 AM, emre brookes wrote:

Gilles Gouaillardet wrote:
I investigated this a bit and found that the (latest?) v3 
branches have the expected behavior


(e.g. the error message is sent to stderr)


Since it is very unlikely Open MPI 2.1 will ever be updated, I can 
simply encourage you to upgrade to a newer Open MPI version.


The latest fully supported versions are currently 3.1.2 and 3.0.2.



Cheers,

Gilles



So you tested 3.1.2 or something newer with this error?


But the originally reported error still goes to stdout:

$ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
-- 

mpirun detected that one or more processes exited with non-zero 
status, thus causing

the job to be terminated. The first process to do so was:

  Process name: [[22380,1],0]
  Exit code:255
-- 


$ cat stdout
hello from 0
hello from 1
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---
$

-Emre





On 9/11/2018 2:27 AM, Ralph H Castain wrote:
I’m not sure why this would be happening. These error outputs go 
through the “show_help” functionality, and we specifically target 
it at stderr:


 /* create an output stream for us */
 OBJ_CONSTRUCT(&lds, opal_output_stream_t);
 lds.lds_want_stderr = true;
 orte_help_output = opal_output_open(&lds);

Jeff: is it possible the opal_output system is ignoring the request 
and pushing it to stdout??

Ralph



On Sep 5, 2018, at 4:11 AM, emre brookes  wrote:

Thanks Gilles,

My goal is to separate openmpi errors from the stdout of the MPI 
program itself so that errors can be identified externally (in 
particular in an external framework running MPI jobs from various 
developers).
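
Concretely, the separation I am after is along the lines of (file names are 
just examples):

$ /src/ompi-3.1.x/bin/mpirun -np 2 a.out > app.stdout 2> ompi.stderr

so that anything Open MPI prints through its help/error machinery ends up in 
ompi.stderr while the program's own output stays in app.stdout.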


My not so "well written MPI program" was doing this:
   MPI_Finalize();
   exit( errorcode );
Which I assume you are telling me was bad practice & will replace 
with

   MPI_Abort( MPI_COMM_WORLD, errorcode );
   MPI_Finalize();
   exit( errorcode );
I was previously a bit put off MPI_Abort due to the vagueness 
of the man page:

_Description_
This routine makes a "best attempt" to abort all tasks in the 
group of comm. This function does not require that the invoking 
environment take any action with the error code. However, a UNIX 
or POSIX environment should handle this as a return errorcode 
from the main program or an abort (errorcode).
& I didn't really have an MPI issue to "Abort", but had used this 
for a user input or parameter issue.

Nevertheless, I accept your best practice recommendation.
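
Put together, a minimal self-contained version of that pattern would be 
something like the following sketch (the input check and the exit code 255 
are only illustrative, not my actual program):

#include <mpi.h>
#include <cstdio>

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    std::printf( "hello from %d\n", rank );

    bool bad_user_input = ( rank == 0 );   // stand-in for a real input check
    if ( bad_user_input ) {
        // MPI_Abort terminates all ranks and normally does not return,
        // so any MPI_Finalize()/exit() after it is only a fallback.
        MPI_Abort( MPI_COMM_WORLD, 255 );
    }

    MPI_Finalize();
    return 0;
}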

It was not only the originally reported message, other messages 
went to stdout.
Initially used the Ubuntu 16 LTS  "$ apt install openmpi-bin 
libopenmpi-dev" which got me version (1.10.2),
but this morning compiled and tested 2.1.5, with the same 
behavior, e.g.:


$ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
[domain-name-embargoed:26078] 1 more process has sent help message 
help-mpi-api.txt / mpi-abort
[domain-name-embargoed:26078] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages

$ cat stdout
hello from 0
hello from 1
-- 


MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI