Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-15 Thread Jeff Squyres (jsquyres)
On May 15, 2018, at 1:39 AM, Max Mellette  wrote:
> 
> Thanks everyone for all your assistance. The problem seems to be resolved 
> now, although I'm not entirely sure why these changes made a difference. 
> There were two things I changed:
> 
> (1) I had some additional `export ...` lines in .bashrc before the `export 
> PATH=...` and `export LD_LIBRARY_PATH=...` lines. When I removed those lines 
> (and then later added them back in below the PATH and LD_LIBRARY_PATH lines) 
> mpirun worked. But only b09-30 was able to execute code on b09-32 and not the 
> other way around.

It depends on what those "export ..." lines were, and whether you moved them 
below the point where non-interactive shells exit your .bashrc.
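
For reference, here is a minimal sketch of the layout in question, assuming 
the stock Ubuntu-style interactivity guard (the guard shown is hypothetical 
for your particular .bashrc):

  # ~/.bashrc -- exports needed by non-interactive shells (e.g. the shell
  # that `ssh host orted` runs in) must come BEFORE the early return below
  export PATH=/home/user/openmpi_install/bin:$PATH
  export LD_LIBRARY_PATH=/home/user/openmpi_install/lib

  # the stock Ubuntu .bashrc returns here for non-interactive shells;
  # anything below this point is invisible to `ssh host command`
  case $- in
      *i*) ;;
        *) return;;
  esac

  # interactive-only settings (aliases, prompt, etc.) go below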

> (2) I passed IP addresses to mpirun instead of the hostnames (this didn't 
> work previously), and now mpirun works in both directions (b09-30 -> b09-32 
> and b09-32 -> b09-30). I added a 3rd host in the rack and mpirun still works 
> when passing IP addresses. For some reason using the host name doesn't work 
> despite the fact that I can use it to ssh.

FWIW, that *shouldn't* matter.  Gus pointed out that you can use /etc/hosts, 
but Open MPI should fully be able to use names instead of IP addresses.

If you're having problems with this, it makes me think that there may still be 
something weird in your environment, but hey, if you're ok using IP addresses 
and that's working -- might be good enough.  :-)
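
If you do want to dig further, a quick sanity check with standard tools is to 
compare what the resolver returns on each host (a sketch; nothing Open 
MPI-specific):

  getent hosts b09-32                  # run on b09-30
  ssh b09-32 'getent hosts b09-30'     # and check the reverse direction

A stale or inconsistent entry here would explain names failing where raw IP 
addresses work.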

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-15 Thread Gustavo Correa
Hi Max

Name resolution in /etc/hosts is a simple solution for (2).
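
For example, something along these lines on every node (the addresses below 
are illustrative placeholders; use your real ones):

  # /etc/hosts
  10.1.100.30   b09-30
  10.1.100.32   b09-32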

I hope this helps,
Gus

> On May 15, 2018, at 01:39, Max Mellette  wrote:
> 
> Thanks everyone for all your assistance. The problem seems to be resolved 
> now, although I'm not entirely sure why these changes made a difference. 
> There were two things I changed:
> 
> (1) I had some additional `export ...` lines in .bashrc before the `export 
> PATH=...` and `export LD_LIBRARY_PATH=...` lines. When I removed those lines 
> (and then later added them back in below the PATH and LD_LIBRARY_PATH lines) 
> mpirun worked. But only b09-30 was able to execute code on b09-32 and not the 
> other way around.
> 
> (2) I passed IP addresses to mpirun instead of the hostnames (this didn't 
> work previously), and now mpirun works in both directions (b09-30 -> b09-32 
> and b09-32 -> b09-30). I added a 3rd host in the rack and mpirun still works 
> when passing IP addresses. For some reason using the host name doesn't work 
> despite the fact that I can use it to ssh.
> 
> Also FWIW I wasn't using a debugger.
> 
> Thanks again,
> Max
> 
> 
> On Mon, May 14, 2018 at 4:39 PM, Gilles Gouaillardet wrote:
> In the initial report, the /usr/bin/ssh process was in the 'T' state
> (it generally hints that the process is being traced by a debugger)
> 
> /usr/bin/ssh -x b09-32 orted
> 
> did behave as expected (i.e., orted was executed, exited with an error since 
> the command line is invalid, and an error message was received)
> 
> 
> can you try to run
> 
> /home/user/openmpi_install/bin/mpirun --host b09-30,b09-32 hostname
> 
> and see how things go? (Since you simply 'ssh orted', another orted might 
> be used.)
> 
> If you are still facing the same hang with ssh in the 'T' state, can you 
> check the logs on b09-32 and see
> if the sshd server was even contacted? I can hardly make sense of this 
> error, FWIW.
> 
> 
> Cheers,
> 
> Gilles
> 
> On 5/15/2018 5:27 AM, r...@open-mpi.org wrote:
> You got that error because the orted is looking for its rank on the cmd line 
> and not finding it.
> 
> 
> On May 14, 2018, at 12:37 PM, Max Mellette <wmell...@ucsd.edu> wrote:
> 
> Hi Gus,
> 
> Thanks for the suggestions. The correct version of OpenMPI seems to be 
> getting picked up; I also prepended the installation path in .bashrc like 
> you suggested, but it didn't seem to help:
> 
> user@b09-30:~$ cat .bashrc
> export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
> export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
> user@b09-30:~$ which mpicc
> /home/user/openmpi_install/bin/mpicc
> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ess_env_module.c at line 147
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> util/session_dir.c at line 106
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> util/session_dir.c at line 345
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> base/ess_base_std_orted.c at line 270
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_session_dir failed
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> 
> Thanks,
> Max
> 
> 
> On Mon, May 14, 2018 at 11:41 AM, Gus Correa wrote:
> 
> Hi Max
> 
> Just in case, since environment mix-ups often happen.
> Could it be that you are inadvertently picking up another
> installation of OpenMPI, perhaps installed from packages
> in /usr or /usr/local?
> That's easy to check with 'which mpiexec' or
> 'which mpicc', for instance.
> 
> Have you tried to prepend (as opposed to append) OpenMPI
> to your PATH? Say:
> 
> export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'
> 
> I hope this helps,
> Gus Correa
> 
> 
> 
> 
> 
> 
> 

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Max Mellette
Thanks everyone for all your assistance. The problem seems to be resolved
now, although I'm not entirely sure why these changes made a difference.
There were two things I changed:

(1) I had some additional `export ...` lines in .bashrc before the `export
PATH=...` and `export LD_LIBRARY_PATH=...` lines. When I removed those
lines (and then later added them back in below the PATH and LD_LIBRARY_PATH
lines) mpirun worked. But only b09-30 was able to execute code on b09-32
and not the other way around.

(2) I passed IP addresses to mpirun instead of the hostnames (this didn't
work previously), and now mpirun works in both directions (b09-30 -> b09-32
and b09-32 -> b09-30). I added a 3rd host in the rack and mpirun still
works when passing IP addresses. For some reason using the host name
doesn't work despite the fact that I can use it to ssh.

Also FWIW I wasn't using a debugger.

Thanks again,
Max


On Mon, May 14, 2018 at 4:39 PM, Gilles Gouaillardet 
wrote:

> In the initial report, the /usr/bin/ssh process was in the 'T' state
> (it generally hints that the process is being traced by a debugger)
>
> /usr/bin/ssh -x b09-32 orted
>
> did behave as expected (i.e., orted was executed, exited with an error
> since the command line is invalid, and an error message was received)
>
>
> can you try to run
>
> /home/user/openmpi_install/bin/mpirun --host b09-30,b09-32 hostname
>
> and see how things go? (Since you simply 'ssh orted', another orted
> might be used.)
>
> If you are still facing the same hang with ssh in the 'T' state, can you
> check the logs on b09-32 and see
> if the sshd server was even contacted? I can hardly make sense of this
> error, FWIW.
>
>
> Cheers,
>
> Gilles
>
> On 5/15/2018 5:27 AM, r...@open-mpi.org wrote:
>
>> You got that error because the orted is looking for its rank on the cmd
>> line and not finding it.
>>
>>
>> On May 14, 2018, at 12:37 PM, Max Mellette <wmell...@ucsd.edu> wrote:
>>>
>>> Hi Gus,
>>>
>>> Thanks for the suggestions. The correct version of OpenMPI seems to be
>>> getting picked up; I also prepended the installation path in .bashrc like
>>> you suggested, but it didn't seem to help:
>>>
>>> user@b09-30:~$ cat .bashrc
>>> export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
>>> export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
>>> user@b09-30:~$ which mpicc
>>> /home/user/openmpi_install/bin/mpicc
>>> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
>>> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> ess_env_module.c at line 147
>>> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in
>>> file util/session_dir.c at line 106
>>> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in
>>> file util/session_dir.c at line 345
>>> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in
>>> file base/ess_base_std_orted.c at line 270
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems.  This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>>   orte_session_dir failed
>>>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>>
>>> Thanks,
>>> Max
>>>
>>>
>>> On Mon, May 14, 2018 at 11:41 AM, Gus Correa wrote:
>>> Hi Max
>>>
>>> Just in case, since environment mix-ups often happen.
>>> Could it be that you are inadvertently picking up another
>>> installation of OpenMPI, perhaps installed from packages
>>> in /usr or /usr/local?
>>> That's easy to check with 'which mpiexec' or
>>> 'which mpicc', for instance.
>>>
>>> Have you tried to prepend (as opposed to append) OpenMPI
>>> to your PATH? Say:
>>>
>>> export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
>>>
>>>
>>
>>
>>
>>
>

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Gilles Gouaillardet

In the initial report, the /usr/bin/ssh process was in the 'T' state
(it generally hints that the process is being traced by a debugger)

/usr/bin/ssh -x b09-32 orted

did behave as expected (i.e., orted was executed, exited with an error 
since the command line is invalid, and an error message was received)



can you try to run

/home/user/openmpi_install/bin/mpirun --host b09-30,b09-32 hostname

and see how things go? (Since you simply 'ssh orted', another orted 
might be used.)


If you are still facing the same hang with ssh in the 'T' state, can you 
check the logs on b09-32 and see
if the sshd server was even contacted? I can hardly make sense of this 
error, FWIW.
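
To check that, something like this on b09-32 should do (a sketch; log 
locations vary by distro):

  sudo grep sshd /var/log/auth.log | tail        # Debian/Ubuntu syslog
  sudo journalctl -u ssh --since "1 hour ago"    # systemd (unit may be 'sshd')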



Cheers,

Gilles

On 5/15/2018 5:27 AM, r...@open-mpi.org wrote:
You got that error because the orted is looking for its rank on the 
cmd line and not finding it.



On May 14, 2018, at 12:37 PM, Max Mellette <wmell...@ucsd.edu> wrote:


Hi Gus,

Thanks for the suggestions. The correct version of OpenMPI seems to 
be getting picked up; I also prepended the installation path in .bashrc 
like you suggested, but it didn't seem to help:


user@b09-30:~$ cat .bashrc
export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
user@b09-30:~$ which mpicc
/home/user/openmpi_install/bin/mpicc
user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_env_module.c at line 147
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in 
file util/session_dir.c at line 106
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in 
file util/session_dir.c at line 345
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in 
file base/ess_base_std_orted.c at line 270

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Thanks,
Max


On Mon, May 14, 2018 at 11:41 AM, Gus Correa wrote:


Hi Max

Just in case, since environment mix-ups often happen.
Could it be that you are inadvertently picking up another
installation of OpenMPI, perhaps installed from packages
in /usr or /usr/local?
That's easy to check with 'which mpiexec' or
'which mpicc', for instance.

Have you tried to prepend (as opposed to append) OpenMPI
to your PATH? Say:

export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'

I hope this helps,
Gus Correa










Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Jeff Squyres (jsquyres)
Yes, that "T" state is quite puzzling.  You didn't attach a debugger or hit the 
ssh with a signal, did you?

(We had a similar situation on the devel list recently, but it only happened 
with a very old version of Slurm.  We concluded that it was a Slurm bug that 
has since been fixed.  And just to be sure, I double-checked: the srun 
that hangs in that case is *not* in the "T" state -- it's in the "S" state, 
which appears to be a normal state.)
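
If it happens again, a couple of standard checks can pin down why the ssh is 
stopped (a sketch; <pid> is a placeholder for whatever PID ps reports):

  ps -o pid,stat,wchan,cmd -C ssh      # 'T' = stopped or being traced
  grep -i '^State' /proc/<pid>/status  # distinguishes "stopped" from "tracing stop"
  kill -CONT <pid>                     # a SIGSTOP'd process would resume here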


> On May 12, 2018, at 4:56 AM, Gilles Gouaillardet wrote:
> 
> Max,
> 
> the 'T' state of the ssh process is very puzzling.
> 
> can you try to run
> /usr/bin/ssh -x b09-32 orted
> on b09-30 and see what happens?
> (it should fail with an error message, instead of hanging)
> 
> In order to check there is no firewall, can you run instead:
> iptables -L
> Also, is 'selinux' enabled? There could be some rules that prevent
> 'ssh' from working as expected.
> 
> 
> Cheers,
> 
> Gilles
> 
> On Sat, May 12, 2018 at 7:38 AM, Max Mellette  wrote:
>> Hi Jeff,
>> 
>> Thanks for the reply. FYI since I originally posted this, I uninstalled
>> OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
>> problem.
>> 
>> When I run the command without the `--mca plm_base_verbose 100` flag, it
>> hangs indefinitely with no output.
>> 
>> As far as I can tell, these are the additional processes running on each
>> machine while mpirun is hanging (printed using `ps -aux | less`):
>> 
>> On executing host b09-30:
>> 
>> user 361714  0.4  0.0 293016  8444 pts/0Sl+  15:10   0:00 mpirun
>> --host b09-30,b09-32 hostname
>> user 361719  0.0  0.0  37092  5112 pts/0T15:10   0:00
>> /usr/bin/ssh -x b09-32  orted -mca ess "env" -mca ess_base_jobid "638517248"
>> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
>> "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
>> pmix "^s1,s2,cray,isolated"
>> 
>> On remote host b09-32:
>> 
>> root 175273  0.0  0.0  61752  5824 ?Ss   15:10   0:00 sshd:
>> [accepted]
>> sshd 175274  0.0  0.0  61752   708 ?S15:10   0:00 sshd:
>> [net]
>> 
>> I only see orted showing up in the ssh flags on b09-30. Any ideas what I
>> should try next?
>> 
>> Thanks,
>> Max
>> 
>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread r...@open-mpi.org
You got that error because the orted is looking for its rank on the cmd line 
and not finding it.
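
For comparison, here is the bare test launch versus the full command line that 
mpirun actually generates (the latter copied from the ps output earlier in 
this thread); the extra -mca arguments are what carry the job and rank 
information:

  # bare launch -- expected to fail, since no job/rank info is passed:
  ssh b09-32 orted

  # what mpirun runs under the hood:
  /usr/bin/ssh -x b09-32 orted -mca ess "env" -mca ess_base_jobid "638517248" \
      -mca ess_base_vpid 1 -mca ess_base_num_procs "2" \
      -mca orte_node_regex "b[2:9]-30,b[2:9]-32@0(2)" \
      -mca orte_hnp_uri "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" \
      -mca plm "rsh" -mca pmix "^s1,s2,cray,isolated"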


> On May 14, 2018, at 12:37 PM, Max Mellette  wrote:
> 
> Hi Gus,
> 
> Thanks for the suggestions. The correct version of OpenMPI seems to be 
> getting picked up; I also prepended the installation path in .bashrc like 
> you suggested, but it didn't seem to help:
> 
> user@b09-30:~$ cat .bashrc
> export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
> export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
> user@b09-30:~$ which mpicc
> /home/user/openmpi_install/bin/mpicc
> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ess_env_module.c at line 147
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> util/session_dir.c at line 106
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> util/session_dir.c at line 345
> [b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> base/ess_base_std_orted.c at line 270
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_session_dir failed
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> 
> Thanks,
> Max
> 
> 
> On Mon, May 14, 2018 at 11:41 AM, Gus Correa wrote:
> Hi Max
> 
> Just in case, since environment mix-ups often happen.
> Could it be that you are inadvertently picking up another
> installation of OpenMPI, perhaps installed from packages
> in /usr or /usr/local?
> That's easy to check with 'which mpiexec' or
> 'which mpicc', for instance.
> 
> Have you tried to prepend (as opposed to append) OpenMPI
> to your PATH? Say:
> 
> export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'
> 
> I hope this helps,
> Gus Correa
> 


Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Max Mellette
Hi Gus,

Thanks for the suggestions. The correct version of OpenMPI seems to be
getting picked up; I also prepended the installation path in .bashrc like
you suggested, but it didn't seem to help:

user@b09-30:~$ cat .bashrc
export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib
user@b09-30:~$ which mpicc
/home/user/openmpi_install/bin/mpicc
user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 147
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 106
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 345
[b09-32:204536] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 270
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Thanks,
Max


On Mon, May 14, 2018 at 11:41 AM, Gus Correa  wrote:

> Hi Max
>
> Just in case, since environment mix-ups often happen.
> Could it be that you are inadvertently picking up another
> installation of OpenMPI, perhaps installed from packages
> in /usr or /usr/local?
> That's easy to check with 'which mpiexec' or
> 'which mpicc', for instance.
>
> Have you tried to prepend (as opposed to append) OpenMPI
> to your PATH? Say:
>
> export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'
>
> I hope this helps,
> Gus Correa
>

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Gus Correa

Hi Max

Just in case, since environment mix-ups often happen.
Could it be that you are inadvertently picking up another
installation of OpenMPI, perhaps installed from packages
in /usr or /usr/local?
That's easy to check with 'which mpiexec' or
'which mpicc', for instance.

Have you tried to prepend (as opposed to append) OpenMPI
to your PATH? Say:

export PATH='/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'
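
An equivalent form that avoids hard-coding the distro defaults (a sketch; the 
${...:+...} expansion just skips the trailing colon when LD_LIBRARY_PATH is 
unset):

  export PATH=/home/user/openmpi_install/bin:$PATH
  export LD_LIBRARY_PATH=/home/user/openmpi_install/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}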


I hope this helps,
Gus Correa

On 05/14/2018 12:40 PM, Max Mellette wrote:

John,

Thanks for the suggestions. In this case there is no cluster manager / 
job scheduler; these are just a couple of individual hosts in a rack. 
The reason for the generic names is that I anonymized the full network 
address in the previous posts, truncating to just the host name.


My home directory is network-mounted to both hosts. In fact, I 
uninstalled OpenMPI 3.0.1 from /usr/local on both hosts, and installed 
OpenMPI 3.1.0 into my home directory at `/home/user/openmpi_install`, 
also updating .bashrc appropriately:


user@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/openmpi_install/bin

export LD_LIBRARY_PATH=/home/user/openmpi_install/lib

So the environment should be the same on both hosts.

Thanks,
Max

On Mon, May 14, 2018 at 12:29 AM, John Hearns via users 
<users@lists.open-mpi.org> wrote:


One very, very stupid question here. This arose over on the Slurm
list actually.
Those hostnames look like quite generic names, i.e. they are part of
an HPC cluster?
Do they happen to have independent home directories for your userid?
Could that possibly make a difference to the MPI launcher?

On 14 May 2018 at 06:44, Max Mellette <wmell...@ucsd.edu> wrote:

Hi Gilles,

Thanks for the suggestions; the results are below. Any ideas
where to go from here?

- Seems that selinux is not installed:

user@b09-30:~$ sestatus
The program 'sestatus' is currently not installed. You can
install it by typing:
sudo apt install policycoreutils

- Output from orted:

user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ess_env_module.c at line 147
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad
parameter in file util/session_dir.c at line 106
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad
parameter in file util/session_dir.c at line 345
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad
parameter in file base/ess_base_std_orted.c at line 270

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel
process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal
failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

   orte_session_dir failed
   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

--------------------------------------------------------------------------

- iptables rules:

user@b09-30:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ufw-before-logging-input  all  --  anywhere             anywhere
ufw-before-input  all  --  anywhere             anywhere
ufw-after-input  all  --  anywhere             anywhere
ufw-after-logging-input  all  --  anywhere             anywhere
ufw-reject-input  all  --  anywhere             anywhere
ufw-track-input  all  --  anywhere             anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ufw-before-logging-forward  all  --  anywhere             anywhere
ufw-before-forward  all  --  anywhere             anywhere
ufw-after-forward  all  --  anywhere             anywhere
ufw-after-logging-forward  all  --  anywhere             anywhere
ufw-reject-forward  all  --  anywhere             anywhere
ufw-track-forward  all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ufw-before-logging-output  all  --  anywhere             anywhere
ufw-before-output  all  --  anywhere             anywhere
ufw-after-output  all  --  anywhere             anywhere
ufw-after-logging-output  all  

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread Max Mellette
John,

Thanks for the suggestions. In this case there is no cluster manager / job
scheduler; these are just a couple of individual hosts in a rack. The
reason for the generic names is that I anonymized the full network address
in the previous posts, truncating to just the host name.

My home directory is network-mounted to both hosts. In fact, I uninstalled
OpenMPI 3.0.1 from /usr/local on both hosts, and installed OpenMPI 3.1.0
into my home directory at `/home/user/openmpi_install`, also updating
.bashrc appropriately:

user@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/openmpi_install/bin
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib

So the environment should be the same on both hosts.

Thanks,
Max

On Mon, May 14, 2018 at 12:29 AM, John Hearns via users 
<users@lists.open-mpi.org> wrote:

> One very, very stupid question here. This arose over on the Slurm list
> actually.
> Those hostnames look like quite generic names, i.e. they are part of an HPC
> cluster?
> Do they happen to have independent home directories for your userid?
> Could that possibly make a difference to the MPI launcher?
>
> On 14 May 2018 at 06:44, Max Mellette  wrote:
>
>> Hi Gilles,
>>
>> Thanks for the suggestions; the results are below. Any ideas where to go
>> from here?
>>
>> - Seems that selinux is not installed:
>>
>> user@b09-30:~$ sestatus
>> The program 'sestatus' is currently not installed. You can install it by
>> typing:
>> sudo apt install policycoreutils
>>
>> - Output from orted:
>>
>> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
>> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> ess_env_module.c at line 147
>> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
>> util/session_dir.c at line 106
>> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
>> util/session_dir.c at line 345
>> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
>> base/ess_base_std_orted.c at line 270
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_session_dir failed
>>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>>
>> - iptables rules:
>>
>> user@b09-30:~$ sudo iptables -L
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>> ufw-before-logging-input  all  --  anywhere anywhere
>> ufw-before-input  all  --  anywhere anywhere
>> ufw-after-input  all  --  anywhere anywhere
>> ufw-after-logging-input  all  --  anywhere anywhere
>> ufw-reject-input  all  --  anywhere anywhere
>> ufw-track-input  all  --  anywhere anywhere
>>
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>> ufw-before-logging-forward  all  --  anywhere anywhere
>> ufw-before-forward  all  --  anywhere anywhere
>> ufw-after-forward  all  --  anywhere anywhere
>> ufw-after-logging-forward  all  --  anywhere anywhere
>> ufw-reject-forward  all  --  anywhere anywhere
>> ufw-track-forward  all  --  anywhere anywhere
>>
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>> ufw-before-logging-output  all  --  anywhere anywhere
>> ufw-before-output  all  --  anywhere anywhere
>> ufw-after-output  all  --  anywhere anywhere
>> ufw-after-logging-output  all  --  anywhere anywhere
>> ufw-reject-output  all  --  anywhere anywhere
>> ufw-track-output  all  --  anywhere anywhere
>>
>> Chain ufw-after-forward (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-after-input (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-after-logging-forward (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-after-logging-input (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-after-logging-output (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-after-output (1 references)
>> target prot opt source   destination
>>
>> Chain ufw-before-forward (1 references)
>> target prot opt source   destination
>>
>> Chain 

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-14 Thread John Hearns via users
One very, very stupid question here. This arose over on the Slurm list
actually.
Those hostnames look like quite generic names, i.e. they are part of an HPC
cluster?
Do they happen to have independent home directories for your userid?
Could that possibly make a difference to the MPI launcher?

On 14 May 2018 at 06:44, Max Mellette  wrote:

> Hi Gilles,
>
> Thanks for the suggestions; the results are below. Any ideas where to go
> from here?
>
> - Seems that selinux is not installed:
>
> user@b09-30:~$ sestatus
> The program 'sestatus' is currently not installed. You can install it by
> typing:
> sudo apt install policycoreutils
>
> - Output from orted:
>
> user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ess_env_module.c at line 147
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> util/session_dir.c at line 106
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> util/session_dir.c at line 345
> [b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 270
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_session_dir failed
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> - iptables rules:
>
> user@b09-30:~$ sudo iptables -L
> Chain INPUT (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-input  all  --  anywhere anywhere
> ufw-before-input  all  --  anywhere anywhere
> ufw-after-input  all  --  anywhere anywhere
> ufw-after-logging-input  all  --  anywhere anywhere
> ufw-reject-input  all  --  anywhere anywhere
> ufw-track-input  all  --  anywhere anywhere
>
> Chain FORWARD (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-forward  all  --  anywhere anywhere
> ufw-before-forward  all  --  anywhere anywhere
> ufw-after-forward  all  --  anywhere anywhere
> ufw-after-logging-forward  all  --  anywhere anywhere
> ufw-reject-forward  all  --  anywhere anywhere
> ufw-track-forward  all  --  anywhere anywhere
>
> Chain OUTPUT (policy ACCEPT)
> target prot opt source   destination
> ufw-before-logging-output  all  --  anywhere anywhere
> ufw-before-output  all  --  anywhere anywhere
> ufw-after-output  all  --  anywhere anywhere
> ufw-after-logging-output  all  --  anywhere anywhere
> ufw-reject-output  all  --  anywhere anywhere
> ufw-track-output  all  --  anywhere anywhere
>
> Chain ufw-after-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-after-input (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-input (1 references)
> target prot opt source   destination
>
> Chain ufw-after-logging-output (1 references)
> target prot opt source   destination
>
> Chain ufw-after-output (1 references)
> target prot opt source   destination
>
> Chain ufw-before-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-before-input (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-input (1 references)
> target prot opt source   destination
>
> Chain ufw-before-logging-output (1 references)
> target prot opt source   destination
>
> Chain ufw-before-output (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-input (1 references)
> target prot opt source   destination
>
> Chain ufw-reject-output (1 references)
> target prot opt source   destination
>
> Chain ufw-track-forward (1 references)
> target prot opt source   destination
>
> Chain ufw-track-input (1 references)
> target prot opt source   destination
>
> Chain ufw-track-output (1 references)
> 

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-13 Thread Max Mellette
Hi Gilles,

Thanks for the suggestions; the results are below. Any ideas where to go
from here?

- Seems that selinux is not installed:

user@b09-30:~$ sestatus
The program 'sestatus' is currently not installed. You can install it by
typing:
sudo apt install policycoreutils

- Output from orted:

user@b09-30:~$ /usr/bin/ssh -x b09-32 orted
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 147
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 106
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
util/session_dir.c at line 345
[b09-32:197698] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 270
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

- iptables rules:

user@b09-30:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination
ufw-before-logging-input  all  --  anywhere anywhere
ufw-before-input  all  --  anywhere anywhere
ufw-after-input  all  --  anywhere anywhere
ufw-after-logging-input  all  --  anywhere anywhere
ufw-reject-input  all  --  anywhere anywhere
ufw-track-input  all  --  anywhere anywhere

Chain FORWARD (policy ACCEPT)
target prot opt source   destination
ufw-before-logging-forward  all  --  anywhere anywhere
ufw-before-forward  all  --  anywhere anywhere
ufw-after-forward  all  --  anywhere anywhere
ufw-after-logging-forward  all  --  anywhere anywhere
ufw-reject-forward  all  --  anywhere anywhere
ufw-track-forward  all  --  anywhere anywhere

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
ufw-before-logging-output  all  --  anywhere anywhere
ufw-before-output  all  --  anywhere anywhere
ufw-after-output  all  --  anywhere anywhere
ufw-after-logging-output  all  --  anywhere anywhere
ufw-reject-output  all  --  anywhere anywhere
ufw-track-output  all  --  anywhere anywhere

Chain ufw-after-forward (1 references)
target prot opt source   destination

Chain ufw-after-input (1 references)
target prot opt source   destination

Chain ufw-after-logging-forward (1 references)
target prot opt source   destination

Chain ufw-after-logging-input (1 references)
target prot opt source   destination

Chain ufw-after-logging-output (1 references)
target prot opt source   destination

Chain ufw-after-output (1 references)
target prot opt source   destination

Chain ufw-before-forward (1 references)
target prot opt source   destination

Chain ufw-before-input (1 references)
target prot opt source   destination

Chain ufw-before-logging-forward (1 references)
target prot opt source   destination

Chain ufw-before-logging-input (1 references)
target prot opt source   destination

Chain ufw-before-logging-output (1 references)
target prot opt source   destination

Chain ufw-before-output (1 references)
target prot opt source   destination

Chain ufw-reject-forward (1 references)
target prot opt source   destination

Chain ufw-reject-input (1 references)
target prot opt source   destination

Chain ufw-reject-output (1 references)
target prot opt source   destination

Chain ufw-track-forward (1 references)
target prot opt source   destination

Chain ufw-track-input (1 references)
target prot opt source   destination

Chain ufw-track-output (1 references)
target prot opt source   destination


Thanks,
Max

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-12 Thread Gilles Gouaillardet
Max,

the 'T' state of the ssh process is very puzzling.

can you try to run
/usr/bin/ssh -x b09-32 orted
on b09-30 and see what happens?
(it should fail with an error message, instead of hanging)

In order to check there is no firewall, can you run instead:
iptables -L
Also, is 'selinux' enabled? There could be some rules that prevent
'ssh' from working as expected.
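
Beyond 'iptables -L', a direct test that non-ssh TCP passes between the hosts 
can help, since orted has to call back to mpirun on a dynamic port. A sketch 
(55090 is just the port visible in the orte_hnp_uri elsewhere in this thread; 
netcat option syntax varies between flavors):

  nc -l 55090          # on b09-30
  nc -vz b09-30 55090  # on b09-32; should report the port open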


Cheers,

Gilles

On Sat, May 12, 2018 at 7:38 AM, Max Mellette  wrote:
> Hi Jeff,
>
> Thanks for the reply. FYI since I originally posted this, I uninstalled
> OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
> problem.
>
> When I run the command without the `--mca plm_base_verbose 100` flag, it
> hangs indefinitely with no output.
>
> As far as I can tell, these are the additional processes running on each
> machine while mpirun is hanging (printed using `ps -aux | less`):
>
> On executing host b09-30:
>
> user 361714  0.4  0.0 293016  8444 pts/0Sl+  15:10   0:00 mpirun
> --host b09-30,b09-32 hostname
> user 361719  0.0  0.0  37092  5112 pts/0T15:10   0:00
> /usr/bin/ssh -x b09-32  orted -mca ess "env" -mca ess_base_jobid "638517248"
> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
> "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
> pmix "^s1,s2,cray,isolated"
>
> On remote host b09-32:
>
> root 175273  0.0  0.0  61752  5824 ?Ss   15:10   0:00 sshd:
> [accepted]
> sshd 175274  0.0  0.0  61752   708 ?S15:10   0:00 sshd:
> [net]
>
> I only see orted showing up in the ssh flags on b09-30. Any ideas what I
> should try next?
>
> Thanks,
> Max
>
>
>


Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-11 Thread Max Mellette
Hi Jeff,

Thanks for the reply. FYI since I originally posted this, I uninstalled
OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
problem.

When I run the command without the `--mca plm_base_verbose 100` flag, it
hangs indefinitely with no output.

As far as I can tell, these are the additional processes running on each
machine while mpirun is hanging (printed using `ps -aux | less`):

On executing host b09-30:

user 361714  0.4  0.0 293016  8444 pts/0Sl+  15:10   0:00 mpirun
--host b09-30,b09-32 hostname
user 361719  0.0  0.0  37092  5112 pts/0T15:10   0:00
/usr/bin/ssh -x b09-32  orted -mca ess "env" -mca ess_base_jobid
"638517248" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_node_regex "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
"638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
pmix "^s1,s2,cray,isolated"

On remote host b09-32:

root 175273  0.0  0.0  61752  5824 ?Ss   15:10   0:00 sshd:
[accepted]
sshd 175274  0.0  0.0  61752   708 ?S15:10   0:00 sshd:
[net]

I only see orted showing up in the ssh flags on b09-30. Any ideas what I
should try next?

Thanks,
Max

Re: [OMPI users] OpenMPI 3.0.1 - mpirun hangs with 2 hosts

2018-05-11 Thread Jeff Squyres (jsquyres)
On May 4, 2018, at 1:08 PM, Max Mellette  wrote:
> 
> I'm trying to set up OpenMPI 3.0.1 on a pair of linux machines, but I'm 
> running into a problem where mpirun hangs when I try to execute a simple 
> command across the two machines:
> 
> $ mpirun --host b09-30,b09-32 hostname

Do you see the output from the 2 `hostname` commands when this runs?  Or does 
it just hang with no output?

> Here's some terminal output, including running the command above with --mca 
> plm_base_verbose 100  set:
> 
> user@b09-30:~$ sudo ufw status
> Status: inactive
> user@b09-30:~$ cat .bashrc
> export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
> export LD_LIBRARY_PATH=/usr/local/lib
> user@b09-30:~$ ssh b09-32 hostname
> b09-32
> user@b09-30:~$ mpirun --host b09-30 hostname
> b09-30
> user@b09-30:~$ mpirun --host b09-30,b09-32 --mca plm_base_verbose 100 hostname

I'm interested to see if you get the output from "hostname" when you don't use 
`--mca plm_base_verbose 100`.

Also, when this hangs, what is left running on b09-30 and b09-32?  Is it just 
mpirun?  Or are there any orted processes, too?
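
A quick way to capture that is to run something like this on each host while 
the launch is hung (a sketch using procps pgrep; -a prints the full command 
line of each match):

  pgrep -a -f 'mpirun|orted'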

-- 
Jeff Squyres
jsquy...@cisco.com
