Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
to be identical to root's.  With that in place I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.
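
For the record, the comparison went roughly like this (a sketch from
memory, not a verbatim transcript):

  # as tsakai
  echo $PATH > /tmp/path.tsakai
  # as root (via sudo su)
  echo $PATH > /tmp/path.root
  diff /tmp/path.tsakai /tmp/path.root
  # then, in tsakai's shell, adopt root's PATH and retry mpirun
  export PATH=$(cat /tmp/path.root)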

I have also checked the permissions on the /tmp directory.  tsakai has
no problem creating files under /tmp.
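
Concretely, something like this (again from memory, not a verbatim
transcript):

  ls -ld /tmp                                   # expect drwxrwxrwt (world-writable, sticky bit)
  touch /tmp/tsakai.test && rm /tmp/tsakai.test # succeeds as tsakai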

I am trying to come up with a strategy to show that each and every
program in the PATH has "world" execute permission.  It is a stone
worth turning over, but I am not holding my breath.
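
What I have in mind is a loop along these lines (untested; it assumes
a colon-separated PATH with no embedded spaces):

  # list any regular file in $PATH that is NOT world-executable
  for dir in $(echo $PATH | tr ':' ' '); do
      [ -d "$dir" ] && find "$dir" -maxdepth 1 -type f ! -perm -001 -print
  done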

> ... you are running out of file descriptors. Are file descriptors
> limited on a per-process basis, perchance?

I have never heard of such a restriction on Amazon EC2.  There
are folks who keep instances running for a long, long time.  Whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has a huge
debt, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
after successfully as root.  Still, I would be happy if you can tell
me a way to see how many file descriptors are in use or remain.
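
Here is what I plan to look at, unless you tell me these are the
wrong knobs (I believe they are the standard ones on Linux):

  ulimit -Sn                  # per-process soft limit on open file descriptors
  ulimit -Hn                  # per-process hard limit
  ls /proc/$$/fd | wc -l      # descriptors currently open in this shell
  cat /proc/sys/fs/file-nr    # system-wide: allocated, free, maximum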

Your mentioning file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing for.  Do you have
some suggestions?

I wish I could reproduce this (weird) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> It is concerning if the pipe system call fails - I can't think of why that
> would happen. That's not usually a permissions issue but rather a deeper
> indication that something is either seriously wrong on your system or you are
> running out of file descriptors. Are file descriptors limited on a per-process
> basis, perchance?
>
> Sent from my PDA. No type good.
>
> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>
>> Hi Tena
>>
>> Since root can but you can't,
>> is it a directory permission problem perhaps?
>> Check the execution directory permission (on both machines,
>> if this is not NFS mounted dir).
>> I am not sure, but IIRR OpenMPI also uses /tmp for
>> under-the-hood stuff, worth checking permissions there also.
>> Just a naive guess.
>>
>> Congrats for all the progress with the cloudy MPI!
>>
>> Gus Correa
>>
>> Tena Sakai wrote:
>>> Hi,
>>> I have made a bit more progress.  I think I can say the ssh authentication
>>> problem is behind me now.  I am still having a problem running
>>> mpirun, but the latest discovery, which I can reproduce, is that
>>> I can run mpirun as root.  Here's the session log:
>>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>>  total 16
>>>  -rw------- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>>  -rw------- 1 tsakai tsakai  102 Feb 11 00:34 config
>>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>  -rw------- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>>  ip-10-100-243-195
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>>  logout
>>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>>  ip-10-195-198-31
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>  --------------------------------------------------------------------------
>>>  mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>  Error: pipe function call failed when setting up I/O forwarding subsystem
>>>  Node: ip-10-195-198-31
>>>  while attempting to start process rank 0.
>>>  --------------------------------------------------------------------------
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # try it as root
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ sudo su
>>>  bash-3.2#
>>>  bash-3.2# pwd
>>>  /home/tsakai
>>>  bash-3.2#
>>>  bash-3.2# ls -l /root/.ssh/config
>>>  -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>  bash-3.2#
>>>  bash-3.2# cat /root/.ssh/config
>>>  Host *
>>>          IdentityFile /root/.ssh/.derobee/.kagi
>>>          IdentitiesOnly yes
>>>          BatchMode yes
>>>  bash-3.2#
>>>  bash-3.2# pwd
>>>  /home/tsakai
>>>  bash-3.2#
>>>  bash-3.2# ls -l
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>  bash-3.2#
>>>  bash-3.2# # now is the time for mpirun
>>>  bash-3.2#
>>>  bash-3.2# mpirun --app ./app.ac
>>>  13 ip-10-100-243-195
>>>  21 ip-10-100-243-195
>>>  5 ip-10-195-198-31
>>>  8 ip-10-195-198-31
>>>  bash-3.2#
>>>  bash-3.2# # It works (being root)!
>>>  bash-3.2#
>>>  bash-3.2# exit
>>>  exit
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>  --------------------------------------------------------------------------
>>>  mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>  Error: pipe function call failed when setting up I/O forwarding subsystem
>>>  Node: ip-10-195-198-31
>>>  while attempting to start process rank 0.
>>>  --------------------------------------------------------------------------
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ exit
>>>  logout
>>>  [tsakai@vixen ec2]$
>>> So, why does it say "pipe function call failed when setting up
>>> I/O forwarding subsystem Node: ip-10-195-198-31" ?
>>> The node it is referring to is not the remote machine.  It is
>>> what I call machine A.  I first thought maybe this is a problem
>>> with the PATH variable.  But I don't think so.  I compared root's
>>> PATH to that of tsakai's and made them identical and retried.
>>> I got the same behavior.
>>> If you could enlighten me why this is happening, I would really
>>> appreciate it.
>>> Thank you.
>>> Tena
>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>> Hi Jeff,
>>>>
>>>> Thanks for the firewall tip.  I tried it while allowing all tcp traffic
>>>> and got an interesting and perplexing result.  Here's what's interesting
>>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):
>>>>
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>   Host key verification failed.
>>>>
>>>> --------------------------------------------------------------------------
>>>>   A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>   to launch so we are aborting.
>>>>
>>>>   There may be more information reported by the environment (see above).
>>>>
>>>>   This may be because the daemon was unable to find all the needed shared
>>>>   libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>   the location of the shared libraries on the remote nodes and this will
>>>>   automatically be forwarded to the remote nodes.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>   that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun: clean termination accomplished
>>>>
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>>> /usr/local/lib
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as well
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>   Warning: Identity file tsakai not accessible: No such file or directory.
>>>>   Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>   [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>   [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>   LD_LIBRARY_PATH=/usr/local/lib
>>>>   [tsakai@ip-10-195-171-159 ~]$
>>>>   [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A
>>>>   [tsakai@ip-10-195-171-159 ~]$ exit
>>>>   logout
>>>>   Connection to ip-10-195-171-159 closed.
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ hostname
>>>>   ip-10-203-21-132
>>>>   [tsakai@ip-10-203-21-132 ~]$ # try mpirun again
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>   Host key verification failed.
>>>>
>>>> --------------------------------------------------------------------------
>>>>   A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>   to launch so we are aborting.
>>>>
>>>>   There may be more information reported by the environment (see above).
>>>>
>>>>   This may be because the daemon was unable to find all the needed shared
>>>>   libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>   the location of the shared libraries on the remote nodes and this will
>>>>   automatically be forwarded to the remote nodes.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>   that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun: clean termination accomplished
>>>>
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>>> /usr/local/lib...
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>   total 16604
>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so ->
>>>> libfuse.so.2.8.5
>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libfuse.so.2 ->
>>>> libfuse.so.2.8.5
>>>>   lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so ->
>>>> libmca_common_sm.so.1.0.0
>>>>   lrwxrwxrwx 1 root root      25 Feb  8 23:06 libmca_common_sm.so.1 ->
>>>> libmca_common_sm.so.1.0.0
>>>>   lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>   lrwxrwxrwx 1 root root      15 Feb  8 23:06 libmpi.so.0 ->
>>>> libmpi.so.0.0.2
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so ->
>>>> libmpi_cxx.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_cxx.so.0 ->
>>>> libmpi_cxx.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so ->
>>>> libmpi_f77.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f77.so.0 ->
>>>> libmpi_f77.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so ->
>>>> libmpi_f90.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      19 Feb  8 23:06 libmpi_f90.so.0 ->
>>>> libmpi_f90.so.0.0.1
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so ->
>>>> libopen-pal.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-pal.so.0 ->
>>>> libopen-pal.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so ->
>>>> libopen-rte.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libopen-rte.so.0 ->
>>>> libopen-rte.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so ->
>>>> libopenmpi_malloc.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      26 Feb  8 23:06 libopenmpi_malloc.so.0 ->
>>>> libopenmpi_malloc.so.0.0.0
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so ->
>>>> libulockmgr.so.1.0.1
>>>>   lrwxrwxrwx 1 root root      20 Feb  8 23:06 libulockmgr.so.1 ->
>>>> libulockmgr.so.1.0.1
>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so ->
>>>> libxml2.so.2.7.2
>>>>   lrwxrwxrwx 1 root root      16 Feb  8 23:06 libxml2.so.2 ->
>>>> libxml2.so.2.7.2
>>>>   -rw-r--r-- 1 root root  385912 Jan 26 01:00 libvt.a
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>   [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>   [tsakai@ip-10-203-21-132 ~]$
>>>>
>>>> Do you know why it's complaining about shared libraries?
>>>>
>>>> Thank you.
>>>>
>>>> Tena
>>>>
>>>>
>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>
>>>>> Your prior mails were about ssh issues, but this one sounds like you might
>>>>> have firewall issues.
>>>>>
>>>>> That is, the "orted" command attempts to open a TCP socket back to
>>>>> mpirun for various command and control reasons.  If it is blocked from
>>>>> doing so by a firewall, Open MPI won't run.  In general, you can either
>>>>> disable your firewall or you can set up a trust relationship for TCP
>>>>> connections within your cluster.
>>>>>
>>>>>
>>>>>
>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> Thanks for suggesting "LogLevel DEBUG3."  I did so and the complete
>>>>>> session is captured in the attached file.
>>>>>>
>>>>>> What I did is very similar to what I have done before: verify
>>>>>> that ssh works and then run the mpirun command.  In my somewhat lengthy
>>>>>> session log, there are two responses from "LogLevel DEBUG3."  First
>>>>>> from an scp invocation and then from the mpirun invocation.  They both
>>>>>> say
>>>>>>   debug1: Authentication succeeded (publickey).
>>>>>>
>>>>>> From mpirun invocation, I see a line:
>>>>>>   debug1: Sending command:  orted --daemonize -mca ess env -mca
>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>>>>   2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>> After that there was hanging and I controlled-C out of it, which
>>>>>> gave me more lines.  But the lines after
>>>>>>   debug1: Sending command:  orted bla bla bla
>>>>>> don't look good to me.  But, in truth, I have no idea what they
>>>>>> mean.
>>>>>>
>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tena
>>>>>>
>>>>>>
>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>
>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>> No, my environment is entirely linux.  The path to my home
>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>> even though it is an nfs mount from vixen (which is known to
>>>>>>>> itself as /home/tsakai).  For historical reasons, I have
>>>>>>>> chosen to make a symbolic link named /Users to vixen's /home,
>>>>>>>> so that I can use a consistent path on both vixen and blitzen.
>>>>>>> okay. Sometimes the protection of the home directory must be adjusted
>>>>>>> too, but as you can do it from the command line this shouldn't be an
>>>>>>> issue.
>>>>>>>
>>>>>>>
>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>> ssh-keys for each user.
>>>>>>>> No, it is not a private cluster.  It is Amazon EC2.  When I
>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>> but to address from one Amazon cluster node to the other I
>>>>>>>> use the nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>> domU-12-31-39-06-74-E2.  Both public and private dns names
>>>>>>>> change from one launch to another.  I am using passphraseless
>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>> Amazon node B back to A.  (Please see my initial post.  There
>>>>>>>> is a session dialogue for this.)  They all work without an
>>>>>>>> authentication dialogue, except a brief initial exchange:
>>>>>>>>  The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>  can't be established.
>>>>>>>>   RSA key fingerprint is
>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>   Are you sure you want to continue connecting (yes/no)?
>>>>>>>> to which I say "yes."
>>>>>>>> But I am unclear about what you mean by "hostbased authentication".
>>>>>>>> Doesn't that mean with password?  If so, it is not an option.
>>>>>>> No. It's convenient inside a private cluster as it won't fill each
>>>>>>> user's known_hosts file and you don't need to create any ssh-keys. But
>>>>>>> when the
>>>>>>> hostname changes every time it might also create new hostkeys. It uses
>>>>>>> hostkeys (private and public), this way it works for all users. Just for
>>>>>>> reference:
>>>>>>>
>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>
>>>>>>> You could look into it later.
>>>>>>>
>>>>>>> ==
>>>>>>>
>>>>>>> - Can you try to use a command when connecting from A to B? E.g. `ssh
>>>>>>> domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>
>>>>>>> - What about putting:
>>>>>>>
>>>>>>> LogLevel DEBUG3
>>>>>>>
>>>>>>> in your ~/.ssh/config. Maybe we can see what he's trying to negotiate
>>>>>>> before
>>>>>>> it fails in verbose mode.
>>>>>>>
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tena
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs?
>>>>>>>>> I saw the /Users/tsakai/... in your output.
>>>>>>>>>
>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>
>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>> I made a config file in my .ssh directory on the cloud.  It looks
>>>>>>>>>> like:
>>>>>>>>>>  # machine A
>>>>>>>>>>  Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>> This is just an abbreviation or nickname above. To use the specified
>>>>>>>>> settings, it's necessary to specify exactly this name. When the
>>>>>>>>> settings are the same anyway for all machines, you can use:
>>>>>>>>>
>>>>>>>>> Host *
>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>  BatchMode yes
>>>>>>>>>
>>>>>>>>> instead.
>>>>>>>>>
>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would
>>>>>>>>> also be an option to use hostbased authentication, which will avoid
>>>>>>>>> setting any known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>>
>>>>>>>>> -- Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  HostName domU-12-31-39-07-35-21
>>>>>>>>>>  BatchMode yes
>>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>  ChallengeResponseAuthentication no
>>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>>
>>>>>>>>>>  # machine B
>>>>>>>>>>  Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>  HostName domU-12-31-39-06-74-E2
>>>>>>>>>>  BatchMode yes
>>>>>>>>>>  IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>  ChallengeResponseAuthentication no
>>>>>>>>>>  IdentitiesOnly yes
>>>>>>>>>>
>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>
>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>  [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>
>>>>>>>>>> It hangs.  I control-C out of it and I get:
>>>>>>>>>>  mpirun: killing job...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>  mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>  process that caused that situation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>  mpirun was unable to cleanly terminate the daemons on the nodes
>>>>>>>>>>  shown below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>  the "orte-clean" tool for assistance.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>      domU-12-31-39-07-35-21.compute-1.internal - daemon did not
>>>>>>>>>>      report back when launched
>>>>>>>>>>
>>>>>>>>>> Am I making progress?
>>>>>>>>>>
>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>> problem?
>>>>>>>>>> Does someone have an example .ssh/config file I can look at?  There
>>>>>>>>>> are so many keyword-argument pairs for this config file and I would
>>>>>>>>>> like to look at some very basic one that works.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Tena Sakai
>>>>>>>>>> tsa...@gallo.ucsf.edu
>>>>>>>>>>
>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>  [tsakai@vixen local]$ cat app.ac1
>>>>>>>>>>>  -H vixen.egcrc.org   -np 1 Rscript
>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>  -H vixen.egcrc.org   -np 1 Rscript
>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>  -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>  -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>
>>>>>>>>>>> The program I run is
>>>>>>>>>>>  Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>> where x is [5..8].  The machines vixen and blitzen each do 2 runs.
>>>>>>>>>>>
>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>  [ tsakai@vixen local]$ cat fib.R
>>>>>>>>>>>      # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>      # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>      # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>
>>>>>>>>>>>  fib <- function( n ) {
>>>>>>>>>>>          a <- 0
>>>>>>>>>>>          b <- 1
>>>>>>>>>>>          for ( i in 1:n ) {
>>>>>>>>>>>               t <- b
>>>>>>>>>>>               b <- a
>>>>>>>>>>>               a <- a + t
>>>>>>>>>>>          }
>>>>>>>>>>>      a
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>  arg <- commandArgs( TRUE )
>>>>>>>>>>>  myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>  cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>
>>>>>>>>>>> It reads an argument from the command line and produces the
>>>>>>>>>>> fibonacci number that corresponds to that index, followed by the
>>>>>>>>>>> machine name.  Pretty simple stuff.
>>>>>>>>>>>
>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>  [tsakai@vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>  5 vixen.egcrc.org
>>>>>>>>>>>  8 vixen.egcrc.org
>>>>>>>>>>>  13 blitzen.egcrc.org
>>>>>>>>>>>  21 blitzen.egcrc.org
>>>>>>>>>>>
>>>>>>>>>>> Which is exactly what I expect.  So far so good.
>>>>>>>>>>>
>>>>>>>>>>> Now I want to run the same thing on the cloud.  I launch 2
>>>>>>>>>>> instances of the same virtual machine, which I get to by:
>>>>>>>>>>>  [tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai
>>>>>>>>>>> machine-instance-A-public-dns
>>>>>>>>>>>
>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>>> without
>>>>>>>>>>> password authentication,
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>  Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>  domU-12-31-39-0C-C8-01
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine
>>>>>>>>>>> A
>>>>>>>>>>> without using password
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>  The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>>> can't
>>>>>>>>>>> be established.
>>>>>>>>>>>  RSA key fingerprint is
>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>  Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>  Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>> list
>>>>>>>>>>> of
>>>>>>>>>>> known hosts.
>>>>>>>>>>>  Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>  logout
>>>>>>>>>>>  Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>  logout
>>>>>>>>>>>  Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>  domU-12-31-39-00-D1-F2
>>>>>>>>>>>
>>>>>>>>>>> As you can see, neither machine uses a password for authentication;
>>>>>>>>>>> they use public/private key pairs.  There is no problem (that I can
>>>>>>>>>>> see) with ssh invocation from one machine to the other.  This is so
>>>>>>>>>>> because I have a copy of the public key and a copy of the private
>>>>>>>>>>> key on each instance.
>>>>>>>>>>>
>>>>>>>>>>> The app.ac file is identical, except for the node names:
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>  -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>  -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>  -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>  -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>
>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>  tsakai@domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>  Permission denied, please try again.
>>>>>>>>>>>  tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>  mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>  process that caused that situation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>  mpirun: clean termination accomplished
>>>>>>>>>>>
>>>>>>>>>>>  [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>
>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>
>>>>>>>>>>> Here's my question:
>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>> password?
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Tena Sakai
>>>>>>>>>>> tsa...@gallo.ucsf.edu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> <session4Reuti.text>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>>
>>>>
>>
>

