Hi Jeff, hi Gus,

Thanks for your replies.
I have pretty much ruled out PATH issues by setting tsakai's PATH to be
identical to root's. In that setting I reproduced the same result as
before: root can run mpirun correctly and tsakai cannot.

I have also checked the permissions on the /tmp directory. tsakai has no
problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every program
in the PATH has "world" execute permission. It is a stone to turn over,
but I am not holding my breath.
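Something along these lines should turn that stone over in one pass (a
rough sketch, assuming GNU find and the usual colon-separated PATH;
untested):

    # list regular files on the PATH that lack the world-execute bit
    echo "$PATH" | tr ':' '\n' | while read dir; do
        find "$dir" -maxdepth 1 -type f ! -perm -001 2>/dev/null
    done

Anything it prints is a candidate; silence would mean every program on the
PATH is world-executable.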
> ... you are running out of file descriptors. Are file descriptors
> limited on a per-process basis, perchance?

I have never heard of such a restriction on Amazon EC2. There are folks
who keep instances running for a long, long time, whereas in my case I
launch 2 instances, check things out, and then turn the instances off.
(Given that the state of California has a huge debt, our funding is very
tight.) So I really doubt that's the case. Besides, I have run mpirun
unsuccessfully as user tsakai and, immediately afterwards, successfully
as root.

Still, I would be happy if you could tell me a way to find out how many
file descriptors are in use or how many remain.
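(For reference, on a stock Linux image the per-process limit and the
system-wide count both live in obvious places; a sketch, assuming /proc is
mounted:)

    ulimit -n                   # per-process open-file limit for this shell
    cat /proc/sys/fs/file-nr    # system-wide: allocated, unused, maximum
    ls /proc/$$/fd | wc -l      # descriptors currently open in this shell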
Your mentioning file descriptors made me think of something under /dev,
but I don't know exactly what I am fishing for. Do you have some
suggestions?

I wish I could reproduce this (weird) behavior on a different set of
machines. I certainly cannot in my local environment. Sigh!

Regards,
Tena

On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> It is concerning if the pipe system call fails - I can't think of why
> that would happen. That's not usually a permissions issue but rather a
> deeper indication that something is either seriously wrong on your
> system or you are running out of file descriptors. Are file descriptors
> limited on a per-process basis, perchance?
>
> Sent from my PDA. No type good.
>
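(One quick way to test Jeff's hypothesis is to compare the per-process
limits that root and tsakai actually receive; a sketch, assuming sudo
works as it does later in this thread:)

    ulimit -n                   # as tsakai
    sudo sh -c 'ulimit -n'      # as root, via a fresh shell
    ulimit -a                   # the full list, in case some other limit differs

If the two numbers differ, /etc/security/limits.conf or the image's PAM
configuration is the usual suspect.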
> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>
>> Hi Tena
>>
>> Since root can but you can't,
>> is it a directory permission problem, perhaps?
>> Check the execution directory's permissions (on both machines,
>> if this is not an NFS-mounted dir).
>> I am not sure, but IIRR OpenMPI also uses /tmp for
>> under-the-hood stuff, so it is worth checking permissions there also.
>> Just a naive guess.
>>
>> Congrats on all the progress with the cloudy MPI!
>>
>> Gus Correa
>>
>> Tena Sakai wrote:
>>> Hi,
>>>
>>> I have made a bit more progress. I think I can say the ssh
>>> authentication problem is behind me now. I am still having a problem
>>> running mpirun, but the latest discovery, which I can reproduce, is
>>> that I can run mpirun as root. Here's the session log:
>>>
>>> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ ll
>>> total 8
>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>> total 16
>>> -rw------- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>> -rw------- 1 tsakai tsakai  102 Feb 11 00:34 config
>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>> -rw------- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>> [tsakai@ip-10-100-243-195 ~]$
>>> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>> [tsakai@ip-10-100-243-195 ~]$ hostname
>>> ip-10-100-243-195
>>> [tsakai@ip-10-100-243-195 ~]$
>>> [tsakai@ip-10-100-243-195 ~]$ ll
>>> total 8
>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>> [tsakai@ip-10-100-243-195 ~]$
>>> [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>> [tsakai@ip-10-100-243-195 ~]$
>>> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>> [tsakai@ip-10-100-243-195 ~]$ exit
>>> logout
>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ hostname
>>> ip-10-195-198-31
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>> --------------------------------------------------------------------------
>>> mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>
>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>> Node: ip-10-195-198-31
>>>
>>> while attempting to start process rank 0.
>>> --------------------------------------------------------------------------
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ # try it as root
>>> [tsakai@ip-10-195-198-31 ~]$ sudo su
>>> bash-3.2#
>>> bash-3.2# pwd
>>> /home/tsakai
>>> bash-3.2#
>>> bash-3.2# ls -l /root/.ssh/config
>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>> bash-3.2#
>>> bash-3.2# cat /root/.ssh/config
>>> Host *
>>>         IdentityFile /root/.ssh/.derobee/.kagi
>>>         IdentitiesOnly yes
>>>         BatchMode yes
>>> bash-3.2#
>>> bash-3.2# ls -l
>>> total 8
>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>> bash-3.2#
>>> bash-3.2# # now is the time for mpirun
>>> bash-3.2# mpirun --app ./app.ac
>>> 13 ip-10-100-243-195
>>> 21 ip-10-100-243-195
>>> 5 ip-10-195-198-31
>>> 8 ip-10-195-198-31
>>> bash-3.2#
>>> bash-3.2# # It works (being root)!
>>> bash-3.2# exit
>>> exit
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
>>> --------------------------------------------------------------------------
>>> mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>
>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>> Node: ip-10-195-198-31
>>>
>>> while attempting to start process rank 0.
>>> --------------------------------------------------------------------------
>>> [tsakai@ip-10-195-198-31 ~]$
>>> [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
>>> [tsakai@ip-10-195-198-31 ~]$ exit
>>> logout
>>> [tsakai@vixen ec2]$
>>>
>>> So, why does it say "pipe function call failed when setting up
>>> I/O forwarding subsystem, Node: ip-10-195-198-31"?
>>> The node it is referring to is not the remote machine; it is
>>> what I call machine A. I first thought maybe this was a problem
>>> with the PATH variable, but I don't think so. I compared root's
>>> path to tsakai's, made them identical, and retried.
>>> I got the same behavior.
>>>
>>> If you could enlighten me as to why this is happening, I would really
>>> appreciate it.
>>>
>>> Thank you.
>>> Tena
>>>
>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> Thanks for the firewall tip. I tried it while allowing all TCP traffic
>>>> and got an interesting and perplexing result. Here's what's interesting
>>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):
>>>>
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>> Host key verification failed.
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to /usr/local/lib
>>>> [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ # I'd better do this on machine B as well
>>>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>> Warning: Identity file tsakai not accessible: No such file or directory.
>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>> [tsakai@ip-10-195-171-159 ~]$
>>>> [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>> [tsakai@ip-10-195-171-159 ~]$
>>>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>> [tsakai@ip-10-195-171-159 ~]$
>>>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>>> [tsakai@ip-10-195-171-159 ~]$ exit
>>>> logout
>>>> Connection to ip-10-195-171-159 closed.
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ hostname
>>>> ip-10-203-21-132
>>>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again
>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>> Host key verification failed.
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> [tsakai@ip-10-203-21-132 ~]$
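("Host key verification failed" here means the non-interactive ssh that
mpirun spawns is hitting the yes/no host-key prompt that an interactive
session would answer. Since EC2 hostnames change on every launch, one
common workaround is a ~/.ssh/config stanza like the following — a sketch,
with the obvious caveat that it disables host-key checking entirely:)

    Host *
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null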
>>>> [tsakai@ip-10-203-21-132 ~]$ # I thought the openmpi library was in /usr/local/lib...
>>>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>> total 16604
>>>> lrwxrwxrwx 1 root root     16 Feb  8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>> lrwxrwxrwx 1 root root     16 Feb  8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>> lrwxrwxrwx 1 root root     25 Feb  8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>> lrwxrwxrwx 1 root root     25 Feb  8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>> lrwxrwxrwx 1 root root     15 Feb  8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>> lrwxrwxrwx 1 root root     15 Feb  8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>> lrwxrwxrwx 1 root root     19 Feb  8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>> lrwxrwxrwx 1 root root     26 Feb  8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>> lrwxrwxrwx 1 root root     26 Feb  8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>> lrwxrwxrwx 1 root root     20 Feb  8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>> lrwxrwxrwx 1 root root     16 Feb  8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>> lrwxrwxrwx 1 root root     16 Feb  8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>> [tsakai@ip-10-203-21-132 ~]$
>>>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>
>>>> Do you know why it's complaining about shared libraries?
>>>>
>>>> Thank you.
>>>>
>>>> Tena
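(Two things are worth checking at this point. Whether orted can resolve
its libraries at all can be seen with ldd; and an export typed into an
interactive shell never reaches the non-interactive shell that ssh starts
for orted, whereas Open MPI's -x flag forwards a variable explicitly. A
sketch, assuming orted is on the PATH and a shared-library build of Open
MPI in /usr/local/lib:)

    ldd $(which orted)                        # any "not found" line is the culprit
    mpirun -x LD_LIBRARY_PATH --app app.ac2   # forward the variable to the remote nodes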
>>>>
>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>>
>>>>> Your prior mails were about ssh issues, but this one sounds like you might
>>>>> have firewall issues.
>>>>>
>>>>> That is, the "orted" command attempts to open a TCP socket back to mpirun
>>>>> for various command and control reasons. If it is blocked from doing so by
>>>>> a firewall, Open MPI won't run. In general, you can either disable your
>>>>> firewall or you can set up a trust relationship for TCP connections within
>>>>> your cluster.
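(On EC2 the "firewall" is the security group. With the 2011-era EC2 API
tools, opening all TCP traffic between instances in the same group looks
roughly like this — a sketch; the group name "default" and the account id
are placeholders:)

    ec2-authorize default -P tcp -p 0-65535 -o default -u <your-aws-account-id>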
>>>>>
>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>
>>>>>> Hi Reuti,
>>>>>>
>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so, and the complete
>>>>>> session is captured in the attached file.
>>>>>>
>>>>>> What I did is much the same as what I have done before: verify that
>>>>>> ssh works and then run the mpirun command. In my somewhat lengthy
>>>>>> session log, there are two responses from "LogLevel DEBUG3": first
>>>>>> from an scp invocation and then from the mpirun invocation. They both
>>>>>> say
>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>
>>>>>> From the mpirun invocation, I see a line:
>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>> After that there was a hang, and I control-C'ed out of it, which gave
>>>>>> me more lines. But the lines after
>>>>>> debug1: Sending command: orted bla bla bla
>>>>>> don't look good to me. In truth, I have no idea what they mean.
>>>>>>
>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>
>>>>>> Regards,
>>>>>> Tena
>>>>>>
>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>
>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>> although it is an NFS mount from vixen (which is known to
>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>> chosen to give a symbolic link named /Users to vixen's /home,
>>>>>>>> so that I can use a consistent path on both vixen and blitzen.
>>>>>>> Okay. Sometimes the protection of the home directory must be adjusted
>>>>>>> too, but as you can do it from the command line this shouldn't be an
>>>>>>> issue.
>>>>>>>
>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>> ssh-keys for each user.
>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>> but to address one Amazon cluster node from the other I
>>>>>>>> use the nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names
>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>>> is a session dialogue for this.) They all work without authen-
>>>>>>>> tication dialogue, except a brief initial dialogue:
>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>> can't be established.
>>>>>>>> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>> to which I say "yes."
>>>>>>>> But I am unclear on what you mean by "hostbased authentication".
>>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>>> No. It's convenient inside a private cluster, as it won't fill each
>>>>>>> user's known_hosts file and you don't need to create any ssh-keys. But
>>>>>>> when the hostname changes every time, it might also create new
>>>>>>> hostkeys. It uses hostkeys (private and public), and this way it works
>>>>>>> for all users. Just for reference:
>>>>>>>
>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>
>>>>>>> You could look into it later.
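(For reference, hostbased authentication is configured in the hosts' ssh
files rather than in per-user ones; a minimal sketch along the lines of
the howto above, with paths assuming stock OpenSSH:)

    # on each client, in /etc/ssh/ssh_config:
    HostbasedAuthentication yes
    EnableSSHKeysign yes

    # on each server, in /etc/ssh/sshd_config:
    HostbasedAuthentication yes

    # on each server: list trusted client hostnames in /etc/ssh/shosts.equiv
    # and their public host keys in /etc/ssh/ssh_known_hosts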
>>>>>>>
>>>>>>> ==
>>>>>>>
>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>> `ssh domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>
>>>>>>> - What about putting:
>>>>>>>
>>>>>>> LogLevel DEBUG3
>>>>>>>
>>>>>>> in your ~/.ssh/config? Maybe we can see what it's trying to negotiate
>>>>>>> before it fails in verbose mode.
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tena
>>>>>>>>
>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs?
>>>>>>>>> I saw the /Users/tsakai/... in your output.
>>>>>>>>>
>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>
>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>> like:
>>>>>>>>>> # machine A
>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>> This is just an abbreviation or nickname above. To use the specified
>>>>>>>>> settings, it's necessary to specify exactly this name. When the
>>>>>>>>> settings are the same anyway for all machines, you can use:
>>>>>>>>>
>>>>>>>>> Host *
>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>> IdentitiesOnly yes
>>>>>>>>> BatchMode yes
>>>>>>>>>
>>>>>>>>> instead.
>>>>>>>>>
>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would
>>>>>>>>> also be an option to use hostbased authentication, which will avoid
>>>>>>>>> setting any known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>>
>>>>>>>>> -- Reuti
>>>>>>>>>
>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>> BatchMode yes
>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>
>>>>>>>>>> # machine B
>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>> BatchMode yes
>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>
>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>
>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>
>>>>>>>>>> it hangs. I control-C out of it and I get:
>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>> that caused that situation.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
>>>>>>>>>> back when launched
>>>>>>>>>>
>>>>>>>>>> Am I making progress?
>>>>>>>>>>
>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>> problem? Does someone have an example .ssh/config file I can look at?
>>>>>>>>>> There are so many keyword-argument pairs for this config file, and I
>>>>>>>>>> would like to look at some very basic one that works.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Tena Sakai
>>>>>>>>>> tsa...@gallo.ucsf.edu
>>>>>>>>>>
>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>> [tsakai@vixen local]$ cat app.ac1
>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>
>>>>>>>>>>> The program I run is
>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 runs.
>>>>>>>>>>>
>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>> [tsakai@vixen local]$ cat fib.R
>>>>>>>>>>> # fib() computes, given index n, the fibonacci number iteratively
>>>>>>>>>>> # here's the first dozen of the sequence (indexed from 0..11)
>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>
>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>     a <- 0
>>>>>>>>>>>     b <- 1
>>>>>>>>>>>     for ( i in 1:n ) {
>>>>>>>>>>>         t <- b
>>>>>>>>>>>         b <- a
>>>>>>>>>>>         a <- a + t
>>>>>>>>>>>     }
>>>>>>>>>>>     a
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>
>>>>>>>>>>> It reads an argument from the command line and produces the
>>>>>>>>>>> fibonacci number that corresponds to that index, followed by the
>>>>>>>>>>> machine name. Pretty simple stuff.
>>>>>>>>>>>
>>>>>>>>>>> Here's the run output:
>>>>>>>>>>> [tsakai@vixen local]$ mpirun -app app.ac1
>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>
>>>>>>>>>>> which is exactly what I expect. So far so good.
>>>>>>>>>>>
>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2 instances
>>>>>>>>>>> of the same virtual machine, which I get to by:
>>>>>>>>>>> [tsakai@vixen local]$ ssh -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>
>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without
>>>>>>>>>>> password authentication,
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use a public/private key
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A
>>>>>>>>>>> without using a password
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't
>>>>>>>>>>> be established.
>>>>>>>>>>> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list
>>>>>>>>>>> of known hosts.
>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>
>>>>>>>>>>> As you can see, neither machine uses a password for authentication;
>>>>>>>>>>> they use public/private key pairs. There is no problem (that I can
>>>>>>>>>>> see) with ssh invocation from one machine to the other. This is so
>>>>>>>>>>> because I have a copy of the public key and a copy of the private
>>>>>>>>>>> key on each instance.
>>>>>>>>>>>
>>>>>>>>>>> The app.ac file is identical, except for the node names:
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>
>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>> that caused that situation.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>
>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>
>>>>>>>>>>> mpirun (or somebody else?) asks me for a password, which I don't
>>>>>>>>>>> have. I end up typing control-C.
>>>>>>>>>>>
>>>>>>>>>>> Here's my question:
>>>>>>>>>>> how can I get past authentication by mpirun when there is no password?
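(mpirun launches orted over a plain, non-interactive "ssh <host>", so the
-i flag used in the interactive tests never comes into play. The key has
to be picked up automatically, either from an IdentityFile entry in
~/.ssh/config — the route eventually taken in this thread — or by handing
Open MPI the ssh command to use. A sketch of the latter, assuming a
1.3/1.4-era Open MPI where the MCA parameter is named plm_rsh_agent:)

    mpirun --mca plm_rsh_agent "ssh -i /home/tsakai/.ssh/tsakai" -app app.ac1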
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Tena Sakai
>>>>>>>>>>> tsa...@gallo.ucsf.edu
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users