It is concerning if the pipe system call fails - I can't think of an obvious reason why that would happen. That's not usually a permissions issue; it is more often a sign that something is seriously wrong on your system or that you are running out of file descriptors. Are file descriptors limited on a per-process basis, perchance?
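For what it's worth, pipe(2) normally fails with EMFILE when the calling process has hit its open-file-descriptor limit (or ENFILE when the system-wide table is exhausted), so checking the per-process limit on the node where mpirun runs seems like the first thing to try. A minimal check, assuming a bash shell on that node (the commands are standard Linux tools; nothing here is specific to Open MPI):

  ulimit -Sn                            # soft limit on open files for this shell
  ulimit -Hn                            # hard limit the soft limit can be raised to
  grep 'open files' /proc/self/limits   # the same numbers as reported by the kernel
  ls /proc/$$/fd | wc -l                # descriptors already open in this shell

If the soft limit turns out to be very small for your user but not for root, raising it (e.g. "ulimit -n 4096" before invoking mpirun, or via /etc/security/limits.conf) would be a plausible fix; root succeeding while your user fails would be consistent with a per-user limit.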
Sent from my PDA. No type good. On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: > Hi Tena > > Since root can but you can't, > is is a directory permission problem perhaps? > Check the execution directory permission (on both machines, > if this is not NFS mounted dir). > I am not sure, but IIRR OpenMPI also uses /tmp for > under-the-hood stuff, worth checking permissions there also. > Just a naive guess. > > Congrats for all the progress with the cloudy MPI! > > Gus Correa > > Tena Sakai wrote: >> Hi, >> I have made a bit more progress. I think I can say ssh authenti- >> cation problem is behind me now. I am still having a problem running >> mpirun, but the latest discovery, which I can reproduce, is that >> I can run mpirun as root. Here's the session log: >> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com >> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195 >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ ll >> total 8 >> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ ll .ssh >> total 16 >> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys >> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config >> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts >> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal >> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31 >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B >> [tsakai@ip-10-100-243-195 ~]$ hostname >> ip-10-100-243-195 >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ ll >> total 8 >> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac >> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ cat app.ac >> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7 >> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8 >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A >> [tsakai@ip-10-100-243-195 ~]$ >> [tsakai@ip-10-100-243-195 ~]$ exit >> logout >> Connection to ip-10-100-243-195.ec2.internal closed. >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ hostname >> ip-10-195-198-31 >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac >> -------------------------------------------------------------------------- >> mpirun was unable to launch the specified application as it encountered an >> error: >> Error: pipe function call failed when setting up I/O forwarding subsystem >> Node: ip-10-195-198-31 >> while attempting to start process rank 0. 
>> -------------------------------------------------------------------------- >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ # try it as root >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ sudo su >> bash-3.2# >> bash-3.2# pwd >> /home/tsakai >> bash-3.2# >> bash-3.2# ls -l /root/.ssh/config >> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config >> bash-3.2# >> bash-3.2# cat /root/.ssh/config >> Host * >> IdentityFile /root/.ssh/.derobee/.kagi >> IdentitiesOnly yes >> BatchMode yes >> bash-3.2# >> bash-3.2# pwd >> /home/tsakai >> bash-3.2# >> bash-3.2# ls -l >> total 8 >> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >> bash-3.2# >> bash-3.2# # now is the time for mpirun >> bash-3.2# >> bash-3.2# mpirun --app ./app.ac >> 13 ip-10-100-243-195 >> 21 ip-10-100-243-195 >> 5 ip-10-195-198-31 >> 8 ip-10-195-198-31 >> bash-3.2# >> bash-3.2# # It works (being root)! >> bash-3.2# >> bash-3.2# exit >> exit >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac >> -------------------------------------------------------------------------- >> mpirun was unable to launch the specified application as it encountered an >> error: >> Error: pipe function call failed when setting up I/O forwarding subsystem >> Node: ip-10-195-198-31 >> while attempting to start process rank 0. >> -------------------------------------------------------------------------- >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ # I don't get it. >> [tsakai@ip-10-195-198-31 ~]$ >> [tsakai@ip-10-195-198-31 ~]$ exit >> logout >> [tsakai@vixen ec2]$ >> So, why does it say "pipe function call failed when setting up >> I/O forwarding subsystem Node: ip-10-195-198-31" ? >> The node it is referring to is not the remote machine. It is >> What I call machine A. I first thought maybe this is a problem >> With PATH variable. But I don't think so. I compared root's >> Path to that of tsaki's and made them identical and retried. >> I got the same behavior. >> If you could enlighten me why this is happening, I would really >> Appreciate it. >> Thank you. >> Tena >> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>> Hi jeff, >>> >>> Thanks for the firewall tip. I tried it while allowing all tip traffic >>> and got interesting and preplexing result. Here's what's interesting >>> (BTW, I got rid of "LogLevel DEBUG3" from ./ssh/config on this run): >>> >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>> Host key verification failed. >>> >>> -------------------------------------------------------------------------- >>> A daemon (pid 2743) died unexpectedly with status 255 while attempting >>> to launch so we are aborting. >>> >>> There may be more information reported by the environment (see above). >>> >>> This may be because the daemon was unable to find all the needed shared >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have >>> the >>> location of the shared libraries on the remote nodes and this will >>> automatically be forwarded to the remote nodes. >>> >>> -------------------------------------------------------------------------- >>> >>> -------------------------------------------------------------------------- >>> mpirun noticed that the job aborted, but has no info as to the process >>> that caused that situation. 
>>> >>> -------------------------------------------------------------------------- >>> mpirun: clean termination accomplished >>> >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to >>> /usr/local/lib >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as well >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159 >>> Warning: Identity file tsakai not accessible: No such file or directory. >>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132 >>> [tsakai@ip-10-195-171-159 ~]$ >>> [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>> [tsakai@ip-10-195-171-159 ~]$ >>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB >>> LD_LIBRARY_PATH=/usr/local/lib >>> [tsakai@ip-10-195-171-159 ~]$ >>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A >>> [tsakai@ip-10-195-171-159 ~]$ exit >>> logout >>> Connection to ip-10-195-171-159 closed. >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ hostname >>> ip-10-203-21-132 >>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>> Host key verification failed. >>> >>> -------------------------------------------------------------------------- >>> A daemon (pid 2789) died unexpectedly with status 255 while attempting >>> to launch so we are aborting. >>> >>> There may be more information reported by the environment (see above). >>> >>> This may be because the daemon was unable to find all the needed shared >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have >>> the >>> location of the shared libraries on the remote nodes and this will >>> automatically be forwarded to the remote nodes. >>> >>> -------------------------------------------------------------------------- >>> >>> -------------------------------------------------------------------------- >>> mpirun noticed that the job aborted, but has no info as to the process >>> that caused that situation. >>> >>> -------------------------------------------------------------------------- >>> mpirun: clean termination accomplished >>> >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in >>> /usr/local/lib... 
>>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less >>> total 16604 >>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> >>> libfuse.so.2.8.5 >>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> >>> libfuse.so.2.8.5 >>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> >>> libmca_common_sm.so.1.0.0 >>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> >>> libmca_common_sm.so.1.0.0 >>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2 >>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> >>> libmpi.so.0.0.2 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> >>> libmpi_cxx.so.0.0.1 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> >>> libmpi_cxx.so.0.0.1 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> >>> libmpi_f77.so.0.0.1 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> >>> libmpi_f77.so.0.0.1 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> >>> libmpi_f90.so.0.0.1 >>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> >>> libmpi_f90.so.0.0.1 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> >>> libopen-pal.so.0.0.0 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> >>> libopen-pal.so.0.0.0 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> >>> libopen-rte.so.0.0.0 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> >>> libopen-rte.so.0.0.0 >>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> >>> libopenmpi_malloc.so.0.0.0 >>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> >>> libopenmpi_malloc.so.0.0.0 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> >>> libulockmgr.so.1.0.1 >>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> >>> libulockmgr.so.1.0.1 >>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> >>> libxml2.so.2.7.2 >>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> >>> libxml2.so.2.7.2 >>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a >>> [tsakai@ip-10-203-21-132 ~]$ >>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused... >>> [tsakai@ip-10-203-21-132 ~]$ >>> >>> Do you know why it's complaining about shared libraries? >>> >>> Thank you. >>> >>> Tena >>> >>> >>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>> >>>> Your prior mails were about ssh issues, but this one sounds like you might >>>> have firewall issues. >>>> >>>> That is, the "orted" command attempts to open a TCP socket back to mpirun >>>> for >>>> various command and control reasons. If it is blocked from doing so by a >>>> firewall, Open MPI won't run. In general, you can either disable your >>>> firewall or you can setup a trust relationship for TCP connections within >>>> your >>>> cluster. >>>> >>>> >>>> >>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote: >>>> >>>>> Hi Reuti, >>>>> >>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete >>>>> session is captured in the attached file. >>>>> >>>>> What I did is much similar to what I have done before: verify >>>>> that ssh works and then run mpirun command. In my a bit lengthy >>>>> session log, there are two responses from "LogLevel DEBUG3." First >>>>> from an scp invocation and then from mpirun invocation. They both >>>>> say >>>>> debug1: Authentication succeeded (publickey). 
>>>>> >>>>>> From mpirun invocation, I see a line: >>>>> debug1: Sending command: orted --daemonize -mca ess env -mca >>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs >>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256" >>>>> The IP address at the end of the line is indeed that of machine B. >>>>> After that there was hanging and I controlled-C out of it, which >>>>> gave me more lines. But the lines after >>>>> debug1: Sending command: orted bla bla bla >>>>> doesn't look good to me. But, in truth, I have no idea what they >>>>> mean. >>>>> >>>>> If you could shed some light, I would appreciate it very much. >>>>> >>>>> Regards, >>>>> >>>>> Tena >>>>> >>>>> >>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai: >>>>>> >>>>>>>> your local machine is Linux like, but the execution hosts >>>>>>>> are Macs? I saw the /Users/tsakai/... in your output. >>>>>>> No, my environment is entirely linux. The path to my home >>>>>>> directory on one host (blitzen) has been known as /Users/tsakai, >>>>>>> despite it is an nfs mount from vixen (which is known to >>>>>>> itself as /home/tsakai). For historical reasons, I have >>>>>>> chosen to give a symbolic link named /Users to vixen's /Home, >>>>>>> so that I can use consistent path for both vixen and blitzen. >>>>>> okay. Sometimes the protection of the home directory must be adjusted >>>>>> too, >>>>>> but >>>>>> as you can do it from the command line this shouldn't be an issue. >>>>>> >>>>>> >>>>>>>> Is this a private cluster (or at least private interfaces)? >>>>>>>> It would also be an option to use hostbased authentication, >>>>>>>> which will avoid setting any known_hosts file or passphraseless >>>>>>>> ssh-keys for each user. >>>>>>> No, it is not a private cluster. It is Amazon EC2. When I >>>>>>> Ssh from my local machine (vixen) I use its public interface, >>>>>>> but to address from one amazon cluster node to the other I >>>>>>> use nodes' private dns names: domU-12-31-39-07-35-21 and >>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names >>>>>>> change from a launch to another. I am using passphrasesless >>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to >>>>>>> Amazon node A, from amazon node A to amazon node B, and from >>>>>>> Amazon node B back to A. (Please see my initail post. There >>>>>>> is a session dialogue for this.) They all work without authen- >>>>>>> tication dialogue, except a brief initial dialogue: >>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)' >>>>>>> can't be established. >>>>>>> RSA key fingerprint is >>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>> Are you sure you want to continue connecting (yes/no)? >>>>>>> to which I say "yes." >>>>>>> But I am unclear with what you mean by "hostbased authentication"? >>>>>>> Doesn't that mean with password? If so, it is not an option. >>>>>> No. It's convenient inside a private cluster as it won't fill each users' >>>>>> known_hosts file and you don't need to create any ssh-keys. But when the >>>>>> hostname changes every time it might also create new hostkeys. It uses >>>>>> hostkeys (private and public), this way it works for all users. Just for >>>>>> reference: >>>>>> >>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html >>>>>> >>>>>> You could look into it later. >>>>>> >>>>>> == >>>>>> >>>>>> - Can you try to use a command when connecting from A to B? E.g. 
ssh >>>>>> `domU-12-31-39-06-74-E2 ls`. Is this working too? >>>>>> >>>>>> - What about putting: >>>>>> >>>>>> LogLevel DEBUG3 >>>>>> >>>>>> In your ~/.ssh/config. Maybe we can see what he's trying to negotiate >>>>>> before >>>>>> it fails in verbose mode. >>>>>> >>>>>> >>>>>> -- Reuti >>>>>> >>>>>> >>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Tena >>>>>>> >>>>>>> >>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> your local machine is Linux like, but the execution hosts are Macs? I >>>>>>>> saw >>>>>>>> the >>>>>>>> /Users/tsakai/... in your output. >>>>>>>> >>>>>>>> a) executing a command on them is also working, e.g.: ssh >>>>>>>> domU-12-31-39-07-35-21 ls >>>>>>>> >>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I have made a bit of progress(?)... >>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks >>>>>>>>> like: >>>>>>>>> # machine A >>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal >>>>>>>> This is just an abbreviation or nickname above. To use the specified >>>>>>>> settings, >>>>>>>> it's necessary to specify exactly this name. When the settings are the >>>>>>>> same >>>>>>>> anyway for all machines, you can use: >>>>>>>> >>>>>>>> Host * >>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>> IdentitiesOnly yes >>>>>>>> BatchMode yes >>>>>>>> >>>>>>>> instead. >>>>>>>> >>>>>>>> Is this a private cluster (or at least private interfaces)? It would >>>>>>>> also >>>>>>>> be >>>>>>>> an option to use hostbased authentication, which will avoid setting any >>>>>>>> known_hosts file or passphraseless ssh-keys for each user. >>>>>>>> >>>>>>>> -- Reuti >>>>>>>> >>>>>>>> >>>>>>>>> HostName domU-12-31-39-07-35-21 >>>>>>>>> BatchMode yes >>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>> IdentitiesOnly yes >>>>>>>>> >>>>>>>>> # machine B >>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal >>>>>>>>> HostName domU-12-31-39-06-74-E2 >>>>>>>>> BatchMode yes >>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>> IdentitiesOnly yes >>>>>>>>> >>>>>>>>> This file exists on both machine A and machine B. >>>>>>>>> >>>>>>>>> Now When I issue mpirun command as below: >>>>>>>>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2 >>>>>>>>> >>>>>>>>> It hungs. I control-C out of it and I get: >>>>>>>>> mpirun: killing job... >>>>>>>>> >>>>>>>>> >>>>>>>>> >> ------------------------------------------------------------------------->>>>>> >>> - >>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>> process >>>>>>>>> that caused that situation. >>>>>>>>> >>>>>>>>> >> ------------------------------------------------------------------------->>>>>> >>> - >>>>>>>>> >> ------------------------------------------------------------------------->>>>>> >>> - >>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown >>>>>>>>> below. Additional manual cleanup may be required - please refer to >>>>>>>>> the "orte-clean" tool for assistance. >>>>>>>>> >>>>>>>>> >> ------------------------------------------------------------------------->>>>>> >>> - >>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report >>>>>>>>> back when launched >>>>>>>>> >>>>>>>>> Am I making progress? >>>>>>>>> >>>>>>>>> Does this mean I am past authentication and something else is the >>>>>>>>> problem? 
>>>>>>>>> Does someone have an example .ssh/config file I can look at? There >>>>>>>>> are >>>>>>>>> so >>>>>>>>> many keyword-argument pairs for this config file and I would like to >>>>>>>>> look >>>>>>>>> at >>>>>>>>> some very basic one that works. >>>>>>>>> >>>>>>>>> Thank you. >>>>>>>>> >>>>>>>>> Tena Sakai >>>>>>>>> tsa...@gallo.ucsf.edu >>>>>>>>> >>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>> >>>>>>>>>> Hi >>>>>>>>>> >>>>>>>>>> I have an app.ac1 file like below: >>>>>>>>>> [tsakai@vixen local]$ cat app.ac1 >>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5 >>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6 >>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7 >>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8 >>>>>>>>>> >>>>>>>>>> The program I run is >>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x >>>>>>>>>> Where x is [5..8]. The machines vixen and blitzen each run 2 runs. >>>>>>>>>> >>>>>>>>>> Here's the program fib.R: >>>>>>>>>> [tsakai@vixen local]$ cat fib.R >>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively >>>>>>>>>> # here's the first dozen sequence (indexed from 0..11) >>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 >>>>>>>>>> >>>>>>>>>> fib <- function( n ) { >>>>>>>>>> a <- 0 >>>>>>>>>> b <- 1 >>>>>>>>>> for ( i in 1:n ) { >>>>>>>>>> t <- b >>>>>>>>>> b <- a >>>>>>>>>> a <- a + t >>>>>>>>>> } >>>>>>>>>> a >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> arg <- commandArgs( TRUE ) >>>>>>>>>> myHost <- system( 'hostname', intern=TRUE ) >>>>>>>>>> cat( fib(arg), myHost, '\n' ) >>>>>>>>>> >>>>>>>>>> It reads an argument from command line and produces a fibonacci >>>>>>>>>> number >>>>>>>>>> that >>>>>>>>>> corresponds to that index, followed by the machine name. Pretty >>>>>>>>>> simple >>>>>>>>>> stuff. >>>>>>>>>> >>>>>>>>>> Here's the run output: >>>>>>>>>> [tsakai@vixen local]$ mpirun -app app.ac1 >>>>>>>>>> 5 vixen.egcrc.org >>>>>>>>>> 8 vixen.egcrc.org >>>>>>>>>> 13 blitzen.egcrc.org >>>>>>>>>> 21 blitzen.egcrc.org >>>>>>>>>> >>>>>>>>>> Which is exactly what I expect. So far so good. >>>>>>>>>> >>>>>>>>>> Now I want to run the same thing on cloud. 
I launch 2 instances of >>>>>>>>>> the >>>>>>>>>> same >>>>>>>>>> virtual machine, to which I get to by: >>>>>>>>>> [tsakai@vixen local]$ ssh A I ~/.ssh/tsakai >>>>>>>>>> machine-instance-A-public-dns >>>>>>>>>> >>>>>>>>>> Now I am on machine A: >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B >>>>>>>>>> without >>>>>>>>>> password authentication, >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai >>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4 >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname >>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A >>>>>>>>>> without using password >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai >>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' >>>>>>>>>> can't >>>>>>>>>> be established. >>>>>>>>>> RSA key fingerprint is >>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes >>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the >>>>>>>>>> list >>>>>>>>>> of >>>>>>>>>> known hosts. >>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239 >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit >>>>>>>>>> logout >>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed. >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit >>>>>>>>>> logout >>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed. >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>> >>>>>>>>>> As you can see, neither machine uses password for authentication; it >>>>>>>>>> uses >>>>>>>>>> public/private key pairs. There is no problem (that I can see) for >>>>>>>>>> ssh >>>>>>>>>> invocation >>>>>>>>>> from one machine to the other. This is so because I have a copy of >>>>>>>>>> public >>>>>>>>>> key >>>>>>>>>> and a copy of private key on each instance. >>>>>>>>>> >>>>>>>>>> The app.ac file is identical, except the node names: >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1 >>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>>> >>>>>>>>>> Here¹s what happens with mpirun: >>>>>>>>>> >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1 >>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: >>>>>>>>>> Permission denied, please try again. 
>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job... >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >> ----------------------------------------------------------------------->>>>>>>> >> - >>>>>>>>>> -- >>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>> process >>>>>>>>>> that caused that situation. >>>>>>>>>> >>>>>>>>>> >> ----------------------------------------------------------------------->>>>>>>> >> - >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>> >>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>> >>>>>>>>>> Mpirun (or somebody else?) asks me password, which I don¹t have. >>>>>>>>>> I end up typing control-C. >>>>>>>>>> >>>>>>>>>> Here¹s my question: >>>>>>>>>> How can I get past authentication by mpirun where there is no >>>>>>>>>> password? >>>>>>>>>> >>>>>>>>>> I would appreciate your help/insight greatly. >>>>>>>>>> >>>>>>>>>> Thank you. >>>>>>>>>> >>>>>>>>>> Tena Sakai >>>>>>>>>> tsa...@gallo.ucsf.edu >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> <session4Reuti.text>_______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> >>>> -- >>>> Jeff Squyres >>>> jsquy...@cisco.com >>>> For corporate legal information go to: >>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users