Hi Ralph

Thanks - I downloaded and installed openmpi-1.4a1r20435 and now everything works as it should:
  --output-filename : all processes write their output to the correct files
  --xterm           : all specified processes opened their xterms
I started my application with --xterm as I wrote in a previous mail:
  - call 'xhost +<remote_node>' for all nodes in my hostfile
  - export DISPLAY=<my_workstation>:0.0
  - call mpirun -np 8 -x DISPLAY --hostfile testhosts --xterm --ranks=2,3,4,5! ./MPITest

Combining --xterm with --output-filename also worked.

Thanks again!
Jody
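For anyone who wants to reproduce this, here is a minimal sketch of the sequence described above. The host names, the hostfile 'testhosts', and the application name are placeholders taken from this thread, and the --xterm/--ranks spelling is copied from the command above; check 'mpirun --help' on your own build:

  # allow each node in the hostfile to open windows on the local X server
  xhost +node_00
  xhost +node_01

  # point DISPLAY at the workstation that should show the xterms
  export DISPLAY=my_workstation:0.0

  # forward DISPLAY to all ranks; the trailing '!' keeps the xterms of ranks 2-5 open after exit
  mpirun -np 8 -x DISPLAY --hostfile testhosts --xterm --ranks=2,3,4,5! ./MPITest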
On Tue, Feb 3, 2009 at 11:03 PM, Ralph Castain <r...@lanl.gov> wrote:
> Hi Jody
>
> Well, the problem with both the output filename and the xterm option was that I wasn't passing them back to the remote daemons under the ssh launch environment. I should have that corrected now - things will hopefully work with any tarball of r20407 or above.
>
> Let me know...
> Ralph
>
> On Feb 3, 2009, at 11:34 AM, Ralph Castain wrote:
>
>> Ah! I know the problem - forgot you are running under ssh, so the environment doesn't get passed.
>>
>> I'll have to find a way to pass the output filename to the backend nodes... should have it later today.
>>
>> On Feb 3, 2009, at 11:09 AM, jody wrote:
>>
>>> Hi Ralph
>>>>>
>>>>> --output-filename
>>>>> It creates files, but only for the local processes:
>>>>> [jody@localhost neander]$ mpirun -np 8 -hostfile testhosts --output-filename gnana ./MPITest
>>>>> ... output ...
>>>>> [jody@localhost neander]$ ls -l gna*
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.0
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.1
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.2
>>>>> (I set slots=3 on my workstation)
>>>>
>>>> Did you give a location that is on an NFS mount?
>>>
>>> Yes, I started mpirun on a drive which all the remote nodes mount as NFS drives.
>>>
>>>> I'm willing to bet the files are being created - they are on your remote nodes. The daemons create their own local files for output from their local procs. We decided to do this for scalability reasons - if we have mpirun open all the output files, then you could easily hit the file descriptor limit on that node and cause the job not to launch.
>>>>
>>>> Check your remote nodes and see if the files are there.
>>>
>>> Where would I have to look? They are not in my home directories on the nodes.
>>>
>>>> I can fix that easily enough - we'll just test to see if the xterm option has been set, and add the -X to ssh if so.
>>>>
>>>> Note that you can probably set this yourself right now by -mca plm_rsh_agent "ssh -X"
>>>
>>> I tried this, but it didn't work, though we may be getting there:
>>>
>>> [jody@localhost neander]$ mpirun -np 8 -mca plm_rsh_agent "ssh -X" -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> ...
>>> => The 3 remote processes (3,4,5) tried to get access.
>>>
>>> I remember having had an xauth problem like this in another setup before, but I've forgotten how to solve it. I'll try to find out, and get back to you when I've figured it out.
>>>
>>> BTW: calling an X application over SSH works, e.g.
>>>   ssh -X node_00 xclock
>>>
>>> Jody
>>>>>
>>>>> So what I currently do to have my xterms running: on my workstation I call
>>>>>   xhost + <hostname>
>>>>> for all machines in my hostfile, to allow them to use X on my workstation.
>>>>> Then I set my DISPLAY variable to point to my workstation:
>>>>>   export DISPLAY=<mymachine>:0.0
>>>>> Finally, I call mpirun with the -x option (to export the DISPLAY variable to all nodes):
>>>>>   mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh MyApplication arg1 arg2
>>>>>
>>>>> Here run_xterm.sh is a shell script which creates a useful title for the xterm window and calls the application with all its arguments (-hold leaves the xterm open after the program terminates):
>>>>>
>>>>> #!/bin/sh -f
>>>>>
>>>>> # feedback for command line
>>>>> echo "Running on node `hostname`"
>>>>>
>>>>> # for version 1.3 use the documented env variable,
>>>>> # for version 1.2 fall back to the undocumented one
>>>>> export ID=$OMPI_COMM_WORLD_RANK
>>>>> if [ X$ID = X ]; then
>>>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>>>> fi
>>>>>
>>>>> export TITLE="node #$ID"
>>>>> # start terminal
>>>>> xterm -T "$TITLE" -hold -e $*
>>>>>
>>>>> exit 0
>>>>>
>>>>> (I have similar scripts to run gdb or valgrind in xterm windows.)
>>>>> I know that the 'xhost +' is a horror for certain sysadmins, but I feel quite safe, because the machines listed in my hostfile are not accessible from outside our department.
>>>>> I haven't found any other alternative to have nice xterms when I can't use 'ssh -X'.
>>>>>
>>>>> To come back to the '--xterm' option: I just ran my xterm script after doing the above xhost+ and DISPLAY things, and it worked - all local and remote processes created their xterm windows. (In other words, the environment was set to have my remote nodes use xterms on my workstation.)
>>>>>
>>>>> Immediately thereafter I called the same application with
>>>>>   mpirun -np 8 -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>>>> but still, only the local process (#2) created an xterm.
>>>>>
>>>>> Do you think it would be possible to have Open MPI make its ssh connections with '-X', or are there technical or security-related objections?
>>>>>
>>>>> Regards
>>>>>
>>>>> Jody
>>>>>
>>>>> On Mon, Feb 2, 2009 at 4:47 PM, Ralph Castain <r...@lanl.gov> wrote:
>>>>>>
>>>>>> On Feb 2, 2009, at 2:55 AM, jody wrote:
>>>>>>
>>>>>>> Hi Ralph
>>>>>>> The new options are great stuff!
>>>>>>> Following your suggestion, I downloaded and installed
>>>>>>>   http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>> and tested the new options. (I have a simple cluster of 8 machines over tcp.) Not everything worked as specified, though:
>>>>>>>
>>>>>>> * timestamp-output : works
>>>>>>
>>>>>> good!
>>>>>>
>>>>>>> * xterm : doesn't work completely - comma-separated rank list:
>>>>>>> An xterm is opened only for the local processes. The other processes (the ones on remote machines) only output to the stdout of the calling window.
>>>>>>> (Just to be sure I started my own script for opening separate xterms - that did work for the remote ones, too.)
>>>>>>
>>>>>> This is a problem we wrestled with for some time. The issue is that we really aren't comfortable modifying the DISPLAY envar on the remote nodes like you do in your script. It is fine for a user to do whatever they want, but for OMPI to do it... that's another matter. We can't even know for sure what to do because of the wide range of scenarios that might occur (e.g., is mpirun local to you, or on a remote node connected to you via xterm, or...?).
>>>>>>
>>>>>> What you (the user) need to do is ensure that X11 is set up properly so that an X window opened on the remote host is displayed on your screen. In this case, I believe you have to enable X forwarding - I'm not an xterm expert, so I can't advise you on how to do this. Suspect you may already know - in which case, can you please pass it along and I'll add it to our docs? :-)
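Since Ralph asks for the X forwarding recipe: this is plain OpenSSH configuration rather than anything Open MPI specific. A rough sketch, assuming stock OpenSSH on the workstation and the nodes (verify the option names against your ssh_config/sshd_config man pages):

  # on each compute node, sshd must allow forwarding (/etc/ssh/sshd_config):
  #     X11Forwarding yes
  # then restart sshd on that node.

  # on the workstation, either request forwarding explicitly per connection
  ssh -X node_00 xclock        # -Y gives "trusted" forwarding if -X is too restrictive

  # ... or enable it for these hosts in ~/.ssh/config:
  #     Host node_*
  #         ForwardX11 yes

  # The "No xauth data" warnings shown earlier usually mean xauth is missing on the
  # node or has no cookie for the display; installing xauth there is typically enough.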
>>>>>>> If a '-1' is given instead of a list of ranks, it fails (locally & with remotes):
>>>>>>>
>>>>>>> [jody@localhost neander]$ mpirun -np 4 --xterm -1 ./MPITest
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Sorry! You were supposed to get help about:
>>>>>>>     orte-odls-base:xterm-rank-out-of-bounds
>>>>>>> from the file:
>>>>>>>     help-odls-base.txt
>>>>>>> But I couldn't find any file matching that name. Sorry!
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun was unable to start the specified application as it encountered an error on node localhost. More information may be available above.
>>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> Fixed as of r20398 - this was a bug, had an if statement out of sequence.
>>>>>>
>>>>>>> * output-filename : doesn't work here:
>>>>>>> [jody@localhost neander]$ mpirun -np 4 --output-filename gnagna ./MPITest
>>>>>>> [jody@localhost neander]$ ls -l gna*
>>>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-02 09:07 gnagna.%10lu
>>>>>>>
>>>>>>> There is output from the processes on remote machines on stdout, but none from the local ones.
>>>>>>
>>>>>> Fixed as of r20400 - had a format statement syntax that was okay in some compilers, but not others.
>>>>>>
>>>>>>> A question about installing: I installed the usual way (configure, make all install), but the new man files apparently weren't copied to their destination: if I do 'man mpirun' I am shown the contents of an old man file (without the new options). I had to do 'less /opt//openmpi-1.4a1r20394/share/man/man1/mpirun.1' to see them.
>>>>>>
>>>>>> Strange - the install should put them in the right place, but I wonder if you updated your manpath to point at it?
>>>>>>
>>>>>>> About the xterm option: when the application ends, all xterms are closed immediately. (When doing things 'by hand' I used the -hold option for xterm.) Would it be possible to add this feature for your xterm option? Perhaps by adding a '!' at the end of the rank list?
>>>>>>
>>>>>> Done! A "!" at the end of the list will activate -hold as of r20398.
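To make the new '!' suffix concrete, here is a sketch of how the options discussed in this thread could combine on one command line. Option spellings follow the names used above (check 'mpirun --help' on your nightly build), and with csh-style shells the '!' may need to be escaped as \!:

  # timestamp all output, write each rank's output to gnagna.<rank>,
  # and keep held xterms open for ranks 2-5 after the job ends
  mpirun -np 8 -hostfile testhosts -x DISPLAY \
         --timestamp-output --output-filename gnagna --xterm 2,3,4,5! ./MPITest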
>>>>>>> About orte_iof: with the new version it works, but no matter which rank I specify, it only prints out rank 0's output:
>>>>>>>
>>>>>>> [jody@localhost ~]$ orte-iof --pid 31049 --rank 4 --stdout
>>>>>>> [localhost]I am #0/9 before the barrier
>>>>>>
>>>>>> The problem here is that the option name changed from "rank" to "ranks" since you can now specify any number of ranks as comma-separated ranges. I have updated orte-iof so it will gracefully fail if you provide an unrecognized cmd line option and output the "help" detailing the accepted options.
>>>>>>
>>>>>>> Thanks
>>>>>>> Jody
>>>>>>>
>>>>>>> On Sun, Feb 1, 2009 at 10:49 PM, Ralph Castain <r...@lanl.gov> wrote:
>>>>>>>>
>>>>>>>> I'm afraid we discovered a bug in optimized builds with r20392. Please use any tarball with r20394 or above.
>>>>>>>>
>>>>>>>> Sorry for the confusion
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Feb 1, 2009, at 5:27 AM, Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> On Jan 31, 2009, at 11:39 AM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> For anyone following this thread:
>>>>>>>>>>
>>>>>>>>>> I have completed the IOF options discussed below. Specifically, I have added the following:
>>>>>>>>>>
>>>>>>>>>> * a new "timestamp-output" option that timestamps each line of output
>>>>>>>>>> * a new "output-filename" option that redirects each proc's output to a separate rank-named file
>>>>>>>>>> * a new "xterm" option that redirects the output of the specified ranks to a separate xterm window
>>>>>>>>>>
>>>>>>>>>> You can obtain a copy of the updated code at:
>>>>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>>>>
>>>>>>>>> Sweet stuff. :-)
>>>>>>>>>
>>>>>>>>> Note that the URL/tarball that Ralph cites is a nightly snapshot and will expire after a while -- we only keep the 5 most recent nightly tarballs available. You can find Ralph's new IOF stuff in any 1.4a1 nightly tarball after the one he cited above. Note that the last part of the tarball name refers to the subversion commit number (which increases monotonically); any 1.4 nightly snapshot tarball beyond "r20392" will contain this new IOF stuff. Here's where to get our nightly snapshot tarballs:
>>>>>>>>>
>>>>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>>>>
>>>>>>>>> Don't read anything into the "1.4" version number -- we've just bumped the version number internally to be different than the current stable series (1.3). We haven't yet branched for the v1.4 series; hence, "1.4a1" currently refers to our development trunk.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> Cisco Systems
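One closing note for readers who land here searching for orte-iof: with the rename Ralph mentions above, the invocation from the example would become something like the following (a sketch - confirm the accepted options with 'orte-iof --help' on your build):

  # attach to the running mpirun with pid 31049 and pull the stdout of ranks 4 and 5
  orte-iof --pid 31049 --ranks 4,5 --stdout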