[OMPI users] orte_pls_base_select fails
Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc mounted properly), and orte_init is failing as follows:

$ uptime
 12:51:55 up 12 days, 21:30, 0 users, load average: 0.00, 0.00, 0.00
$ orterun -np 1 uptime
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 312
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Note that running with -v produces no more output than this. Running orted in the background doesn't seem to help.

What could be wrong? Does orterun not run in a chroot environment? What more can I do to investigate further?

Thanks,
-Adam
--
GPG fingerprint: D54D 1AEE B11C CE9B A02B C5DD 526F 01E8 564E E4B6
Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html
Re: [OMPI users] orte_pls_base_select fails
Adam C Powell IV wrote:
> Greetings,
>
> I'm running the Debian package of OpenMPI in a chroot (with /proc
> mounted properly), and orte_init is failing as follows:
> [snip: error output quoted in full above]
> What could be wrong? Does orterun not run in a chroot environment?
> What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.

Thanks,

Tim
Re: [OMPI users] orte_pls_base_select fails
On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
> Adam C Powell IV wrote:
> > Greetings,
> >
> > I'm running the Debian package of OpenMPI in a chroot (with /proc
> > mounted properly), and orte_init is failing as follows:
> > [snip]
> > What could be wrong? Does orterun not run in a chroot environment?
> > What more can I do to investigate further?
>
> Try running mpirun with the added options:
> -mca orte_debug 1 -mca pls_base_verbose 20
>
> Then send the output to the list.

Thanks! Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for pls components
[new-host-3:19201] mca: base: components_open: distilling pls components
[new-host-3:19201] mca: base: components_open: accepting all pls components
[new-host-3:19201] mca: base: components_open: opening pls components
[new-host-3:19201] mca: base: components_open: found loaded component gridengine
[new-host-3:19201] mca: base: components_open: component gridengine open function successful
[new-host-3:19201] mca: base: components_open: found loaded component proxy
[new-host-3:19201] mca: base: components_open: component proxy open function successful
[new-host-3:19201] mca: base: components_open: found loaded component rsh
[new-host-3:19201] mca: base: components_open: component rsh open function successful
[new-host-3:19201] mca: base: components_open: found loaded component slurm
[new-host-3:19201] mca: base: components_open: component slurm open function successful
[new-host-3:19201] orte:base:select: querying component gridengine
[new-host-3:19201] pls:gridengine: NOT available for selection
[new-host-3:19201] orte:base:select: querying component proxy
[new-host-3:19201] orte:base:select: querying component rsh
[new-host-3:19201] orte:base:select: querying component slurm
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 312
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

-Adam
--
GPG fingerprint: D54D 1AEE B11C CE9B A02B C5DD 526F 01E8 564E E4B6
Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html
Re: [OMPI users] orte_pls_base_select fails
This is strange. I assume that you want to use rsh or ssh to launch the processes?

If you want to use ssh, does "which ssh" find ssh? Similarly, if you want to use rsh, does "which rsh" find rsh?

Thanks,

Tim

Adam C Powell IV wrote:
> On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
> > Try running mpirun with the added options:
> > -mca orte_debug 1 -mca pls_base_verbose 20
> >
> > Then send the output to the list.
>
> Thanks! Here's the output:
> [snip: same verbose output as above]
>
> -Adam
[OMPI users] Octave MPITB for Open-MPI
MPITB is an Octave toolbox for MPI. The new release also works with Open-MPI. Test reports are welcome, since this is an initial release. For more information see http://atc.ugr.es/javier-bin/mpitb Thanks -javier
Re: [OMPI users] DataTypes with "holes" for writing files
On Tue, Jul 10, 2007 at 04:36:01PM +, jody wrote:
> I think there is still some problem.
> I create different datatypes by resizing MPI_SHORT with
> different negative lower bounds (depending on the rank)
> and the same extent (only depending on the number of processes).
>
> However, I get an error as soon as MPI_File_set_view is called with my new
> datatype:
>
> Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks
> [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 1

Hi Jody

I was wrong about this being a problem with OpenMPI's version of ROMIO. The OpenMPI guys have synced up fairly recently with the ROMIO in MPICH2. ROMIO, even the very latest CVS version, doesn't support resized types yet.

Looks like you'll have to take George's alternate idea of MPI_UB and MPI_LB. We'll let the OpenMPI guys know when resized support is in place. Sorry for the confusion.

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
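[For readers who need the workaround Rob mentions, here is a minimal sketch of the MPI_LB/MPI_UB alternative: an MPI_Type_struct with LB and UB markers builds the same type that MPI_Type_create_resized(MPI_SHORT, lb, extent) would. The helper name and its byte-valued parameters are illustrative, not from the thread.]

#include "mpi.h"

/* Sketch: emulate MPI_Type_create_resized(MPI_SHORT, lb, extent)
 * with the older MPI_LB/MPI_UB markers, as George suggested.
 * lb (typically negative in jody's setup) and extent are in bytes. */
MPI_Datatype make_resized_short(MPI_Aint lb, MPI_Aint extent)
{
    int          blocklens[3] = { 1, 1, 1 };
    MPI_Aint     disps[3]     = { lb, 0, lb + extent }; /* LB, data, UB */
    MPI_Datatype types[3]     = { MPI_LB, MPI_SHORT, MPI_UB };
    MPI_Datatype newtype;

    MPI_Type_struct(3, blocklens, disps, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype; /* usable with MPI_File_set_view */
}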
Re: [OMPI users] orte_pls_base_select fails
As mentioned, I'm running in a chroot environment, so rsh and ssh won't work: "rsh localhost" will rsh into the primary local host environment, not the chroot, which will fail.

[The purpose is to be able to build and test MPI programs in the Debian unstable distribution, without upgrading the whole machine to unstable. Though most machines I use for this purpose run Debian stable or testing, the machine I'm currently using runs a very old Fedora, for which I don't think OpenMPI is available.]

With MPICH, mpirun -np 1 just runs the new process in the current context, without rsh/ssh, so it works in a chroot. Does OpenMPI not support this functionality?

Thanks,
Adam

On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
> This is strange. I assume that you want to use rsh or ssh to launch the
> processes?
>
> If you want to use ssh, does "which ssh" find ssh? Similarly, if you
> want to use rsh, does "which rsh" find rsh?
>
> Thanks,
>
> Tim
>
> [snip: earlier messages and verbose output quoted in full]

--
GPG fingerprint: D54D 1AEE B11C CE9B A02B C5DD 526F 01E8 564E E4B6
Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html
Re: [OMPI users] orte_pls_base_select fails
On 7/18/07 9:49 AM, "Adam C Powell IV" wrote:
> As mentioned, I'm running in a chroot environment, so rsh and ssh won't
> work: "rsh localhost" will rsh into the primary local host environment,
> not the chroot, which will fail.
>
> [The purpose is to be able to build and test MPI programs in the Debian
> unstable distribution, without upgrading the whole machine to unstable.
> Though most machines I use for this purpose run Debian stable or
> testing, the machine I'm currently using runs a very old Fedora, for
> which I don't think OpenMPI is available.]
>
> With MPICH, mpirun -np 1 just runs the new process in the current
> context, without rsh/ssh, so it works in a chroot. Does OpenMPI not
> support this functionality?

Yes - and no. OpenMPI will launch on a local node without using rsh/ssh. However, and it is a big however, our init code requires that we still identify a working launcher that could be used to launch on remote nodes.

Frankly, we never considered the case you describe. We could (and perhaps should) modify the code to allow it to continue even if it doesn't find a viable launcher. I believe our initial thinking was that something that launched only on the local node wasn't much use to MPI and therefore that scenario probably represents an error condition.

We'll discuss it and see what we think should be done. Meantime, the answer would have to be "no, we don't support that".

Ralph

> [snip: remainder of quoted thread]
Re: [OMPI users] orte_pls_base_select fails
Adam C Powell IV wrote:
> As mentioned, I'm running in a chroot environment, so rsh and ssh won't
> work: "rsh localhost" will rsh into the primary local host environment,
> not the chroot, which will fail.
>
> [The purpose is to be able to build and test MPI programs in the Debian
> unstable distribution, without upgrading the whole machine to unstable.
> Though most machines I use for this purpose run Debian stable or
> testing, the machine I'm currently using runs a very old Fedora, for
> which I don't think OpenMPI is available.]

All right, I understand what you are trying to do now. To be honest, I don't think we have ever really thought about this use case. We always figured that to test Open MPI people would simply install it in a different directory and use it from there.

> With MPICH, mpirun -np 1 just runs the new process in the current
> context, without rsh/ssh, so it works in a chroot. Does OpenMPI not
> support this functionality?

Open MPI does support this functionality. First, a bit of explanation:

We use 'pls' (process launching system) components to handle the launching of processes. There are components for slurm, gridengine, rsh, and others. At runtime we open each of these components and query them as to whether they can be used. The original error you posted says that none of the 'pls' components can be used because they all detected they could not run in your setup. The slurm one excluded itself because there were no environment variables set indicating it is running under SLURM. Similarly, the gridengine pls said it cannot run as well. The 'rsh' pls said it cannot run because neither 'ssh' nor 'rsh' is available (I assume this is the case, though you did not explicitly say they were not available).

But in this case, you do want the 'rsh' pls to be used. It will automatically fork any local processes, and will use rsh/ssh to launch any remote processes. Again, I don't think we ever imagined the use case on a UNIX-like system where there are no launchers like SLURM available, and rsh/ssh also wasn't available (Open MPI is, after all, primarily concerned with multi-node operation).

So, there are several ways around this:

1. Make rsh or ssh available, even though they will not be used.

2. Tell the 'rsh' pls component to use a dummy program such as /bin/false by adding the following to the command line:
   -mca pls_rsh_agent /bin/false

3. Create a dummy 'rsh' executable that is available in your path.

For instance:

[tprins@odin ~]$ which ssh
/usr/bin/which: no ssh in (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
[tprins@odin ~]$ which rsh
/usr/bin/which: no rsh in (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
[tprins@odin ~]$ mpirun -np 1 hostname
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 317
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 46
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file orterun.c at line 399

[tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false hostname
odin.cs.indiana.edu

[tprins@odin ~]$ touch usr/bin/rsh
[tprins@odin ~]$ chmod +x usr/bin/rsh
[tprins@odin ~]$ mpirun -np 1 hostname
odin.cs.indiana.edu
[tprins@odin ~]$

I hope this helps,

Tim

> [snip: earlier thread quoted in full]
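[Equivalently, any MCA parameter can be set through the environment using Open MPI's OMPI_MCA_ prefix rather than on the command line. Assuming the same 1.2-era parameter name Tim uses above, the following one-liner matches his option 2; the transcript is illustrative, not from the thread:

$ OMPI_MCA_pls_rsh_agent=/bin/false orterun -np 1 uptime

This can be handy when the orterun invocation is buried inside build or test scripts that are awkward to modify.]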
Re: [OMPI users] mpirun hanging followup
--- Ralph Castain wrote:
> No, the session directory is created in the tmpdir - we don't create
> anything anywhere else, nor do we write any executables anywhere.

In the case where the TMPDIR env variable isn't specified, what is the default assumed by Open MPI/orte?

> Just out of curiosity: although I know you have different arch's on your
> nodes, the tests you are running are all executing on the same arch,
> correct???

Yes, tests all execute on the same arch, although I am led to another question. Can I use a headnode of a particular arch, but in my mpirun hostfile, specify only nodes of another arch, and launch from the headnode? In other words, no computation is done on the headnode of arch A, all computation is done on nodes of arch B, but the job is launched from the headnode -- would that be acceptable?

I should be clear that for the problem you are helping me with, *all* the nodes involved are running the same arch, OS, compiler, system libraries, etc. The multiple arch question is for edification for the future.
Re: [OMPI users] orte_pls_base_select fails
Tim has proposed a clever fix that I had not thought of - just be aware that it could cause unexpected behavior at some point. Still, for what you are trying to do, that might meet your needs.

Ralph

On 7/18/07 11:44 AM, "Tim Prins" wrote:
> [snip: Tim's full message, quoted above]
Re: [OMPI users] mpirun hanging followup
On 7/18/07 11:46 AM, "Bill Johnstone" wrote:
> --- Ralph Castain wrote:
>> No, the session directory is created in the tmpdir - we don't create
>> anything anywhere else, nor do we write any executables anywhere.
>
> In the case where the TMPDIR env variable isn't specified, what is the
> default assumed by Open MPI/orte?

It rattles through a logic chain:

1. ompi mca param value
2. TMPDIR in environ
3. TMP in environ
4. default to /tmp

just so we have something to work with...

>> Just out of curiosity: although I know you have different arch's on your
>> nodes, the tests you are running are all executing on the same arch,
>> correct???
>
> Yes, tests all execute on the same arch, although I am led to another
> question. Can I use a headnode of a particular arch, but in my mpirun
> hostfile, specify only nodes of another arch, and launch from the
> headnode? In other words, no computation is done on the headnode of
> arch A, all computation is done on nodes of arch B, but the job is
> launched from the headnode -- would that be acceptable?

As long as the prefix is set such that the correct binary executables can be found, then you should be fine.

> I should be clear that for the problem you are helping me with, *all*
> the nodes involved are running the same arch, OS, compiler, system
> libraries, etc. The multiple arch question is for edification for the
> future.

No problem - I just wanted to eliminate one possible complication for now.

Thanks
Ralph
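[For illustration, that fallback chain amounts to something like the following C sketch. resolve_tmpdir and its mca_value argument are hypothetical stand-ins for the MCA parameter lookup, not Open MPI's actual code:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the session-directory fallback chain Ralph describes:
 * MCA parameter, then TMPDIR, then TMP, then /tmp. */
static const char *resolve_tmpdir(const char *mca_value)
{
    const char *dir;

    if (mca_value != NULL)                  /* 1. ompi mca param value */
        return mca_value;
    if ((dir = getenv("TMPDIR")) != NULL)   /* 2. TMPDIR in environ */
        return dir;
    if ((dir = getenv("TMP")) != NULL)      /* 3. TMP in environ */
        return dir;
    return "/tmp";                          /* 4. default */
}

int main(void)
{
    printf("session base: %s\n", resolve_tmpdir(NULL));
    return 0;
}
]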
Re: [OMPI users] mpirun hanging followup
--- Ralph Castain wrote:
> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so we can
> better understand why it is failing - will try to send it to you on Wed.

Thank you for the patch you sent.

I solved the problem. It was a head-slapper of an error. Turned out that I had forgotten -- the permissions on the filesystem override the permissions of the mount point. As I mentioned, these machines have an NFS root filesystem. In that filesystem, /tmp has permissions 1777. However, when each node mounts its local temp partition to /tmp, the mount point takes on the permissions of that filesystem.

In this case, I had forgotten to apply permissions 1777 to /tmp after mounting on each machine. As a result, /tmp really did not have the appropriate permissions for mpirun to write to it as necessary.

Your patch helped me figure this out. Technically, I should have been able to figure it out from the messages you'd already sent to the mailing list, but it wasn't until I saw the line in session_dir.c where the error was occurring that I realized it had to be some kind of permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occured while attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

Starting at line 108 of session_dir.c is:

    if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode))) {
        ORTE_ERROR_LOG(ret);
    }

Three further points:

- Is there some reason ORTE can't bail out gracefully upon this error, instead of hanging like it was doing for me?

- I think leaving in the extra debug logging code you sent me in the patch for future Open MPI versions would be a good idea to help troubleshoot problems like this.

- It would be nice to see "--debug-daemons" added to the Troubleshooting section of the FAQ on the web site.

Thank you very very much for your help Ralph and everyone else that replied.
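[As a quick sanity check for this failure mode, a standalone snippet along these lines (not part of Open MPI; the expected 1777 mode comes from Bill's description of a healthy /tmp) reports what mpirun would run into:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Diagnostic sketch: report /tmp's permission bits and whether the
 * current user can write there. A world-writable sticky /tmp is 1777. */
int main(void)
{
    struct stat st;

    if (stat("/tmp", &st) != 0) {
        perror("stat /tmp");
        return 1;
    }
    printf("/tmp mode: %04o (expected 1777)\n",
           (unsigned)(st.st_mode & 07777));

    if (access("/tmp", W_OK) != 0)
        perror("/tmp not writable");  /* the symptom behind the hang */

    return 0;
}

Run once per node after the local partitions are mounted, it would have flagged the mount-point permission mismatch immediately.]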
Re: [OMPI users] orte_pls_base_select fails
On Wed, 2007-07-18 at 13:44 -0400, Tim Prins wrote:
> [snip: Tim's explanation and workarounds, quoted in full above]

Yes, this helps tremendously. I installed rsh, and now it pretty much works.

The one missing detail is that I can't seem to get the stdout/stderr output. For example:

$ orterun -np 1 uptime
$ uptime
 18:24:27 up 13 days, 3:03, 0 users, load average: 0.00, 0.03, 0.00

The man page indicates that stdout/stderr is supposed to come back to the stdout/stderr of the orterun process. Any ideas on why this isn't working?

Thank you again!
-Adam
--
GPG fingerprint: D54D 1AEE B11C CE9B A02B C5DD 526F 01E8 564E E4B6
Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html
Re: [OMPI users] mpirun hanging followup
Hooray! Glad we could help track this down - sorry it was so hard to do so.

To answer your questions:

1. Yes - ORTE should bail out gracefully. It definitely should not hang. I will log the problem and investigate. I believe I know where the problem lies, and it may already be fixed on our trunk, but the fix may not get into the 1.2 family (have to see what it would entail).

2. I will definitely commit that debug code and ensure it is in future releases.

3. I'll see if we can add something about --debug-daemons to the FAQ - thanks for pointing out that oversight.

Thanks
Ralph

On 7/18/07 12:19 PM, "Bill Johnstone" wrote:
> [snip: Bill's message, quoted in full above]
Re: [OMPI users] orte_pls_base_select fails
> Yes, this helps tremendously. I installed rsh, and now it pretty much
> works.

Glad this worked out for you.

> The one missing detail is that I can't seem to get the stdout/stderr
> output. For example:
>
> $ orterun -np 1 uptime
> $ uptime
>  18:24:27 up 13 days, 3:03, 0 users, load average: 0.00, 0.03, 0.00
>
> The man page indicates that stdout/stderr is supposed to come back to
> the stdout/stderr of the orterun process. Any ideas on why this isn't
> working?

It should work. However, we currently have some I/O forwarding problems which show up in some environments that will (hopefully) be fixed in the next release. As far as I know, the problem seems to happen mostly with non-MPI applications.

Try running a simple MPI application, such as:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();

    return 0;
}

If that works fine, then it is probably our problem, and not a problem with your setup.

Sorry I don't have a better answer :(

Tim
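[For completeness, building and launching the test above would look roughly like the following; mpicc is Open MPI's conventional compiler wrapper, the file name is made up for this example, and Dirk's message below shows the same thing done with opalcc from the Debian packages:

$ mpicc -o openmpitest openmpitest.c
$ orterun -np 2 ./openmpitest
]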
Re: [OMPI users] Problems running openmpi under os x
Brian,

To close this one off, we found that one of our libraries has a malloc/free that was being called from ompi. I should have looked at the crash reporter. It reported:

Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_INVALID_ADDRESS (0x0001) at 0x05801bfc

Thread 0 Crashed:
0 libcasa_casa.dylib   0x0107b319 free + 51
1 libopen-pal.0.dylib  0x0289eff9 opal_install_dirs_expand + 467 (installdirs_base_expand.c:68)
2 libopen-pal.0.dylib  0x0289e5a0 opal_installdirs_base_open + 1115 (installdirs_base_components.c:96)
3 libopen-pal.0.dylib  0x0287ba40 opal_init_util + 217 (opal_init.c:150)
4 libopen-pal.0.dylib  0x0287bb24 opal_init + 24 (opal_init.c:200)
5 libmpi.0.dylib       0x01d745cd ompi_mpi_init + 33 (ompi_mpi_init.c:219)
6 libmpi.0.dylib       0x01db48db MPI_Init + 293 (init.c:71)
7 ctest                0x2f90 main + 24 (ctest.cc:4)
8 ctest                0x2906 _start + 216
9 ctest                0x282d start + 41

On looking into this more, we found that the Lea malloc was used in the casa_casa library. Removing it cured the problem.

Thanks for the help,

Tim

On 12/07/2007, at 2:54 PM, Tim Cornwell wrote:

> Brian,
>
> I think it's just a symbol clash. A test program linked with just mpicxx
> works fine, but with our typical link it fails. I've narrowed the problem
> down to a single shared library. This is from C++ and the symbols have a
> namespace casa. Weeding out all the casa stuff and some other cruft,
> we're left with:
>
> [nm symbol listing, mangled by the mail archiver (template angle brackets
> stripped): a QuantaProxy::fits() entry plus various std::fill_n,
> std::transform, std::__reverse, and std::isnan instantiations, and the
> typeinfo/vtable symbols for std::invalid_argument]
>
> We're all using the standard gcc of OS X:
>
> $ mpicxx -v
> Using built-in specs.
> Target: i686-apple-darwin8
> Configured with: /private/var/tmp/gcc/gcc-5367.obj~1/src/configure
> --disable-checking -enable-werror --prefix=/usr --mandir=/share/man
> --enable-languages=c,objc,c++,obj-c++
> --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/
> --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib
> --build=powerpc-apple-darwin8 --with-arch=nocona --with-tune=generic
> --program-prefix= --host=i686-apple-darwin8 --target=i686-apple-darwin8
> Thread model: posix
> gcc version 4.0.1 (Apple Computer, Inc. build 5367)
>
> Tim
>
> On 12/07/2007, at 7:57 AM, Brian Barrett wrote:
>
>> That's unexpected. If you run the command 'ompi_info --all', it should
>> list (towards the top) things like the Bindir and Libdir. Can you see if
>> those have sane values? If they do, can you try running a simple hello,
>> world type MPI application (there's one in the OMPI tarball). It almost
>> looks like memory is getting corrupted, which would be very unexpected
>> that early in the process. I'm unable to duplicate the problem with 1.2.3
>> on my Mac Pro, making it all the more strange.
>>
>> Another random thought -- Which compilers did you use to build Open MPI?
>>
>> Brian
>>
>> On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote:
>>
>>> Open MPI: 1.2.3
>>> Open MPI SVN revision: r15136
>>> Open RTE: 1.2.3
>>> Open RTE SVN revision: r15136
>>> OPAL: 1.2.3
Re: [OMPI users] orte_pls_base_select fails
Hi Tim,

Thanks for the follow-up.

On 18 July 2007 at 17:22, Tim Prins wrote:
| > Yes, this helps tremendously. I installed rsh, and now it pretty much
| > works.
| Glad this worked out for you.
|
| > The one missing detail is that I can't seem to get the stdout/stderr
| > output. For example:
| >
| > $ orterun -np 1 uptime
| > $ uptime
| >  18:24:27 up 13 days, 3:03, 0 users, load average: 0.00, 0.03, 0.00
| >
| > The man page indicates that stdout/stderr is supposed to come back to
| > the stdout/stderr of the orterun process. Any ideas on why this isn't
| > working?
| It should work. However, we currently have some I/O forwarding problems
| which show up in some environments that will (hopefully) be fixed in the
| next release. As far as I know, the problem seems to happen mostly with
| non-MPI applications.
|
| Try running a simple MPI application, such as:
|
| #include <stdio.h>
| #include "mpi.h"
| [snip: hello-world source as above]
|
| If that works fine, then it is probably our problem, and not a problem
| with your setup.
|
| Sorry I don't have a better answer :(

That works (and I use the same Debian openmpi 1.2.3-1 set of packages Adam has):

edd@basebud:~> opalcc -o /tmp/openmpitest /tmp/openmpitest.c -lmpi
edd@basebud:~> orterun -np 4 /tmp/openmpitest
Hello, world, I am 2 of 4
Hello, world, I am 1 of 4
Hello, world, I am 0 of 4
Hello, world, I am 3 of 4
edd@basebud:~>

I was toying with this at work earlier, and it was hanging there (using hostname or uptime as the token binaries) as soon as I increased the np parameter beyond 1. It works here:

edd@basebud:~> orterun -np 4 hostname
basebud
basebud
basebud
basebud
edd@basebud:~>

I have slurm-llnl test packages installed at work but not here. Maybe I need to dig a bit more into slurm. (Adam: a slurm package should be forthcoming. I can point you to the snapshots from the fellow whom I mentor on this.)

Dirk

--
Hell, there are no rules here - we're trying to accomplish something.
                                                  -- Thomas A. Edison