It must be making contact, or ORTE wouldn't be attempting to launch your 
application's procs. It looks more like the remote daemon never received the 
launch command. Looking at the code, I suspect you're getting caught in a race 
condition that causes the message to get "stuck".

Just to see if that's the case, you might try running this with the 1.7 release 
candidate, or even the developer's nightly build. Both use a different timing 
mechanism intended to resolve such situations.
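
To see at a glance which daemon each proc was mapped to in odls verbose output like the log quoted below, something along these lines may help. It is only a rough sketch: the function name count_procs_per_daemon is made up for illustration, and it assumes the "checking proc ... on daemon N" wording shown in that 1.6.3 log, which other versions may not share.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): tally how many procs were
# mapped to each daemon, based on lines such as
#   checking proc [[49524,1],0] on daemon 1
# read from standard input.
count_procs_per_daemon() {
    awk '/checking proc .* on daemon [0-9]+/ {
        # The daemon id is the field right after the word "daemon".
        for (i = 1; i < NF; i++) {
            if ($i == "daemon") { count[$(i + 1)]++; break }
        }
    }
    END {
        for (d in count) printf "daemon %s: %d procs\n", d, count[d]
    }' | sort
}
```

Save the mpirun output to a file and feed it to the function on standard input, for example `count_procs_per_daemon < mpirun.log` (the file name is just a placeholder). If the counts look right for both daemons but only the local hostnames are printed, that would be consistent with the launch command not reaching the remote daemon.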


On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> Thank you for the help so far.  Here is the information that the debugging 
> gives me.  It looks like the daemon on the non-local node never makes 
> contact.  If I step NP back by two, though, it does.
> 
> Dan
> 
> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
> odls_base_verbose 5 hostname
> [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating 
> nidmap
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 
> data to launch job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new 
> jobdat for job [49524,1]
> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 
> app_contexts
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],0] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],1] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],1] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],2] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],3] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],3] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],4] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],5] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],5] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],6] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],7] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],7] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],8] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],9] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],9] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],10] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],11] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],11] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],12] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],13] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],13] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],14] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],15] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],15] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],16] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],17] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],17] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],18] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],19] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],19] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],20] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],21] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],21] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],22] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],23] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],23] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],24] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],25] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],25] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],26] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],27] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],27] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],28] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],29] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],29] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],30] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],31] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],31] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],32] on daemon 1
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
> checking proc [[49524,1],33] on daemon 0
> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
> proc [[49524,1],33] for me!
> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local list
> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors for 
> 17 children and locally set oversubscribed to false
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],1]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],3]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],5]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],7]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],9]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],11]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],13]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],15]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],17]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],19]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],21]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],23]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],25]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],27]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],29]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],31]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
> [[49524,1],33]
> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job [49524,1] 
> launch status
> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch report to 
> myself
> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44857 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44858 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44859 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44860 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44861 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44862 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44863 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44865 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44866 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44867 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44869 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44870 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44871 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44872 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44873 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44874 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
> 44875 terminated
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],33] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],31] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],29] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],27] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],25] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],23] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],21] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],19] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],17] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],15] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],13] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],11] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],9] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],7] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],5] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],3] terminated normally
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
> [[49524,1],1] terminated normally
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> compute-2-1.local
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],25]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],15]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],11]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],13]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],19]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],9]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],17]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],31]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],7]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],21]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],5]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],33]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],23]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],3]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],29]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],27]
> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
> [[49524,1],1]
> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all 
> procs in [49524,1] terminated
> ^Cmpirun: killing job...
> 
> Killed by signal 2.
> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on 
> WILDCARD
> 
> 
> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>> Sorry - I forgot that you built from a tarball, and so debug isn't enabled 
>> by default. You need to configure --enable-debug.
>> 
>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>> 
>>> Oddly enough, adding this debugging info lowered the number of processes 
>>> that can be used from 46 down to 42.  When I run the MPI job, it fails, 
>>> giving only the information that follows:
>>> 
>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>> compute-2-0,compute-2-1 -v  -np 44 --leave-session-attached -mca 
>>> odls_base_verbose 5 hostname
>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying component 
>>> [default]
>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of component 
>>> [default] set priority to 1
>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected component 
>>> [default]
>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying component 
>>> [default]
>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of component 
>>> [default] set priority to 1
>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected component 
>>> [default]
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> compute-2-1.local
>>> 
>>> 
>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>> It wouldn't be ssh - in both cases, only one ssh is being done to each 
>>>> node (to start the local daemon). The only difference is the number of 
>>>> fork/exec's being done on each node, and the number of file descriptors 
>>>> being opened to support those fork/exec's.
>>>> 
>>>> It certainly looks like your limits are high enough. When you say it 
>>>> "fails", what do you mean - what error does it report? Try adding:
>>>> 
>>>> --leave-session-attached -mca odls_base_verbose 5
>>>> 
>>>> to your cmd line - this will report all the local proc launch debug and 
>>>> hopefully show you a more detailed error report.
>>>> 
>>>> 
>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>> 
>>>>> I have had to cobble together two machines in our Rocks cluster without 
>>>>> using the standard installation.  They have an EFI-only BIOS and Rocks 
>>>>> doesn't like that, so this was the only workaround.
>>>>> 
>>>>> Everything works great now, except for one thing.  MPI jobs (openmpi or 
>>>>> mpich) fail when started from one of these nodes (via qsub or by logging 
>>>>> in and running the command) if 24 or more processors are needed on 
>>>>> another system.  However if the originator of the MPI job is the headnode 
>>>>> or any of the preexisting compute nodes, it works fine.  Right now I am 
>>>>> guessing ssh client or ulimit problems, but I cannot find any difference. 
>>>>>  Any help would be greatly appreciated.
>>>>> 
>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>> 
>>>>> Examples:
>>>>> 
>>>>> This works, prints 23 hostnames from each machine:
>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>> 
>>>>> This does not work, prints 24 hostnames for compute-2-1
>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>> 
>>>>> These both work, print 64 hostnames from each node
>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>> 
>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>> core file size          (blocks, -c) 0
>>>>> data seg size           (kbytes, -d) unlimited
>>>>> scheduling priority             (-e) 0
>>>>> file size               (blocks, -f) unlimited
>>>>> pending signals                 (-i) 16410016
>>>>> max locked memory       (kbytes, -l) unlimited
>>>>> max memory size         (kbytes, -m) unlimited
>>>>> open files                      (-n) 4096
>>>>> pipe size            (512 bytes, -p) 8
>>>>> POSIX message queues     (bytes, -q) 819200
>>>>> real-time priority              (-r) 0
>>>>> stack size              (kbytes, -s) unlimited
>>>>> cpu time               (seconds, -t) unlimited
>>>>> max user processes              (-u) 1024
>>>>> virtual memory          (kbytes, -v) unlimited
>>>>> file locks                      (-x) unlimited
>>>>> 
>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>> Host *
>>>>>        CheckHostIP             no
>>>>>        ForwardX11              yes
>>>>>        ForwardAgent            yes
>>>>>        StrictHostKeyChecking   no
>>>>>        UsePrivilegedPort       no
>>>>>        Protocol                2,1
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

