Rats, sorry. I seem to recall fixing something in 1.6 that might relate to this: a race condition during startup. You might try updating to the 1.6.4 release candidate.
On Feb 14, 2013, at 11:04 AM, Bharath Ramesh <bram...@vt.edu> wrote:

> When I set OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of
> mca_oob_tcp_message_recv_complete: invalid message type errors,
> and the job just hangs even when all the nodes have fired off the
> MPI application.
>
> --
> Bharath
>
> On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
>> I don't think this is documented anywhere, but it is an available trick (not
>> sure if it is in 1.6.1, but it might be): if you set OPAL_OUTPUT_STDERR_FD=N in
>> your environment, we will direct all of our error output to that file
>> descriptor. If it is "0", then it goes to stdout.
>>
>> Might be worth a try?
>>
>>
>> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>
>>> Is there any way to prevent the output of more than one node from being
>>> written to the same line? I tried setting --output-filename, which
>>> didn't help; for some reason only stdout was written to the files,
>>> which makes an output file of close to 6 MB a little hard to read.
>>>
>>> --
>>> Bharath
>>>
>>> On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
>>>> Sounds like the orteds aren't reporting back to mpirun after launch. The
>>>> MPI_proctable observation just means that the procs didn't launch in those
>>>> cases where it is absent, which is something you already observed.
>>>>
>>>> Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted
>>>> report back to mpirun after it launches. If not, then it is likely that
>>>> something is blocking it.
>>>>
>>>> You could also try updating to 1.6.3/4 in case there is some race
>>>> condition in 1.6.1, though we haven't heard of one to date.
>>>>
>>>>
>>>> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>>>
>>>>> On our cluster we are noticing intermittent job launch failures when using
>>>>> Open MPI. We are currently using Open MPI 1.6.1, integrated with
>>>>> Torque 4.1.3. The launch fails even for a simple MPI hello-world
>>>>> application: orted gets launched on all of the nodes, but a number of
>>>>> nodes never launch the actual MPI application. No errors are reported
>>>>> when the job is killed because the walltime expires, and enabling
>>>>> --debug-daemons doesn't show any errors either. The only difference is
>>>>> that successful runs have MPI_proctable listed, while for failures it is
>>>>> absent. Any help in debugging this issue is greatly appreciated.
>>>>>
>>>>> --
>>>>> Bharath
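
For reference, the debugging steps suggested in this thread can be combined into a single test launch. The sketch below is only an illustration under assumptions not stated in the thread: a Torque-allocated session, Open MPI 1.6.x, and a hello-world test binary named ./hello (a hypothetical name); the -np 16 process count and the /tmp/hello-out output prefix are likewise placeholders.

    # Optional: route Open MPI's internal error output to stdout, as suggested
    # above ("0" means stdout per the thread). Note that Bharath reports this
    # produced mca_oob_tcp_message_recv_complete errors and a hang on 1.6.1.
    export OPAL_OUTPUT_STDERR_FD=0

    # Launch the test program with per-process output files, daemon debugging,
    # and verbose launcher reporting so each orted's report-back is visible.
    mpirun -np 16 \
           --output-filename /tmp/hello-out \
           --debug-daemons \
           -mca plm_base_verbose 5 \
           ./hello

If some orteds never show up in the plm_base_verbose output, that points at whatever is preventing them from reporting back to mpirun, which matches the diagnosis given above.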