I ran 15 test jobs using 1.6.4rc3, and all of them were successful, unlike with 1.6.1, where around 40% of my jobs would fail. Thanks for the help; I really appreciate it.
--
Bharath

On Thu, Feb 14, 2013 at 11:59:06AM -0800, Ralph Castain wrote:
> Rats - sorry.
>
> I seem to recall fixing something in 1.6 that might relate to this - a race
> condition in the startup. You might try updating to the 1.6.4 release
> candidate.
>
> On Feb 14, 2013, at 11:04 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>
>> When I set OPAL_OUTPUT_STDERR_FD=0, I receive a whole bunch of
>> "mca_oob_tcp_message_recv_complete: invalid message type" errors,
>> and the job just hangs even when all the nodes have fired off the
>> MPI application.
>>
>> --
>> Bharath
>>
>> On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
>>> I don't think this is documented anywhere, but it is an available trick
>>> (not sure if it is in 1.6.1, but might be): if you set
>>> OPAL_OUTPUT_STDERR_FD=N in your environment, we will direct all our error
>>> output to that file descriptor. If it is "0", then it goes to stdout.
>>>
>>> Might be worth a try?
>>>
>>> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>>
>>>> Is there any way to prevent the output of more than one node from
>>>> being written to the same line? I tried setting --output-filename,
>>>> which didn't help; for some reason only stdout was written to the
>>>> files, which makes an output file close to 6 MB a little hard to read.
>>>>
>>>> --
>>>> Bharath
>>>>
>>>> On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
>>>>> Sounds like the orteds aren't reporting back to mpirun after launch.
>>>>> The MPI_proctable observation just means that the procs didn't launch
>>>>> in those cases where it is absent, which is something you already
>>>>> observed.
>>>>>
>>>>> Set "-mca plm_base_verbose 5" on your cmd line. You should see each
>>>>> orted report back to mpirun after it launches. If not, then it is
>>>>> likely that something is blocking it.
>>>>>
>>>>> You could also try updating to 1.6.3/4 in case there is some race
>>>>> condition in 1.6.1, though we haven't heard of it to date.
>>>>>
>>>>> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>>>>
>>>>>> On our cluster we are noticing intermittent job launch failures when
>>>>>> using Open MPI. We are currently using Open MPI 1.6.1 on our cluster,
>>>>>> integrated with Torque 4.1.3. It fails even for a simple MPI hello
>>>>>> world application. The issue is that orted gets launched on all the
>>>>>> nodes, but there are a bunch of nodes that don't launch the actual
>>>>>> MPI application. No errors are reported when the job gets killed
>>>>>> because the walltime expires, and enabling --debug-daemons doesn't
>>>>>> show any errors either. The only difference is that successful runs
>>>>>> have MPI_proctable listed, while for failures it is absent. Any help
>>>>>> in debugging this issue is greatly appreciated.
>>>>>>
>>>>>> --
>>>>>> Bharath
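For anyone hitting a similar launch hang, the first diagnostic suggested in this thread translates into a command along these lines. A minimal sketch, assuming a 16-process run of a hypothetical ./hello_mpi binary inside the Torque allocation:

    # Ask the launcher (plm) to log each orted reporting back to mpirun;
    # any daemon that never reports back is the one to investigate.
    $ mpirun -np 16 -mca plm_base_verbose 5 ./hello_mpi 2>&1 | tee launch.log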
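To keep output from different nodes from interleaving on the same line, the --output-filename option discussed above redirects each process's output to its own file; note Bharath's observation that on 1.6.1 only stdout landed in those files. Again a sketch with placeholder names, and the per-rank file-naming scheme varies across Open MPI releases:

    # One output file per rank under /tmp/run, e.g. out.0, out.1, ...
    $ mpirun -np 16 --output-filename /tmp/run/out ./hello_mpi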
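And the undocumented trick Ralph describes for capturing the error output that --output-filename missed: point OPAL_OUTPUT_STDERR_FD at a file descriptor, with "0" meaning stdout. A sketch, with the same placeholder binary and a hypothetical log name:

    # Send Open MPI's internal error output to stdout, then capture everything.
    $ export OPAL_OUTPUT_STDERR_FD=0
    $ mpirun -np 16 ./hello_mpi > combined.log 2>&1

As the thread shows, on 1.6.1 this surfaced the "mca_oob_tcp_message_recv_complete: invalid message type" errors, which prompted the suggestion to move to the 1.6.4 release candidate and its startup race-condition fix.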