Rats, sorry. I seem to recall fixing something in 1.6 that might relate to this: a race condition during startup. You might try updating to the 1.6.4 release candidate.
On Feb 14, 2013, at 11:04 AM, Bharath Ramesh <bram...@vt.edu> wrote:

> When I set OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of
> mca_oob_tcp_message_recv_complete: invalid message type errors,
> and the job just hangs even when all the nodes have fired off the
> MPI application.
>
> --
> Bharath
>
> On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
>> I don't think this is documented anywhere, but it is an available trick (not
>> sure if it is in 1.6.1, but it might be): if you set OPAL_OUTPUT_STDERR_FD=N in
>> your environment, we will direct all of our error output to that file
>> descriptor. If it is "0", then it goes to stdout.
>>
>> Might be worth a try?
>>
>>
>> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>
>>> Is there any way to prevent the output of more than one node from being
>>> written to the same line? I tried setting --output-filename, which
>>> didn't help; for some reason only stdout was written to the files,
>>> which makes an output file of close to 6 MB a little hard to read.
>>>
>>> --
>>> Bharath
>>>
>>> On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
>>>> Sounds like the orteds aren't reporting back to mpirun after launch. The
>>>> MPI_proctable observation just means that the procs didn't launch in those
>>>> cases where it is absent, which is something you already observed.
>>>>
>>>> Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted
>>>> report back to mpirun after it launches. If not, then it is likely that
>>>> something is blocking it.
>>>>
>>>> You could also try updating to 1.6.3/4 in case there is some race
>>>> condition in 1.6.1, though we haven't heard of one to date.
>>>>
>>>>
>>>> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>>>>
>>>>> On our cluster we are noticing intermittent job launch failures when using
>>>>> Open MPI. We are currently using Open MPI 1.6.1, integrated with
>>>>> Torque 4.1.3. The launch fails even for a simple MPI hello-world
>>>>> application: orted gets launched on all of the nodes, but a number of
>>>>> nodes never launch the actual MPI application. No errors are reported
>>>>> when the job is killed because the walltime expires, and enabling
>>>>> --debug-daemons doesn't show any errors either. The only difference is
>>>>> that successful runs have MPI_proctable listed, while for failures it is
>>>>> absent. Any help in debugging this issue is greatly appreciated.
>>>>>
>>>>> --
>>>>> Bharath
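
For reference, the debugging steps suggested in this thread can be combined into a single test launch. The sketch below is only an illustration under assumptions not stated in the thread: a Torque-allocated session, Open MPI 1.6.x, and a hello-world test binary named ./hello (a hypothetical name); the -np 16 process count and the /tmp/hello-out output prefix are likewise placeholders.

    # Optional: route Open MPI's internal error output to stdout, as suggested
    # above ("0" means stdout per the thread). Note that Bharath reports this
    # produced mca_oob_tcp_message_recv_complete errors and a hang on 1.6.1.
    export OPAL_OUTPUT_STDERR_FD=0

    # Launch the test program with per-process output files, daemon debugging,
    # and verbose launcher reporting so each orted's report-back is visible.
    mpirun -np 16 \
           --output-filename /tmp/hello-out \
           --debug-daemons \
           -mca plm_base_verbose 5 \
           ./hello

If some orteds never show up in the plm_base_verbose output, that points at whatever is preventing them from reporting back to mpirun, which matches the diagnosis given above.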