Open MPI usually does emit something more detailed than the equivalent of a 
one-line "process aborted" message.  It also de-duplicates most error/help 
messages, so you should see most kinds of descriptive error messages once, 
potentially followed by a few lines of the "X more messages showed the same 
message" kind.

That being said, Open MPI can usually only report *what* happened -- not 
necessarily *why* it happened.  E.g., Open MPI can tell you that a socket/queue 
pair/whatever closed, but not necessarily *why* it closed.  For example, you 
may get a message from Open MPI that a peer unexpectedly closed its network 
connection when the real issue is that the peer aborted because of a 
segmentation fault (and therefore closed all of its network connections).  In 
this case, both errors may well be reported, but you might need to pick 
carefully through the stream of output to see that one or more processes died 
due to a segmentation fault and the others died due to unexpected network 
connection closures.

In such situations, it's usually necessary to save *all* the output from a job 
-- don't just look at (or have the job queueing system save) the last N lines.  
The *real* error that kicked off the chain of failures that ultimately killed 
the job may appear N+1 lines before the end.
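
As a rough illustration of why the full output matters, here is a minimal 
Python sketch that scans a saved job log from the top and reports the 
*earliest* lines that look like a root cause, rather than the last N lines.  
The patterns and the function name `first_suspicious_lines` are assumptions 
made up for this example, not anything Open MPI provides; adjust the patterns 
to the messages your jobs actually print.

```python
import re
import sys

# Illustrative patterns only (assumptions for this sketch); tune them to
# the messages that actually appear in your own job output.
ROOT_CAUSE_PATTERNS = [
    re.compile(r"segmentation fault", re.IGNORECASE),
    re.compile(r"\bsignal\b", re.IGNORECASE),
    re.compile(r"MPI_ABORT"),
]


def first_suspicious_lines(lines, patterns=ROOT_CAUSE_PATTERNS, limit=5):
    """Return up to `limit` (line_number, text) pairs, earliest first."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py <saved-job-output-file>
    with open(sys.argv[1], errors="replace") as log:
        for number, text in first_suspicious_lines(log):
            print(f"{number}: {text}")
```

Reading from the top is the point: the first match is often the original 
failure, and later matches are frequently just the fallout from it.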

Are you seeing jobs terminate with no output whatsoever?  That would be unusual.

> On Aug 4, 2019, at 8:39 AM, Passant A. Hafez <passant.ha...@kaust.edu.sa> 
> wrote:
> Hello Jeff,
> In short, Yes. 
> To further explain what I meant, I see many problems that simply end in 
> termination of the MPI job, all sharing the same error message (which just 
> says that the process aborted) while the underlying reasons are different: 
> sometimes related to the code, other times related to hardware, 
> networking, or the configuration of InfiniBand.
> When I get such an error, I want details that guide me to which area I 
> should investigate, without spitting out very detailed logs (like the output 
> of strace, for example) that make the actual output of the MPI job 
> harder to read.
> I assume it could be either something enabled during compilation of OMPI 
> itself, or something passed at runtime (which would be better).
> All the best,
> --
> Passant 
> ________________________________________
> From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> Sent: Sunday, July 28, 2019 5:52 PM
> To: Open MPI User's List
> Cc: Passant A. Hafez
> Subject: Re: [OMPI users] Debug OMPI errors
> I'm not sure exactly what you are asking -- can you be more specific?
> Are you asking if Open MPI can emit more detail when an error occurs and the 
> job aborts?
>> On Jul 28, 2019, at 4:12 AM, Passant A. Hafez via users 
>> <users@lists.open-mpi.org> wrote:
>> Hello all,
>> I was wondering if I can enable some reasonable level of debugging for OMPI 
>> errors, especially in the cases that just report that a process is killed 
>> (for example MPI_ABORT was invoked) and that's it.
>> All the best,
>> --
>> Passant
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> --
> Jeff Squyres
> jsquy...@cisco.com

Jeff Squyres
