On Sep 26, 2008, at 1:45 PM, Robert Kubrick wrote:

I'm not sure how I should interpret this message:

[local:17344] *** An error occurred in MPI_Testsome
[local:17344] *** on communicator MPI COMMUNICATOR 5 CREATE FROM 0
[local:17344] *** MPI_ERR_TRUNCATE: message truncated
[local:17344] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 17338 on node local exited on signal 15 (Terminated).
3 additional processes aborted (not shown)

I am assuming that the error was triggered because one of the buffers I set in the MPI_Recv_init() calls is too small to hold the incoming message.

Sorry for the delay in replying.

This is likely the cause -- MPI defines receiving a message that is larger than the posted receive buffer as a run-time error (MPI_ERR_TRUNCATE).

However, I don't understand why job rank 0 terminates first. The only process that calls MPI_Testsome is actually rank 3, and it's receiving messages from rank 0.

The aborting process sends a message to kill all the other processes in the job before it dies itself (i.e., to obey the semantics of an MPI abort). Hence, it's likely that there's a race going on here and process 0 dies before 3, so mpirun reports that first.

Also I think it would be a good idea to print the message tag in the error log.


Mm. Good point. I'll file this as a feature request -- we have centralized error reporting for the abort sequence, so it'll take a little noodling to get that in there. Probably won't happen for v1.3[.0], but that's good real-world feedback to have. Thanks!

--
Jeff Squyres
Cisco Systems
