George,

There is too much information missing from your example. If I try to run the code at the top, assuming the process is the one for which is_host(NC.node) is true, I see 3 communications on NC.commd (ignoring the others):

  rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);
  MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid, CLOSING_ANDMSG, NC.commd,
           MPI_STATUS_IGNORE);
  rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);

On the andmsg side I can only see 2 matching communications:

  rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid, CLOSING_ANDMSG, NC.commd);
  rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG, NC.commd,
                MPI_STATUS_IGNORE);

So either there is a pending send (which is treated as an eager send by Open MPI because it is only 4 bytes long), or there is something missing from the code example.
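To make the imbalance concrete, here is a minimal two-rank reduction of the same pattern (my own sketch, with hypothetical tag values and plain MPI_COMM_WORLD standing in for your intercommunicator, not your actual code). Rank 0 posts the three operations visible on the host side, rank 1 only the two visible in the andmsg extract, so one 4-byte SHUTDOWN send completes eagerly on the sender but is never matched:

  #include <mpi.h>

  #define SHUTDOWN_ANDMSG 1   /* hypothetical tag values */
  #define CLOSING_ANDMSG  2

  int main(int argc, char **argv)
  {
      int rank, ival = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {          /* plays the host: 3 operations */
          MPI_Send(&ival, 1, MPI_INT, 1, SHUTDOWN_ANDMSG, MPI_COMM_WORLD);
          MPI_Recv(&ival, 1, MPI_INT, 1, CLOSING_ANDMSG, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          /* 4 bytes: the buffer is copied eagerly and MPI_Send returns,
           * even though no receive will ever match this message. */
          MPI_Send(&ival, 1, MPI_INT, 1, SHUTDOWN_ANDMSG, MPI_COMM_WORLD);
      } else if (rank == 1) {   /* plays andmsg: only 2 operations */
          MPI_Recv(&ival, 1, MPI_INT, 0, SHUTDOWN_ANDMSG, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          MPI_Send(&ival, 1, MPI_INT, 0, CLOSING_ANDMSG, MPI_COMM_WORLD);
          /* missing: a receive for the second SHUTDOWN message */
      }

      MPI_Finalize();
      return 0;
  }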
Can you post a more complete example?

  Thanks,
    George.

On Thu, Oct 6, 2016 at 1:53 PM, George Reeke <re...@mail.rockefeller.edu> wrote:
> Dear colleagues,
> I have a parallel MPI application written in C that works normally in
> a serial version, and the parallel version works in the sense that all
> numerical output is correct. When it tries to shut down, it gives the
> following console error message:
>
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[51524,1],0]
>   Exit code: 13
> -----End quoted console text-----
>
> The Process name given is not the number of any Linux process.
> The Exit code given seems to be any number in the range 12 to 17.
> The core dumps produced do not have usable backtrace information.
> There is no output on stderr (besides my debug messages).
> The last message written by the rank 0 node on stdout and flushed is lost.
> I cannot determine the cause of the problem.
>
> Let me be as explicit as possible:
> OS: RHEL 6.8; compiler: gcc 4.4.7 with -g, no optimization.
> Version of MPI (RedHat package): openmpi-1.10-1.10.2-2.el6.x86_64
> The startup command is like this:
>
>   mpirun --output-filename junk -mca btl_tcp_if_include lo -n 1 cnsP0 NOSP : -n 3 cnsPn < v8tin/dan
>
> cnsP0 is a master code that reads a control file (specified after the
> '<' on the command line). The other executables (cnsPn) only send and
> receive messages and do math, no file I/O. I get the same results with
> 3 or 4 compute nodes.
> Early in startup, another process is started via MPI_Comm_spawn.
> I suspect this is relevant to the problem, although simple test
> programs using the same setup complete normally. This process,
> andmsg, receives status or debug information asynchronously via
> messages from the other processes and writes it to stderr.
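Side note: in outline, that startup step presumably looks something like the minimal sketch below. I am guessing the executable name "andmsg" and a single spawned copy from the description; this is not your actual startup code. The child side would recover its end of the intercommunicator with MPI_Comm_get_parent():

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm commd;   /* intercommunicator to the spawned logger */
      int errcode;

      MPI_Init(&argc, &argv);
      /* Spawn one copy of the logger; every rank of the spawning job
       * gets back the same intercommunicator. */
      MPI_Comm_spawn("andmsg", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                     0 /* root */, MPI_COMM_WORLD, &commd, &errcode);
      /* ... application runs, sending status messages over commd ... */
      MPI_Comm_disconnect(&commd);
      MPI_Finalize();
      return 0;
  }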
> I have tried many versions of the shutdown code, all with the same
> result. Here is one version (debug writes, using fwrite() and
> fflush(), are deleted; comments modified for clarity):
>
> Application code (cnsP0 and cnsPn):
>
>   /* Everything works OK up to here (stdout and debug output). */
>   int rc, ival = 0;
>   /* In the next line, NC.dmsgid is the rank # of the andmsg process
>    * and NC.commd is the intercommunicator to it. andmsg counts these
>    * shutdown messages, one from each app node. */
>   rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG,
>                 NC.commd);
>   /* This message confirms that andmsg got 4 SHUTDOWN messages.
>    * "is_host(NC.node)" returns 1 if this is the rank 0 node. */
>   if (is_host(NC.node)) {
>       MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid, CLOSING_ANDMSG,
>                NC.commd, MPI_STATUS_IGNORE);
>   }
>   /* Results are similar with or without this barrier. Debug lines
>    * written on stderr from all nodes after the barrier appear OK. */
>   rc = MPI_Barrier(NC.commc);   /* NC.commc is original world comm */
>   /* Behavior is the same with or without this extra message exchange,
>    * which I added to keep andmsg from terminating before the
>    * barrier among the other nodes completes. */
>   if (is_host(NC.node)) {
>       rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG,
>                     NC.commd);
>   }
>   /* Behavior is the same with or without this disconnect */
>   rc = MPI_Comm_disconnect(&NC.commd);
>   rc = MPI_Finalize();
>   exit(0);
>
> Spawned process (andmsg) code extract:
>
>   if (num2stop <= 0) {   /* Countdown of shutdown messages received */
>       int rc;
>       /* This message confirms to the main app that shutdown messages
>        * were received from all nodes. */
>       rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid,
>                     CLOSING_ANDMSG, NC.commd);
>       /* Receive the extra synch message commented on above */
>       rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG,
>                     NC.commd, MPI_STATUS_IGNORE);
>       sleep(1);   /* Results are the same with or without this sleep */
>       /* Results are the same with or without this disconnect */
>       rc = MPI_Comm_disconnect(&NC.commd);
>       rc = MPI_Finalize();
>       exit(0);
>   }
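The countdown loop that actually consumes the per-node SHUTDOWN messages is not part of this extract, and that is exactly the part I cannot check. Purely for orientation, a balanced shutdown path on the andmsg side, in which every send in the two extracts has a visible matching receive, might look like the following sketch (a hypothetical helper with made-up tag values, reusing the names from the extracts; not your actual code):

  #include <mpi.h>

  /* Hypothetical tag values; the real definitions are elsewhere in
   * the application and were not posted. */
  #define SHUTDOWN_ANDMSG 1
  #define CLOSING_ANDMSG  2

  static void andmsg_shutdown(MPI_Comm commd, int hostid, int nnodes)
  {
      int i, ival, num2stop = nnodes;

      /* Drain one SHUTDOWN message per app node (the countdown the
       * comment in the extract refers to). */
      for (i = 0; i < nnodes; i++) {
          MPI_Recv(&ival, 1, MPI_INT, MPI_ANY_SOURCE, SHUTDOWN_ANDMSG,
                   commd, MPI_STATUS_IGNORE);
          num2stop--;
      }
      /* Confirm to the host that every SHUTDOWN arrived. */
      MPI_Send(&num2stop, 1, MPI_INT, hostid, CLOSING_ANDMSG, commd);
      /* Match the host's post-barrier SHUTDOWN so no send is left
       * pending when the communicator is torn down. */
      MPI_Recv(&ival, 1, MPI_INT, hostid, MPI_ANY_TAG, commd,
               MPI_STATUS_IGNORE);
      MPI_Comm_disconnect(&commd);
  }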
> I would much appreciate any suggestions on how to debug this.
> Following the suggestions at the community help web page, here is
> some more information:
> The config.log file, bzipped, is attached.
> The ompi_info --all output, bzipped, is attached.
> I am not sending information from other nodes or the network
> config--for test purposes, all processes are running on one node, my
> laptop with an i7 processor. I set the "-mca btl_tcp_if_include lo"
> parameter earlier when I got an error message about a refused
> connection (one that my code never asked for in the first place).
> This got rid of that error message, but the application still fails
> and dumps core.
> Thanks,
> George Reeke

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users