George,

There is too much information missing from your example. If I trace the
application code at the top, assuming the process satisfies
is_host(NC.node), I see 3 communications on NC.commd (ignoring the
others):

rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);
MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid, CLOSING_ANDMSG, NC.commd,
MPI_STATUS_IGNORE);
rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);

On the andmsg side I can see only 2 matching communications:

rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid, CLOSING_ANDMSG, NC.commd);
rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG, NC.commd,
MPI_STATUS_IGNORE);

So either there is a pending send (which is treated as an eager send by
OMPI because it is only 4 bytes long), or there is something missing from
the code example. Can you post a more complete example?
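
In particular, for the counts to balance I would expect andmsg to contain
a countdown loop, before the extract you posted, that receives the first
SHUTDOWN_ANDMSG send from every application rank, something along these
lines (only my guess at the counting code, reusing the names from your
snippet):

/* Assumed countdown of shutdown messages in andmsg (not from the
 * posted code; handling of the ordinary debug/status traffic is
 * omitted): one SHUTDOWN_ANDMSG from each application rank
 * decrements num2stop. */
while (num2stop > 0) {
   int ival;
   MPI_Recv(&ival, 1, MPI_INT, MPI_ANY_SOURCE, SHUTDOWN_ANDMSG,
            NC.commd, MPI_STATUS_IGNORE);
   num2stop--;
}

Is this (roughly) what the missing part looks like?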

Thanks,
George.



On Thu, Oct 6, 2016 at 1:53 PM, George Reeke <re...@mail.rockefeller.edu>
wrote:

> Dear colleagues,
> I have a parallel MPI application written in C that works normally in
> a serial version and in the parallel version in the sense that all
> numerical output is correct.  When it tries to shut down, it gives the
> following console error message:
>
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[51524,1],0]
>   Exit code:    13
> -----End quoted console text-----
> The Process name given does not match the PID of any Linux process.
> The Exit code varies from run to run but is always in the range 12 to 17.
> The core dumps produced do not have usable backtrace information.
> There is no output on stderr (besides my debug messages).
> The last message written to stdout by the rank 0 node (and flushed) is lost.
> I cannot determine the cause of the problem.
> Let me be as explicit as possible:
> OS RHEL 6.8, compiler gcc 4.4.7 with -g, no optimization
> Version of MPI (RedHat package): openmpi-1.10-1.10.2-2.el6.x86_64
> The startup command is like this:
> mpirun --output-filename junk -mca btl_tcp_if_include lo -n 1 cnsP0 NOSP :
> -n 3 cnsPn < v8tin/dan
>
> cnsP0 is a master code that reads a control file (specified after the
> '<' on the command line).  The other executables (cnsPn) only send and
> receive messages and do math, no file IO.  I get the same results with
> 3 or 4 compute nodes.
>    Early in startup, another process is started via MPI_Comm_spawn.
> I suspect this is relevant to the problem, although simple test
> programs using the same setup complete normally.  This process,
> andmsg, receives status or debug information asynchronously via
> messages from the other processes and writes them to stderr.
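>    In outline, the spawn call looks roughly like this (a simplified
> sketch; the argument values here are placeholders, not the real ones):
>
>    /* Simplified sketch of the spawn call; details differ in the real
>    *  code.  NC.commc is the original world comm and NC.commd becomes
>    *  the intercommunicator used for all traffic to/from andmsg.  */
>    MPI_Comm_spawn("andmsg", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
>       0 /* root */, NC.commc, &NC.commd, MPI_ERRCODES_IGNORE);
>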
> I have tried many versions of the shutdown code, all with the same
> result.  Here is one version (debug writes, which use fwrite() and
> fflush(), are deleted; comments are modified for clarity):
>
> Application code (cnsP0 and cnsPn):
>    /* Everything works OK up to here (stdout and debug output). */
>    int rc, ival = 0;
>    /* In next line, NC.dmsgid is rank # of andmsg process and
>    *  NC.commd is intercommunicator to it.  andmsg counts these
>    *  shutdown messages, one from each app node.  */
>    rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG,
>       NC.commd);
>    /* This message confirms that andmsg got 4 SHUTDOWN messages.
>    *  "is_host(NC.node)" returns 1 if this is the rank 0 node.  */
>    if (is_host(NC.node)) { MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid,
>       CLOSING_ANDMSG, NC.commd, MPI_STATUS_IGNORE); }
>    /* Results are similar with or without this barrier.  Debug lines
>    *  written on stderr from all nodes after barrier appear OK. */
>    rc = MPI_Barrier(NC.commc);  /* NC.commc is original world comm */
>    /* Behavior is same with or without this extra message exchange,
>    *  which I added to keep andmsg from terminating before the
>    *  barrier among the other nodes completes. */
>    if (is_host(NC.node)) { rc = MPI_Send(&ival, 1, MPI_INT,
>        NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd); }
>    /* Behavior is same with or without this disconnect */
>    rc = MPI_Comm_disconnect(&NC.commd);
>    rc = MPI_Finalize();
>    exit(0);
>
> Spawned process (andmsg) code extract:
>
>    if (num2stop <= 0) { /* Countdown of shutdown messages received */
>       int rc;
>       /* This message confirms to main app that shutdown messages
>       *  were received from all nodes.  */
>       rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid,
>          CLOSING_ANDMSG, NC.commd);
>       /* Receive extra synch message commented above */
>       rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG,
>             NC.commd, MPI_STATUS_IGNORE);
>       sleep(1);   /* Results are same with or without this sleep */
>       /* Results are same with or without this disconnect */
>       rc = MPI_Comm_disconnect(&NC.commd);
>       rc = MPI_Finalize();
>       exit(0);
>       }
>
> I would much appreciate any suggestions on how to debug this.
> Following the suggestions on the community help web page, here is more
> information:
> A bzipped copy of the config.log file is attached.
> Bzipped output of ompi_info --all is attached.
> I am not sending information from other nodes or the network config;
> for test purposes, all processes are running on a single node, my
> laptop with an i7 processor.  I set the "-mca btl_tcp_if_include lo"
> parameter earlier when I got an error message about a refused
> connection (one that my code never asked for in the first place).
> That got rid of the error message, but the application still fails and
> dumps core.
> Thanks,
> George Reeke
>
>