On 9/18/2011 9:12 AM, Evghenii Gaburov wrote:
Hi All,

Update to the original posting: METHOD4 also resulted in a deadlock on system HPC2 after 
5 hours of running with 32 MPI tasks; also, "const int scale=1;" was missing from the code 
snippet in the original posting.

--Evghenii

Date: Sun, 18 Sep 2011 02:06:33 +0000
From: Evghenii Gaburov<e-gabu...@northwestern.edu>
Subject: [OMPI users] custom sparse collective non-reproducible
        deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv
        question
To: "us...@open-mpi.org"<us...@open-mpi.org>

Hi All,

My MPI program's basic task consists of regularly establishing point-to-point 
communication with other procs via MPI_Alltoall, and then communicating the data. 
I tested it on two HPC clusters with 32-256 MPI tasks. On one of the systems 
(HPC1) this custom collective runs flawlessly, while on the other (HPC2) the 
collective causes non-reproducible deadlocks (after a day of running, or after 
a few hours or so). So, I want to figure out whether it is a system (HPC2) bug 
that I can report to the HPC admins, or a subtle bug in my code that needs to 
be fixed. One possibly important point: I communicate a huge amount of data 
between tasks (up to ~2GB) in several all2all calls.

I would like expert eyes to look at the code to confirm or disprove that it is 
deadlock-safe. I have implemented several methods (METHOD1-METHOD4) that, if I am 
not mistaken, should in principle be deadlock-safe. However, as a beginner MPI user, 
I can easily miss something subtle, so I seek your help with this! I mostly used 
METHOD4, which has caused periodic deadlocks, after having seen deadlocks with 
METHOD1 and METHOD2. On HPC1 none of these methods deadlock in my runs. I am 
currently testing METHOD3, so I cannot comment on it yet but will report later; 
in the meantime, I will be happy to hear your comments.

Both systems use openmpi-1.4.3.

Your answers will be of great help! Thanks!

Cheers,
Evghenii

Here is the code snippet:

  template<class T>
    void all2all(std::vector<T> sbuf[], std::vector<T> rbuf[],
        const int myid,
        const int nproc)
    {
        static int nsend[NMAXPROC], nrecv[NMAXPROC];
        for (int p = 0; p < nproc; p++)
          nsend[p] = sbuf[p].size();
        // let the other tasks know how much data they will receive from this one
        MPI_Alltoall(nsend, 1, MPI_INT, nrecv, 1, MPI_INT, MPI_COMM_WORLD);

#ifdef _METHOD1_

        static MPI_Status  stat[NMAXPROC];
        static MPI_Request req[NMAXPROC*2];
        int nreq = 0;
        for (int p = 0; p < nproc; p++)
          if (p != myid)
          {
            const int scount = nsend[p];
            const int rcount = nrecv[p];
            rbuf[p].resize(rcount);
            if (scount > 0) MPI_Isend(&sbuf[p][0], nscount, datatype<T>(), p, 1,
                                      MPI_COMM_WORLD, &req[nreq++]);
            if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount, datatype<T>(), p, 1,
                                      MPI_COMM_WORLD, &req[nreq++]);
          }
        rbuf[myid] = sbuf[myid];
        MPI_Waitall(nreq, req, stat);

Incidentally, here you have up to 2*nproc requests, interleaving sends and receives. Your array of statuses, however, is only NMAXPROC big; I think you need to declare stat[NMAXPROC*2]. Also, do you want scount in place of nscount?
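
For reference, here is a minimal sketch of the METHOD1 branch with those two changes applied (a status array of size NMAXPROC*2, and scount instead of nscount in the send); it only reflects the suggestion above, not a tested fix:

        static MPI_Status  stat[NMAXPROC*2];   // one status slot per request, up to two per peer
        static MPI_Request req[NMAXPROC*2];
        int nreq = 0;
        for (int p = 0; p < nproc; p++)
          if (p != myid)
          {
            const int scount = nsend[p];
            const int rcount = nrecv[p];
            rbuf[p].resize(rcount);
            // post non-blocking send/recv only when there is data to exchange with rank p
            if (scount > 0) MPI_Isend(&sbuf[p][0], scount, datatype<T>(), p, 1,
                                      MPI_COMM_WORLD, &req[nreq++]);
            if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount, datatype<T>(), p, 1,
                                      MPI_COMM_WORLD, &req[nreq++]);
          }
        rbuf[myid] = sbuf[myid];               // local copy for the self-exchange
        MPI_Waitall(nreq, req, stat);          // complete all outstanding requests

Alternatively, passing MPI_STATUSES_IGNORE to MPI_Waitall sidesteps the status-array sizing question altogether.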
