Wow, this sparked a much more heated discussion than I was expecting. I
was just commenting that the behaviour the original author (Federico
Sacerdoti) mentioned would explain something I observed in one of my
early trials of Open MPI. But anyway, since it seems that quite a few
people were interested, I've attached a simplified version of the test I
was describing (with all the timing checks and some of the crazier
output removed).
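
(If anyone wants to run it: it's plain MPI C++, so the usual wrapper
compiler should work. The file/binary names below are only examples, and
the optional argument is the number of sends each non-root rank does,
default 1000.)

  mpicxx unexpected_test.cc -o unexpected_test
  mpirun -np 4 ./unexpected_test 5000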

Now that I go back and retest this, it turns out that it wasn't actually
a segfault that was killing it, but running out of memory, as you and
others predicted.

Brian W. Barrett brbarret-at-open-mpi.org wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
> 
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? 
> There was definitely a bug in 1.2.4 that could cause exactly the behavior 
> you are describing when using the shared memory BTL, due to a silly 
> delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4 and actually the SM BTL seems to be the
hardest to break (I guess I'm dodging the bullet on that delayed
initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the 
> possibility of running the receiver out of memory if the unexpected queue 
> grows without bounds.  I'll withhold my opinion on what the standard says 
> so that we can perhaps actually help you solve your problem and stay out 
> of the weeds :).  Note however, that in general unexpected messages are a 
> bad idea and thousands of them from one peer to another should be avoided 
> at all costs -- this is just good MPI programming practice.

Actually, I was expecting to break something with this test; I just
wanted to find out where it broke. Lesson learned: I wrote my more
serious programs doing exactly that (no unexpected messages). I was just
surprised that the default Open MPI settings allowed me to flood the
system so easily, whereas MPICH/MX still finished no matter what I threw
at it (albeit with terrible performance in the bad cases).
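
For reference, here's a minimal sketch of what I mean by avoiding
unexpected messages (not taken from my real code; names, sizes and the
message count are just placeholders): the receiver pre-posts all of its
receives with MPI_Irecv, everyone synchronizes on a barrier, and only
then do the senders start, so every incoming message already has a
matching receive waiting for it.

#include <mpi.h>
#include <vector>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, numprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

  const int reps = 1000; // messages per sending rank (placeholder)
  if (rank == 0) {
    int total = (numprocs - 1) * reps;
    std::vector<char> buf(total);
    std::vector<MPI_Request> reqs(total);
    // Pre-post every receive before any sender is allowed to start.
    for (int src = 1; src < numprocs; ++src)
      for (int i = 0; i < reps; ++i) {
        int k = (src - 1) * reps + i;
        MPI_Irecv(&buf[k], 1, MPI_CHAR, src, 0, MPI_COMM_WORLD, &reqs[k]);
      }
    MPI_Barrier(MPI_COMM_WORLD); // receives are posted; senders may go
    if (total > 0)
      MPI_Waitall(total, &reqs[0], MPI_STATUSES_IGNORE);
  } else {
    MPI_Barrier(MPI_COMM_WORLD); // don't send until rank 0 has posted
    char c = 0;
    for (int i = 0; i < reps; ++i)
      MPI_Send(&c, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}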

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including 
> the very slow part) by using the CM PML (--mca pml cm on the mpirun 
> command line), which will use the MX library message matching and 
> unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory
usage from going wild due to the unexpected messages.
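
For anyone following along, the option really does just go on the normal
mpirun line, e.g. (process count and program name here are only
examples):

  mpirun --mca pml cm -np 4 ./unexpected_test 5000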

Thanks for your help! (and to the others for the educational discussion!)

> 
> Brian
> 
> 
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
> 
>> That would make sense. I was able to break Open MPI by having Node A
>> wait for messages from Node B. Node B is in fact sleeping while Node C
>> bombards Node A with a few thousand messages. After a while Node B
>> wakes up and sends Node A the message it's been waiting on, but Node A
>> has long since been buried and segfaults. If I decrease the number of
>> messages C is sending, it works properly. This was on Open MPI 1.2.4
>> (using, I think, the SM BTL; it might have been MX or TCP, but
>> certainly not InfiniBand). I could dig up the test and try again if
>> anyone is seriously curious.
>>
>> Trying the same test on MPICH/MX went very very slow (I don't think they
>> have any clever buffer management) but it didn't crash.
>>
>> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com wrote:
>>> Hi,
>>>
>>> I am readying an openmpi 1.2.5 software stack for use with a
>>> many-thousand core cluster. I have a question about sending small
>>> messages that I hope can be answered on this list.
>>>
>>> I was under the impression that if node A wants to send a small MPI
>>> message to node B, it must have a credit to do so. The credit assures A
>>> that B has enough buffer space to accept the message. Credits are
>>> required by the mpi layer regardless of the BTL transport layer used.
>>>
>>> I have been told by a Voltaire tech that this is not so: the credits
>>> are used by the InfiniBand transport layer to reliably send a message,
>>> and are not an Open MPI feature.
>>>
>>> Thanks,
>>> Federico
>>>


-- 
--Kris

叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]
#include <mpi.h>
#include <iostream>
#include <list>
#include <vector>

#include <stdlib.h> //for atoi (in case someone doesn't have boost)
#include <unistd.h> //for sleep() and getpid()

const int buflen=5000;

int main(int argc, char *argv[]) {
  using namespace std;
  int reps=1000;
  if(argc>1){ //optionally specify number of repeats on the command line
    reps=atoi(argv[1]);
  }

  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  cerr << "Process "<<rank<<" ("<< getpid()<<") on "<<processor_name<<" out of "<<numprocs<<"\n";
  if(rank>0){
    cerr << "Process "<<rank<<" sleeping..."<<endl;
    sleep(rank);
    cerr << "Process "<<rank<<" sending..."<<endl;
    list<MPI_Request> sendQ;
    char somememory;

    for(int i=0;i<reps;i++){
      MPI_Status s;
      int f=0;
      for(list<MPI_Request>::iterator ite=sendQ.begin();ite!=sendQ.end();){
	MPI_Test(&*ite,&f,&s);
	if(f){
	  list<MPI_Request>::iterator j=ite;
	  ++ite;
	  sendQ.erase(j);
	}else{
	  break; //these should be received in order, so if we have a pending one, stop there.
	}
      }

      sendQ.push_back(MPI_Request());
      MPI_Issend(&somememory,0,MPI_CHAR,0,0,MPI_COMM_WORLD,&sendQ.back());
    }
    cerr << "Process "<<rank<<" waiting on remaining "<< sendQ.size() << " sends..."<<endl;
    while(sendQ.size()){
      for(list<MPI_Request>::iterator i=sendQ.begin();i!=sendQ.end();){
	MPI_Status status;
	int finished=0;
	int ret=MPI_Test(&*i,&finished,&status);
	if(finished){
	  list<MPI_Request>::iterator j=i;
	  i++;
	  sendQ.erase(j);
	}else{
	  i++;
	}
      }
    }
    cerr << "Process "<<rank<<" cleanup done."<<endl;
  }else{
    cerr << "Master proc sleeping...\n"<<endl;
    sleep(numprocs+2);
    cerr << "Mast proc wakin'"<<endl;
    int expected=(numprocs-1)*reps;
    int tick=expected/100; //print a progress tick every ~1%
    if(tick<1) tick=1; //avoid a zero step when expected is small
    int nextTick=tick;
    char somememory;
    for(int count=0;count<expected;count++){
      MPI_Status status;
      int nextSender=numprocs-(count/reps)-1; //receive messages from last sender first
      MPI_Recv(&somememory,0,MPI_CHAR,nextSender,0,MPI_COMM_WORLD,&status);
      //as an alternate, this causes fewer unexpected messages, but can still use up an absurd amount of ram!
      //MPI_Recv(&somememory,0,MPI_CHAR,MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&status);
      int recv_count=0;
      MPI_Get_count(&status,MPI_CHAR,&recv_count);
      if(count==nextTick){
	cerr << "*";
	nextTick+=tick;
      }
    }
    cerr << endl;
    cerr << "All messages accounted for!\n";
  }

  MPI_Barrier(MPI_COMM_WORLD);

  if(rank==0)
    cerr << "All procs done."<<endl;

  MPI_Finalize();
}
