>>>>> "Ralph" == Ralph Castain <r...@open-mpi.org> writes:

    Ralph> On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

    >>>>>>> "Ralph" == Ralph Castain <r...@open-mpi.org> writes:
    >> 
    Ralph> I'm not sure why the group communicator would make a
    Ralph> difference - the code area in question knows nothing about
    Ralph> the mpi aspects of the job. It looks like you are hitting a
    Ralph> race condition that causes a particular internal recv to
    Ralph> not exist when we subsequently try to cancel it, which
    Ralph> generates that error message.  How did you configure OMPI?
    >> 
    >> Thank you for the reply!
    >> 
    >> Must be some race problem, but I have no control of it, or do
    >> I?

    Ralph> Not really. What I don't understand is why your code would
    Ralph> work fine when using comm_world, but encounter a race
    Ralph> condition when using comm groups. There shouldn't be any
    Ralph> timing difference between the two cases.

Fixing a race condition is sometimes as easy as putting some variables
into arrays. I just did that for one of them, but it didn't help. I'll
do some more testing in this direction, but I am running out of ideas.
When you set ngrp=1 and uncomment the other mpi_comm_spawn line in the
program, you basically get only one spawn, so there is no opportunity
for a race condition. In my real project I usually work with many
spawn calls, all of them using mpi_comm_world but running different
programs, etc., and that always works. This time I want to localize
the mpi_comm_spawn calls with a trick similar to the one in the
program I sent, so this small test case is a good model of what I
would like to have. I studied the MPI-2 standard and I think I got it
right, but one never knows...
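
To make the idea concrete, here is a rough sketch of the pattern: split
mpi_comm_world into ngrp sub-communicators and let each one do its own
spawn. It is written in Python with mpi4py and is not the test program
I sent. The names group_color and spawn_per_group, the contiguous
rank-to-group mapping, and the choice of root=0 are assumptions made
only for this illustration.

```python
def group_color(rank, size, ngrp):
    """Map a world rank to a group index in 0..ngrp-1 (contiguous blocks).

    Hypothetical helper: the mapping is an assumption for illustration,
    any deterministic mapping would do as the "color" of a split.
    """
    if not (1 <= ngrp <= size):
        raise ValueError("need 1 <= ngrp <= size")
    if not (0 <= rank < size):
        raise ValueError("rank out of range")
    return rank * ngrp // size


def spawn_per_group(command, ngrp, maxprocs=1):
    """Split COMM_WORLD into ngrp groups; each group spawns collectively.

    Hypothetical helper. mpi4py is imported lazily so that group_color
    can be used and checked without an MPI installation.
    """
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank = world.Get_rank()
    color = group_color(rank, world.Get_size(), ngrp)

    # Ranks with the same color end up in the same sub-communicator.
    local = world.Split(color, rank)

    # Spawn is collective over `local` only, so with ngrp > 1 several
    # independent spawns are in flight at the same time.
    children = local.Spawn(command, args=[], maxprocs=maxprocs, root=0)
    return local, children
```

With ngrp=1 every rank gets color 0, the split is equivalent to using
the world communicator, and there is a single spawn, which matches the
case that works for me.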

    Ralph> I'll have to take a look and see if I can spot something in
    Ralph> the code...

Thanks a lot -- Milan
