I should know OMPI better than I do, but generally, when you make an MPI call, the library can end up doing all kinds of other work. In particular, with non-blocking point-to-point operations, a message can make progress during some other, unrelated MPI call. For example:

call MPI_Irecv(..., recv_req, ierr)
call MPI_Isend(..., send_req, ierr)
call MPI_Wait(send_req, status, ierr)
call MPI_Wait(recv_req, status, ierr)

A receive is started in one call and completed in another, but it's quite possible that most of the data transfer (and waiting time) occurs while the program is in the calls associated with the send. The accounting gets tricky.
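
To make the accounting concrete, here is a minimal sketch of that sequence with a timer around each wait (my own illustration: the message size, tag, and two-rank setup are all made up).  Much of the receive's transfer time may show up in the wait on the send request:

   program wait_accounting
     use mpi
     implicit none
     integer :: rank, peer, ierr, send_req, recv_req
     integer :: status(MPI_STATUS_SIZE)
     double precision :: t0, t_swait, t_rwait
     double precision :: sbuf(100000), rbuf(100000)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     peer = 1 - rank                      ! assumes exactly two ranks

     sbuf = dble(rank)
     call MPI_Irecv(rbuf, size(rbuf), MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, recv_req, ierr)
     call MPI_Isend(sbuf, size(sbuf), MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, send_req, ierr)

     t0 = MPI_Wtime()
     call MPI_Wait(send_req, status, ierr)
     t_swait = MPI_Wtime() - t0           ! may include receive-side progress

     t0 = MPI_Wtime()
     call MPI_Wait(recv_req, status, ierr)
     t_rwait = MPI_Wtime() - t0           ! often short: data already arrived

     print '(A,I0,A,F10.6,A,F10.6,A)', 'rank ', rank, &
           ': wait(send) = ', t_swait, ' s, wait(recv) = ', t_rwait, ' s'
     call MPI_Finalize(ierr)
   end program wait_accounting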

So, I'm guessing that during the second barrier, MPI is busy making progress on the pending non-blocking point-to-point operations, wherever progress is possible. It isn't purely a barrier operation.
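
A side note on the MPI_TEST polling loop in the code below: when MPI_TEST reports completion, it frees the request and sets it to MPI_REQUEST_NULL, which is what keeps the counter from counting a receive twice.  If the order in which completed receives are processed doesn't matter, MPI_WAITANY drains them without spinning.  A sketch, reusing the names from the excerpt, with a hypothetical process_received_data() standing in for the elided call:

   integer :: ireq, ndone, icommerr
   integer :: mystatus(MPI_STATUS_SIZE)

   ndone = 0
   do while (ndone < nreq)
      ! block until one of the pending receives completes; ireq is its index
      call MPI_WAITANY(nreq, recv_req, ireq, mystatus, icommerr)
      if (ireq == MPI_UNDEFINED) exit   ! no active requests remain
      call process_received_data(ireq)  ! hypothetical stand-in for "call ...."
      ndone = ndone + 1
   enddo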

On 9/8/2011 8:04 AM, Ghislain Lartigue wrote:
This behavior happens at every call (the first one and all following ones).


Here is my code (simplified):

================================================================
start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime()-start_time
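! (presumably: elapsed barrier time in nanoseconds per point of a 36^3 block)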
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 1 barrier "//trim(local_time),0)

             do conn_index_id=1, Nconn(conn_type_id)

                   ! loop over data
                   this_data =>  block%data
                   do while (associated(this_data))

                         call MPI_IRECV(...)
                         call MPI_ISEND(...)

                         this_data =>  this_data%next
                   enddo

             enddo
          ! (enclosing if and outer do constructs elided from this simplified excerpt)

start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime()-start_time
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 2 barrier "//trim(local_time),0)

          done=.false.
          counter = 0
          do while (.not.done)
             do ireq=1,nreq
                if (recv_req(ireq)/=MPI_REQUEST_NULL) then
                   call MPI_TEST(recv_req(ireq),found,mystatus,icommerr)
                   if (found) then
                      call ....
                      counter=counter+1
                   endif
                endif
             enddo
             if (counter==nreq) then
                done=.true.
             endif
          enddo
================================================================

The first call to the barrier works perfectly fine, but the second one gives 
the strange behavior...

Ghislain.

On 8 Sept 2011, at 16:53, Eugene Loh wrote:

On 9/8/2011 7:42 AM, Ghislain Lartigue wrote:
I will check that, but as I said in my first email, this strange behaviour happens 
only in one place in my code.
Is the strange behavior there the first time, or only much later on?  (You seem to 
imply later on, but I thought I'd ask.)

I agree the behavior is noteworthy, but it's plausible and there's not enough 
information to explain it based solely on what you've said.

Here is one scenario.  I don't know if it applies to you since I know very 
little about what you're doing.  I think with VampirTrace, you can collect 
performance data into large buffers.  Occasionally, the buffers need to be 
flushed to disk.  VampirTrace will wait for a good opportunity to do so -- 
e.g., a global barrier.  So, you execute lots of barriers, but suddenly you hit 
one where VT wants to flush to disk.  The flush takes a long time, so every process 
spends a long time in that barrier.  Then execution resumes, and barrier performance 
goes back to looking the way it did before.
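
One cheap way to test this scenario is to time every one of those barriers and flag the outliers.  A sketch (the iteration count, the names, and the 1 ms threshold are all arbitrary):

   double precision :: t0, dt
   integer :: it, ierr

   do it = 1, niter                ! niter: however many times this barrier is hit
      ! ... computation and non-blocking communication ...
      t0 = MPI_Wtime()
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      dt = MPI_Wtime() - t0
      if (dt > 1.0d-3) then        ! arbitrary threshold: flag barriers over 1 ms
         print '(A,I0,A,F10.6,A)', 'barrier #', it, ' took ', dt, ' s'
      endif
   enddo

If the slow barriers coincide with VampirTrace's buffer flushes -- or disappear when tracing is turned off -- that points at this scenario.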

Again, there are various scenarios to explain what you see.  More information 
would be needed to decide which applies to you.