I should know OMPI better than I do, but generally, when you make an MPI
call, the library may end up doing all kinds of other work besides that
particular call. For example, with non-blocking point-to-point operations,
a message can make progress while the program is inside some other MPI
call. Consider:
MPI_Irecv(recv_req)
MPI_Isend(send_req)
MPI_Wait(send_req)
MPI_Wait(recv_req)
A receive is started in one call and completed in another, but it's
quite possible that most of the data transfer (and waiting time) occurs
while the program is in the calls associated with the send. The
accounting gets tricky.
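To make that concrete, here is a minimal sketch (nothing from your code;
the buffer size, datatype, and rank pairing are all made up) that times
the two waits separately. Run it with an even number of ranks; quite
often most of the elapsed time lands in the wait on the send request,
even though it is the receive's data that is arriving:
================================================================
! Hedged sketch: per-call timing of the Irecv/Isend/Wait pattern.
program progress_timing
  use mpi
  implicit none
  integer, parameter :: N = 1000000
  integer :: ierr, rank, peer, recv_req, send_req
  real(8), allocatable :: sbuf(:), rbuf(:)
  real(8) :: t0, t_wait_send, t_wait_recv

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  peer = ieor(rank, 1)            ! pair ranks 0<->1, 2<->3, ...
  allocate(sbuf(N), rbuf(N))
  sbuf = real(rank, 8)

  call MPI_Irecv(rbuf, N, MPI_DOUBLE_PRECISION, peer, 0, &
                 MPI_COMM_WORLD, recv_req, ierr)
  call MPI_Isend(sbuf, N, MPI_DOUBLE_PRECISION, peer, 0, &
                 MPI_COMM_WORLD, send_req, ierr)

  t0 = MPI_Wtime()
  call MPI_Wait(send_req, MPI_STATUS_IGNORE, ierr)
  t_wait_send = MPI_Wtime() - t0  ! the pending receive may progress here

  t0 = MPI_Wtime()
  call MPI_Wait(recv_req, MPI_STATUS_IGNORE, ierr)
  t_wait_recv = MPI_Wtime() - t0  ! often cheap: the data already arrived

  print '(a,i4,2(a,f10.6))', 'rank', rank, &
        '  wait(send)=', t_wait_send, '  wait(recv)=', t_wait_recv
  call MPI_Finalize(ierr)
end program progress_timing
================================================================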
So, I'm guessing that during the second barrier, MPI is busy making
progress on the pending non-blocking point-to-point operations, wherever
progress is possible. The call isn't purely a barrier operation.
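If that's right, it's easy to see in the same sketch: time a barrier
posted after the Irecv/Isend pair but before either wait. A fragment
(same assumed setup and declarations as above, plus a real(8) :: t_barrier):
================================================================
! Hedged fragment: barrier timed while sends/receives are pending.
t0 = MPI_Wtime()
call MPI_Barrier(MPI_COMM_WORLD, ierr)
t_barrier = MPI_Wtime() - t0    ! can include time spent moving the
                                ! pending messages, not just synchronization
================================================================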
On 9/8/2011 8:04 AM, Ghislain Lartigue wrote:
This behavior happens on every call (the first and all subsequent ones).
Here is my code (simplified):
================================================================
start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime() - start_time
! barrier time in ns, normalized by 36**3
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 1 barrier "//trim(local_time),0)
do conn_index_id=1, Nconn(conn_type_id)
   ! loop over data, posting a receive and a send per item
   this_data => block%data
   do while (associated(this_data))
      call MPI_IRECV(...)
      call MPI_ISEND(...)
      this_data => this_data%next
   enddo
enddo
start_time = MPI_Wtime()
call mpi_ext_barrier()
new_time = MPI_Wtime() - start_time
! barrier time in ns, normalized by 36**3
write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
call print_message("CAST GHOST DATA2 LOOP 2 barrier "//trim(local_time),0)
! poll the pending receives until all nreq have completed
done = .false.
counter = 0
do while (.not.done)
   do ireq = 1, nreq
      ! MPI_TEST sets a completed request to MPI_REQUEST_NULL,
      ! so each receive is counted exactly once
      if (recv_req(ireq) /= MPI_REQUEST_NULL) then
         call MPI_TEST(recv_req(ireq), found, mystatus, icommerr)
         if (found) then
            call ....
            counter = counter + 1
         endif
      endif
   enddo
   if (counter == nreq) then
      done = .true.
   endif
enddo
================================================================
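(As an aside: the hand-rolled test loop can also be written with
MPI_WAITSOME, which blocks instead of spinning. A minimal sketch, reusing
nreq/recv_req/icommerr from the code above, with handle_data as a
hypothetical stand-in for the elided call:)
================================================================
! Hedged sketch: the polling loop rewritten with MPI_WAITSOME.
! handle_data is hypothetical; it stands in for the elided "call ....".
integer :: outcount, completed, i
integer :: indices(nreq)
integer :: statuses(MPI_STATUS_SIZE, nreq)

completed = 0
do while (completed < nreq)
   ! block until at least one pending receive completes;
   ! completed requests are set to MPI_REQUEST_NULL by MPI
   call MPI_WAITSOME(nreq, recv_req, outcount, indices, statuses, icommerr)
   do i = 1, outcount
      call handle_data(indices(i))
   enddo
   completed = completed + outcount
enddo
================================================================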
The first call to the barrier works perfectly fine, but the second one gives
the strange behavior...
Ghislain.
On 8 Sept. 2011, at 16:53, Eugene Loh wrote:
On 9/8/2011 7:42 AM, Ghislain Lartigue wrote:
I will check that but, as I said in my first email, this strange behaviour
happens in only one place in my code.
Does the strange behavior show up the first time, or only much later on? (You
seem to imply later on, but I thought I'd ask.)
I agree the behavior is noteworthy, but it's plausible, and there isn't enough
information to explain it based solely on what you've said.
Here is one scenario. I don't know if it applies to you, since I know very
little about what you're doing. I think with VampirTrace, you can collect
performance data into large buffers. Occasionally, those buffers need to be
flushed to disk, and VampirTrace waits for a good opportunity to do so --
e.g., a global barrier. So, you execute lots of barriers, and then suddenly
you hit one where VT wants to flush to disk. The flush takes a long time, so
every process spends a long time in that particular barrier. Then execution
resumes, and barrier performance looks like it used to.
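If that scenario turns out to apply, the flushes can usually be pushed out
of the timed region by enlarging the trace buffer. Assuming a reasonably
recent VampirTrace, these environment variables control it:
================================================================
# Enlarge VampirTrace's in-memory trace buffer so it flushes less often
# (sizes are illustrative; check your VT version's defaults)
export VT_BUFFER_SIZE=256M
export VT_MAX_FLUSHES=0      # 0 = unlimited flushes, never stop tracing
================================================================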
Again, there are various scenarios to explain what you see. More information
would be needed to decide which applies to you.