Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-30 Thread Jeff Squyres (jsquyres)
On Nov 24, 2015, at 9:31 AM, Dave Love  wrote:
> 
>> btw, we already use the force, thanks to the ob1 pml and the yoda spml
> 
> I think that's assuming familiarity with something which leaves out some
> people...

FWIW, I agree: we use unhelpful names for components in Open MPI.  What Gilles 
is specifically referring to here is that there are several Star Wars-based 
names of plugins in Open MPI.  They mean something to us developers (they 
started off as a funny joke), but they mean little/nothing to end users.
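
To make the problem concrete: these names are exactly what users have to
type when selecting or tuning components.  For example (purely
illustrative):

  mpirun --mca pml ob1 ./my_app

selects our main point-to-point messaging component, but nothing in the
string "ob1" tells you that.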

I actually specifically called out this issue in the SC'15 Open MPI BOF:

http://image.slidesharecdn.com/ompi-bof-2015-for-web-151130155610-lva1-app6891/95/open-mpi-sc15-state-of-the-union-bof-28-638.jpg?cb=1448898995

This issue is definitely on the agenda for the face-to-face Open MPI 
developers' meeting in February 
(https://github.com/open-mpi/ompi/wiki/Meeting-2016-02).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] MPI_AllReduce vs MPI_IAllReduce

2015-11-30 Thread Felipe .
Thanks for the reply, Ralph.

Now I think it is clearer to me why it could be so much slower. The reason
would be that the blocking reduction algorithm has a very different
implementation than the non-blocking one.

Since there are lots of ways to implement it, are there options to tune the
non-blocking reduction algorithm and its parameters?

Something like the ones we have for the blocking versions, for instance:
"coll_tuned_allreduce_algorithm", "coll_tuned_reduce_algorithm", etc.

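For what it's worth, the blocking tunables can at least be listed with
ompi_info:

  ompi_info --param coll tuned --level 9

If I understand correctly, the non-blocking collectives come from the
libnbc component, so anything tunable there should show up with:

  ompi_info --param coll libnbc --level 9
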
--
Felipe

2015-11-27 18:20 GMT-02:00 Ralph Castain :

> One thing you might want to keep in mind is that “non-blocking” doesn’t
> mean “asynchronous progress”. The API call may not block, but the
> communication only progresses when you actually call down into the library.
>
> So if you start a non-blocking collective and then make additional calls
> into MPI only rarely, you should expect to see slower performance.
>
> We are working on providing async progress on all operations, but I don’t
> believe much (if any) of it is in the release branches so far.
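>
> A common way to live with this today is to split the local work into
> chunks and call MPI_Test between chunks, so the library gets a chance to
> progress the collective.  A minimal sketch in the style of your benchmark
> (untested; local_work_chunk and nchunks are placeholders I made up for
> slices of your computation):
>
>     use mpi
>     integer :: request, ierror, chunk
>     integer :: status(MPI_STATUS_SIZE)
>     logical :: done
>
>     call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>                         MPI_COMM_WORLD, request, ierror)
>     done = .false.
>     do chunk = 1, nchunks
>         call local_work_chunk(chunk)   ! a slice of local_computation
>         ! give the library a chance to progress the collective
>         if (.not. done) call mpi_test(request, done, status, ierror)
>     end do
>     ! mpi_wait on an already-completed request returns immediately
>     call mpi_wait(request, status, ierror)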
>
>
> On Nov 27, 2015, at 11:37 AM, Felipe .  wrote:
>
> >Try to do a variable amount of work for every process; I see non-blocking
> >as a way to speed up communication when processes arrive at the call at
> >different times. Please always keep this in the back of your mind when
> >doing this.
>
> I tried to simplify the problem in the explanation. The
> "local_computation" is variable among different processes, so there is load
> imbalance in the real problem.
> The microbenchmark was just a way to measure the overhead, which was much
> greater than expected.
>
> >Surely non-blocking has overhead, and if the communication time is low,
> >the relative overhead will be much higher.
>
> Of course there is. But in my case, which is a real HPC application for
> seismic data processing, the overhead was prohibitively and surprisingly high.
>
> >You haven't specified what nx*ny*nz is, and hence your "slower" and
> >"faster" make "no sense"... And hence your questions are difficult to
> >answer; basically "it depends".
>
> In my tests, I used nx = 700, ny = 200, nz = 60, total_iter = 1000. val
> is a real(4) array, so each reduction moves nx*ny*nz = 8,400,000 elements,
> about 33.6 MB. This is basically the same size as in the real application.
> Since I used the same values for all tests, it is reasonable to compare
> the results.
> What I meant with question 1 was: are overheads this high expected?
>
> The microbenchmark is attached to this e-mail.
>
> The detailed result was (using 11 nodes):
>
> Open MPI + blocking:
>  ==
>  [RESULT] Reduce time =  21.790411
>  [RESULT] Total  time =  24.977373
>  ==
>
> Open MPI + non-blocking:
>  ==
>  [RESULT] Reduce time =  97.332792
>  [RESULT] Total  time = 100.470874
>  ==
>
> Intel MPI + blocking:
>  ==
>  [RESULT] Reduce time =  17.587828
>  [RESULT] Total  time =  20.655875
>  ==
>
>
> Intel MPI + non-blocking:
>  ==
>  [RESULT] Reduce time =  49.483195
>  [RESULT] Total  time =  52.642514
>  ==
>
> Thanks in advance.
>
> 2015-11-27 14:57 GMT-02:00 Felipe . :
>
>> Hello!
>>
>> I have a program that basically is (first implementation):
>> for i in N:
>>   local_computation(i)
>>   mpi_allreduce(in_place, i)
>>
>> In order to mitigate the implicit barrier of the mpi_allreduce, I
>> tried to start an mpi_iallreduce instead. Like this (second implementation):
>> for i in N:
>>   local_computation(i)
>>   j = i
>>   if i is not first:
>> mpi_wait(request)
>>   mpi_Iallreduce(in_place, j, request)
>>
>> The result was that the second was a lot worse. The processes spent 3
>> times more time in mpi_wait than in the mpi_allreduce from the first
>> implementation. I expected it could be somewhat worse, but not that much.
>>
>> So I wrote a microbenchmark in Fortran to stress this. Here is the
>> implementation:
>> Blocking:
>> do i = 1, total_iter ! [
>>     t_0 = mpi_wtime()
>>     call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                        MPI_COMM_WORLD, ierror)
>>     if (ierror .ne. 0) then ! [
>>         write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>         call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>     end if ! ]
>>     t_reduce = t_reduce + (mpi_wtime() - t_0)
>> end do ! ]
>>
>> Non-blocking:
>> do i = 1, total_iter ! [
>>     t_0 = mpi_wtime()
>>     call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                         MPI_COMM_WORLD, request, ierror)
>>     if (ierror .ne. 0) then ! [
>>         write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>         call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>     end if ! ]
>>     t_reduce = t_reduce + (mpi_wtime() - t_0)
>>
>>     t_0 = mpi_wtime()
>>     call mpi_wait(request, status, ierror)
>>     if (ierror .ne. 0) then ! [
>>