Re: [OMPI users] Question about collective messages implementation

2010-11-29 Thread George Bosilca
With the increasing gap between network bandwidth and processor computing
power, the current trend in linear algebra is toward communication-avoiding
algorithms (i.e., replacing communication with redundant computation). You're
taking the exact opposite path; I wonder whether you can get any benefit...

Moreover, your proposed approach only makes sense if you expect the LAPACK
operation to be faster when the other cores are idle (so that they can be used
for the computation itself). This is very tricky to do for a single LAPACK
call, as OMP_NUM_THREADS & co. usually affect the entire application. I
remember reading somewhere that MKL provides a function to change the number
of threads at runtime; maybe you should look in that direction.
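
For what it's worth, a minimal sketch of that idea, assuming an OpenMP-threaded
LAPACK (MKL's mkl_set_num_threads() would be the vendor-specific equivalent);
the wrapper name below is made up:

#include <omp.h>        /* omp_get_max_threads(), omp_set_num_threads() */
/* #include <mkl.h> */  /* for mkl_set_num_threads(), if using MKL directly */

extern void dpotrf_(char *uplo, int *n, double *a, int *lda, int *info);

/* Raise the thread count around a single LAPACK call, then restore it so
   the rest of the MPI application is unaffected. */
void threaded_dpotrf(char uplo, int n, double *a, int lda, int *info,
                     int nthreads)
{
  int saved = omp_get_max_threads();

  omp_set_num_threads(nthreads);      /* or mkl_set_num_threads(nthreads) */
  dpotrf_(&uplo, &n, a, &lda, info);  /* the one threaded LAPACK call */
  omp_set_num_threads(saved);         /* restore the previous setting */
}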

  george.


On Nov 2, 2010, at 06:33, Ashley Pittman wrote:

> 
> On 2 Nov 2010, at 10:21, Jerome Reybert wrote:
>> - in my implementation, is MPI_Bcast aware that it should use shared memory
>> communication? Does the data go through the network? It seems to be the case,
>> considering the first results.
>> - are there any other methods to group tasks by machine, with Open MPI aware
>> that it is grouping tasks by shared memory?
>> - is it possible to assign a policy (in this case, a shared memory policy) to
>> a Bcast or a Barrier call?
>> - do you have any better idea for this problem? :)
> 
> Interesting stuff, two points quickly spring to mind from the above:
> 
> MPI_Comm_split() is an expensive operation; the manual may say it's low cost,
> but it shouldn't be used inside any critical loops, so be sure you are doing
> the Comm_split() at startup and then re-using the communicator as and when
> needed.
> 
> Any blocking call into Open MPI will poll, consuming CPU cycles until the call
> is complete. You can mitigate this by telling Open MPI to aggressively call
> yield whilst polling, which would mean that your parallel Lapack function
> could get the CPU resources it requires.  Have a look at this FAQ entry for
> details of the option and what you can expect it to do.
> 
> http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
> 
> Ashley.
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 




Re: [OMPI users] Question about collective messages implementation

2010-11-08 Thread Jerome Reybert
Ashley Pittman writes:

> MPI_Comm_split() is an expensive operation; the manual may say it's low cost,
> but it shouldn't be used inside any critical loops, so be sure you are doing
> the Comm_split() at startup and then re-using the communicator as and when
> needed.
> 
> Any blocking call into Open MPI will poll, consuming CPU cycles until the call
> is complete. You can mitigate this by telling Open MPI to aggressively call
> yield whilst polling, which would mean that your parallel Lapack function
> could get the CPU resources it requires.  Have a look at this FAQ entry for
> details of the option and what you can expect it to do.
> 
> http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
> 
> Ashley.
> 

Thanks for your detailed responses. Actually, the problem came from a silly
mistake of mine...

However, I'll take a closer look at the active waiting you describe and see
whether it affects my performance.

Jérôme



Re: [OMPI users] Question about collective messages implementation

2010-11-02 Thread Jeff Squyres
On Nov 2, 2010, at 6:21 AM, Jerome Reybert wrote:

> Each host_comm communicator groups the tasks by machine. I ran this version,
> but performance is worse than with the current version (each task performing
> its own Lapack function). I have several questions:

>  - in my implementation, is MPI_Bcast aware that it should use shared memory
> communication? Does the data go through the network? It seems to be the case,
> considering the first results.

It should use shared memory by default.
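
(If you want a sanity check, you can pin down the transports explicitly, e.g.
something like

  mpirun --mca btl self,sm,tcp -np 32 ./your_app

and/or look at the sm BTL settings with "ompi_info --param btl sm" -- but the
sm BTL is normally picked automatically for peers on the same node.)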

>  - are there any other methods to group tasks by machine, with Open MPI aware
> that it is grouping tasks by shared memory?

The MPI API does not expose this kind of functionality, but there's at least 1 
proposal in front of the MPI Forum to do this kind of thing.

As Ashley mentioned, you might want to do this MPI_Comm_split once and then 
just use that communicator from then on.  The code snippet you sent leaks the 
host_comm, for example.
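
Something along these lines (just a sketch, re-using your my_hash() helper):
create the communicator once, keep it in a static, and free it before
MPI_Finalize so it isn't leaked:

#include <mpi.h>

static MPI_Comm host_comm = MPI_COMM_NULL;

/* Build the per-node communicator on first use and cache it. */
static MPI_Comm get_host_comm(void)
{
  if (MPI_COMM_NULL == host_comm) {
    char host_id[MPI_MAX_PROCESSOR_NAME];
    int myrank, host_id_len;

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Get_processor_name(host_id, &host_id_len);
    MPI_Comm_split(MPI_COMM_WORLD, my_hash(host_id, host_id_len),
                   myrank, &host_comm);
  }
  return host_comm;
}

/* Call once at shutdown, before MPI_Finalize(). */
static void free_host_comm(void)
{
  if (MPI_COMM_NULL != host_comm) {
    MPI_Comm_free(&host_comm);  /* also resets host_comm to MPI_COMM_NULL */
  }
}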

>  - is it possible to assign a policy (in this case, a shared memory policy) to
> a Bcast or a Barrier call?

Not really, no.

>  - do you have any better idea for this problem? :)

Ashley probably hit the nail on the head.  The short version is that OMPI 
aggressively polls for progress.  Forcing the degraded mode will help (because 
it'll yield), but it won't solve the problem because it'll still be 
aggressively polling -- it'll just yield every time it polls.  But it's still 
polling.

We've had many discussions about this topic, but have never really addressed it 
-- the need for low latency has been greater than the need for 
blocking/not-consuming-CPU progress.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Question about collective messages implementation

2010-11-02 Thread Ashley Pittman

On 2 Nov 2010, at 10:21, Jerome Reybert wrote:
>  - in my implementation, is MPI_Bcast aware that it should use shared memory
> communication? Does the data go through the network? It seems to be the case,
> considering the first results.
>  - are there any other methods to group tasks by machine, with Open MPI aware
> that it is grouping tasks by shared memory?
>  - is it possible to assign a policy (in this case, a shared memory policy) to
> a Bcast or a Barrier call?
>  - do you have any better idea for this problem? :)

Interesting stuff, two points quickly spring to mind from the above:

MPI_Comm_split() is an expensive operation; the manual may say it's low cost,
but it shouldn't be used inside any critical loops, so be sure you are doing
the Comm_split() at startup and then re-using the communicator as and when
needed.

Any blocking call into Open MPI will poll, consuming CPU cycles until the call
is complete. You can mitigate this by telling Open MPI to aggressively call
yield whilst polling, which would mean that your parallel Lapack function
could get the CPU resources it requires.  Have a look at this FAQ entry for
details of the option and what you can expect it to do.

http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
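
(Concretely, that FAQ entry boils down to the mpi_yield_when_idle MCA
parameter, so something along the lines of

  mpirun --mca mpi_yield_when_idle 1 -np 32 ./your_app

puts Open MPI in the "degraded" mode where a blocked process calls yield while
it polls.)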

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




[OMPI users] Question about collective messages implementation

2010-11-02 Thread Jerome Reybert
Hello,

I am using Open MPI 1.4.2 and 1.5. I am working on a very large piece of
scientific software. The source code is huge and I don't have a lot of freedom
in it; I can't even force the user to define a topology with mpirun.

At the moment, the software uses MPI in a very classical way: on a cluster,
one MPI task = one core on a machine. For example, with 4 machines of 8 cores
each, we run 32 MPI tasks. A hybrid OpenMP + MPI version is currently in
development, but we do not consider it for now.

At some points in the application, each task must call a Lapack function. Each
task calls the same function, on the same data, at the same time, for the same
result. The idea here is:

  - on each machine, only one task calls the Lapack function, using an
efficient multi-threaded or GPU version.
  - the other tasks wait.
  - each machine is used at 100%, and the Lapack function should be ~8 times
more efficient.
  - then, the computing task broadcasts the result only to the tasks on the
local machine. In my cluster example, we would have 4 local broadcasts,
without using the network at all.

For the moment, here is my implementation:

void my_dpotrf_(char *uplo, int *len_uplo, double *a, int *lda, int *info) {
  MPI_Comm host_comm;
  int myrank, host_rank, size, host_id_len, color;
  char host_id[MPI_MAX_PROCESSOR_NAME];
  size_t n2 = *len_uplo * *len_uplo;

  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(host_id, &host_id_len);

  /* all tasks on the same machine get the same color, hence the same host_comm */
  color = my_hash(host_id, host_id_len);
  MPI_Comm_split(MPI_COMM_WORLD, color, myrank, &host_comm);
  MPI_Comm_rank(host_comm, &host_rank);

  if (host_rank == 0) {
    /* efficient parallel Lapack function (multi-threaded or GPU) called here */
  }
  /* broadcast the result to the other tasks on the same machine */
  MPI_Bcast(a, n2, MPI_DOUBLE, 0, host_comm);
  MPI_Bcast(info, 1, MPI_INT, 0, host_comm);
}
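
(my_hash() simply turns the hostname into a non-negative integer color, so
that all tasks on the same machine end up in the same host_comm; something
along these lines:)

/* Hash a hostname to a color for MPI_Comm_split; colors must be >= 0.
   A collision between two different hostnames would merge their groups,
   which would defeat the "local only" broadcast. */
static int my_hash(const char *s, int len)
{
  unsigned int h = 5381;
  int i;

  for (i = 0; i < len; i++)
    h = 33u * h + (unsigned char) s[i];
  return (int) (h & 0x7fffffff);
}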

Each host_comm communicator groups the tasks by machine. I ran this version,
but performance is worse than with the current version (each task performing
its own Lapack function). I have several questions:

  - in my implementation, is MPI_Bcast aware that it should use shared memory
communication? Does the data go through the network? It seems to be the case,
considering the first results.
  - are there any other methods to group tasks by machine, with Open MPI aware
that it is grouping tasks by shared memory?
  - is it possible to assign a policy (in this case, a shared memory policy) to
a Bcast or a Barrier call?
  - do you have any better idea for this problem? :)

Regards,

-- 
Jerome Reybert