Hello,

I am using OpenMPI 1.4.2 and 1.5. I am working on a very large scientific
application. The source code is huge and I don't have much freedom in this
code; I can't even force the user to define a topology with mpirun.

At the moment, the software uses MPI in a very classical way: on a cluster,
one MPI task = one core on a machine. For example, with 4 machines of 8 cores
each, we run 32 MPI tasks (see the launch line below). A hybrid OpenMP + MPI
version is currently in development, but we do not consider it for now.
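
As an illustration, the current launch looks roughly like this (hostnames and
file names are placeholders):

  mpirun -np 32 --hostfile my_hosts ./my_application

where my_hosts lists one machine per line, e.g. "node01 slots=8".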

At some points in the application, each task must call a Lapack function.
Every task calls the same function, on the same data, at the same time, for
the same result. The idea here is:

  - on each machine, only one task calls the Lapack function, using an
efficient multi-threaded or GPU version.
  - the other tasks wait.
  - each machine is used at 100%, and the Lapack function should be ~8 times
faster.
  - the computing task then broadcasts the result only to the tasks on its
local machine. In my cluster example, we would have 4 local broadcasts,
without using the network at all.

For the moment, here is my implementation:

#include <mpi.h>

/* the efficient multi-threaded Lapack routine (Fortran symbol) */
extern void dpotrf_(char *uplo, int *n, double *a, int *lda, int *info);
/* maps a hostname to a non-negative color, sketched below */
int my_hash(char *host_id, int len);

void my_dpotrf_(char *uplo, int *len_uplo, double *a, int *lda, int *info) {
  MPI_Comm host_comm;
  int myrank, host_rank, size, host_id_len, color;
  char host_id[MPI_MAX_PROCESSOR_NAME];
  size_t n2 = (size_t) *len_uplo * (size_t) *len_uplo;

  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(host_id, &host_id_len);

  /* all tasks running on the same host get the same color,
     hence end up in the same host_comm communicator */
  color = my_hash(host_id, host_id_len);
  MPI_Comm_split(MPI_COMM_WORLD, color, myrank, &host_comm);
  MPI_Comm_rank(host_comm, &host_rank);

  if (host_rank == 0) {
    /* only one task per machine computes, with the efficient
       multi-threaded (or GPU) Lapack version */
    dpotrf_(uplo, len_uplo, a, lda, info);
  }
  /* broadcast the result and the error code to the local tasks only */
  MPI_Bcast(a, (int) n2, MPI_DOUBLE, 0, host_comm);
  MPI_Bcast(info, 1, MPI_INT, 0, host_comm);

  MPI_Comm_free(&host_comm);
}
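
(my_hash is not shown above; it simply maps the hostname to a non-negative
integer color for MPI_Comm_split. It is something along these lines, where the
exact hash does not matter as long as two different hostnames are unlikely to
get the same color:)

int my_hash(char *host_id, int len) {
  unsigned int h = 5381;
  int i;
  /* djb2-style hash; a collision would wrongly merge two different
     hosts into the same communicator */
  for (i = 0; i < len; i++)
    h = h * 33u + (unsigned char) host_id[i];
  return (int) (h & 0x7fffffff);   /* colors must be non-negative */
}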

Each host_comm communicator groups the tasks running on the same machine. I
ran this version, but performance is worse than in the current version (each
task performing its own Lapack call). I have several questions:

  - in my implementation, is MPI_Bcast aware that it should use shared memory
communication? Or does the data go through the network? Judging from the first
results, it seems the data does go through the network.
  - is there any other way to group tasks by machine, with OpenMPI being aware
that it is grouping tasks by shared memory?
  - is it possible to assign a policy (in this case, a shared memory policy) to
a Bcast or a Barrier call?
  - do you have any better ideas for this problem? :)

Regards,

-- 
Jerome Reybert
