Hello, I am using Open MPI 1.4.2 and 1.5. I am working on a very large piece of scientific software. The source code is huge and I do not have much freedom in it; I cannot even require users to define a topology with mpirun.
At the moment, the software uses MPI in a very classical way: on a cluster, one MPI task = one core on a machine. For example, with 4 machines of 8 cores each, we run 32 MPI tasks. A hybrid OpenMP + MPI version is currently in development, but we are not considering it for now.

At some points in the application, every task must call a LAPACK function. Every task calls the same function, on the same data, at the same time, for the same result. The idea here is:
- on each machine, only one task calls the LAPACK function, using an efficient multi-threaded or GPU version;
- the other tasks wait;
- each machine is used at 100%, and the LAPACK call should be ~8 times faster;
- the computing task then broadcasts the result only to the tasks on the same machine. In my cluster example, we would have 4 local broadcasts, without using the network at all.

For the moment, here is my implementation:

    void my_dpotrf_(char *uplo, int *len_uplo, double *a, int *lda, int *info)
    {
        MPI_Comm host_comm;
        int myrank, host_rank, size, host_id_len, color;
        char host_id[MPI_MAX_PROCESSOR_NAME];
        size_t n2 = (size_t)(*len_uplo) * (*len_uplo);

        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host_id, &host_id_len);

        /* Tasks running on the same host get the same color. */
        color = my_hash(host_id, host_id_len);
        MPI_Comm_split(MPI_COMM_WORLD, color, myrank, &host_comm);
        MPI_Comm_rank(host_comm, &host_rank);

        if (host_rank == 0) {
            /* efficient parallel LAPACK function */
        }

        MPI_Bcast(a, n2, MPI_DOUBLE, 0, host_comm);
        MPI_Bcast(info, 1, MPI_INT, 0, host_comm);
    }

Each host_comm communicator groups the tasks by machine. I ran this version, but performance is worse than the current version (each task performing its own LAPACK call). I have several questions:
- in my implementation, is MPI_Bcast aware that it should use shared-memory communication? Does the data go through the network? Judging from the first results, it seems it does.
- is there any other way to group tasks by machine, with Open MPI aware that the grouping matches shared memory?
- is it possible to assign a policy (in this case, a shared-memory policy) to a Bcast or a Barrier call?
- do you have any better idea for this problem? :)

Regards,
--
Jerome Reybert
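P.S. For completeness, my_hash is nothing special: it only has to map the processor name to a non-negative int, since MPI_Comm_split requires color >= 0. A minimal sketch along the lines of what I use (a djb2-style string hash; the exact function does not matter, but note that two different host names could in principle collide and end up in the same communicator):

    #include <limits.h>

    /* Map a host name to a non-negative color for MPI_Comm_split.
     * djb2-style hash: h = h * 33 + c over the name's bytes. */
    int my_hash(const char *s, int len)
    {
        unsigned long h = 5381;
        int i;
        for (i = 0; i < len; i++)
            h = h * 33 + (unsigned char)s[i];
        /* h % INT_MAX is in [0, INT_MAX - 1], so the color is non-negative */
        return (int)(h % INT_MAX);
    }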
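P.P.S. In case it is relevant to the Bcast question: this is how I inspect and constrain the transports on my side (sketch, assuming a standard Open MPI 1.4/1.5 install; my_app stands for our binary):

    # List the BTL (byte transfer layer) components this Open MPI build provides:
    ompi_info | grep btl

    # Restrict transports to shared memory on-node and TCP between nodes
    # ("self" is needed for a process sending to itself):
    mpirun --mca btl self,sm,tcp -np 32 ./my_app
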