Hi,
I am using the MPI_Reduce operation on a 122880x400 matrix of doubles.
The parallel job runs on 32 machines with processors of different
speed, but the architecture and OS are the same on all machines
(x86_64). The task is a typical map-and-reduce, i.e. each of the
processes collects some data, which is then summed (MPI_Reduce with
MPI_SUM).
Because the processors differ in speed, each process reaches the
MPI_Reduce at a different time.
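For reference, a simplified sketch of the call (pointer names are just
placeholders, the sizes are the ones given above):

  /* inside main(), after MPI_Init(); needs <stdlib.h> and <mpi.h> */
  #define ROWS 122880
  #define COLS 400

  double *local  = malloc((size_t)ROWS * COLS * sizeof(double)); /* per-rank data  */
  double *global = malloc((size_t)ROWS * COLS * sizeof(double)); /* result on root */

  /* ... each rank fills "local" with its partial statistics ... */

  MPI_Reduce(local, global, ROWS * COLS, MPI_DOUBLE,
             MPI_SUM, 0, MPI_COMM_WORLD);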
The *first problem* came when I called MPI_Reduce on the whole matrix.
The job ended with an *MPI_ERR_OTHER error*, each time on a different
rank. I worked around it by chunking the matrix into 2048 submatrices
and calling MPI_Reduce in a loop.
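Roughly like this (a row-wise split, 60 rows per chunk, is shown here
just for illustration):

  enum { NCHUNKS = 2048, CHUNK = (ROWS / NCHUNKS) * COLS }; /* 60 * 400 doubles per call */

  for (int i = 0; i < NCHUNKS; i++)
      MPI_Reduce(local  + (size_t)i * CHUNK,
                 global + (size_t)i * CHUNK,
                 CHUNK, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);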
However, a *second problem* arose: MPI_Reduce hangs... It apparently
gets stuck in some kind of deadlock or something like that. If the
processors are of similar speed, the problem seems to disappear, but I
cannot guarantee that condition all the time.
I managed to get rid of the problem (at least for a few non-problematic
iterations) by putting an MPI_Barrier right before the MPI_Reduce line.
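That is, something like this (shown here with the barrier inside the
chunk loop, before every reduce):

  for (int i = 0; i < NCHUNKS; i++) {
      MPI_Barrier(MPI_COMM_WORLD);  /* make all ranks arrive together */
      MPI_Reduce(local  + (size_t)i * CHUNK,
                 global + (size_t)i * CHUNK,
                 CHUNK, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }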
The questions are:
1) Is this usual behavior?
2) Is there some kind of timeout for MPI_Reduce?
3) Why does MPI_Reduce die on a large amount of data if the system has
enough address space (64-bit compilation)?
Thanx
Ondrej Glembek
--
Ondrej Glembek, PhD student E-mail: glem...@fit.vutbr.cz
UPGM FIT VUT Brno, L226 Web: http://www.fit.vutbr.cz/~glembek
Bozetechova 2, 612 66 Phone: +420 54114-1292
Brno, Czech Republic Fax: +420 54114-1290
ICQ: 93233896
GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C