Hi Gus Correa

First of all, thanks for your suggestions.

1) The malloc function does return a non-NULL pointer.

2) I didn't try the MPI_Isend function. Actually, the function I really
need to use is MPI_Allgatherv(). When I used it, I found that it didn't
work when the data was >= 2GB; I debugged it and found that it ultimately
calls MPI_Send.

3) I have a large amount of data to train on, so transferring messages >=
2GB is necessary. I could divide the data into smaller pieces, but I think
the efficiency would drop as well.
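For reference, dividing the transfer would look roughly like the sketch below
(my own illustration, not code from this thread; the helper names and the
64 MiB chunk size are arbitrary). Each chunk stays under the 32-bit signed
count limit that the MPI calls accept:

    /* Sketch only: split a large buffer into chunks whose byte counts stay
     * below INT_MAX, so every MPI call receives a valid 32-bit count. */
    #include <mpi.h>
    #include <stddef.h>

    #define CHUNK_BYTES (64 * 1024 * 1024)   /* arbitrary 64 MiB chunk size */

    static void send_large(char *buf, size_t total, int dest, int tag, MPI_Comm comm)
    {
        size_t offset = 0;
        while (offset < total) {
            size_t left = total - offset;
            int count = (int)(left < CHUNK_BYTES ? left : CHUNK_BYTES);
            MPI_Send(buf + offset, count, MPI_BYTE, dest, tag, comm);
            offset += (size_t)count;
        }
    }

    static void recv_large(char *buf, size_t total, int src, int tag, MPI_Comm comm)
    {
        size_t offset = 0;
        while (offset < total) {
            size_t left = total - offset;
            int count = (int)(left < CHUNK_BYTES ? left : CHUNK_BYTES);
            MPI_Recv(buf + offset, count, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
            offset += (size_t)count;
        }
    }

The two ranks would then call send_large()/recv_large() in place of the
single 2GB MPI_Send/MPI_Recv.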


Regards
Xianjun Meng

2010/12/7 Gus Correa <g...@ldeo.columbia.edu>

> Hi Xianjun
>
> Suggestions/Questions:
>
> 1) Did you check if malloc returns a non-NULL pointer?
> Your program is assuming this, but it may not be true,
> and in this case the problem is not with MPI.
> You can print a message and call MPI_Abort if it doesn't (see the sketch after this list).
>
> 2) Have you tried MPI_Isend/MPI_Irecv?
> Or perhaps the buffered cousin MPI_Ibsend?
>
> 3) Why do you want to send these huge messages?
> Wouldn't it be less of a trouble to send several
> smaller messages?
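Regarding suggestion 1: a minimal sketch of such a check for the test program
quoted further down (the error message wording is only an example; note the
allocation would need to move after MPI_Init for MPI_Abort to be legal there):

    /* Sketch: allocate after MPI_Init and abort cleanly if malloc fails. */
    MPI_Init(&argc, &argv);

    char *g = (char *)malloc(Gsize);
    if (g == NULL) {
        fprintf(stderr, "malloc of %zu bytes failed\n", Gsize);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }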
>
> I hope it helps,
> Gus Correa
>
> Xianjun wrote:
>
>>
>> Hi
>>
>> Are you running on two processes (mpiexec -n 2)?
>> Yes
>>
>> Have you tried to print Gsize?
>> Yes, I have checked my code several times, and I think the errors come
>> from Open MPI. :)
>>
>> The command line I used:
>> "mpirun -hostfile ./Serverlist -np 2 ./test". The "Serverlist" file
>> include several computers in my network.
>>
>> The command line that I used to build the openmpi-1.4.1:
>> ./configure --enable-debug --prefix=/usr/work/openmpi ; make all install;
>>
>> What interconnect do you use?
>> It is a normal TCP/IP interconnect with a 1Gb network card. When I debugged my
>> code (and the Open MPI code), I found that Open MPI does call the
>> "mca_pml_ob1_send_request_start_rdma(...)" function, but I was not quite
>> sure which protocol was used when transferring 2GB of data. Do you have any
>> opinions? Thanks
>>
>> Best Regards
>> Xianjun Meng
>>
>> 2010/12/7 Gus Correa <g...@ldeo.columbia.edu>
>>
>>
>>    Hi Xianjun
>>
>>    Are you running on two processes (mpiexec -n 2)?
>>    I think this code will deadlock for more than two processes.
>>    The MPI_Recv won't have a matching send for rank>1.
>>
>>    Also, this is C, not MPI,
>>    but you may be wrapping into the negative numbers.
>>    Have you tried to print Gsize?
>>    It is probably -2147483648 on both 32-bit and 64-bit machines.
>>
>>    My two cents.
>>    Gus Correa
>>
>>    Mike Dubman wrote:
>>
>>        Hi,
>>        What interconnect and command line do you use? For the InfiniBand
>>        openib component there is a known issue with large transfers (2GB):
>>
>>        https://svn.open-mpi.org/trac/ompi/ticket/2623
>>
>>        try disabling memory pinning:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned
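For example, pinning can be turned off with the mpi_leave_pinned MCA parameter;
a sketch reusing the command line from earlier in this thread:

    # Disable leave-pinned behaviour for large-message transfers (see the FAQ above).
    mpirun --mca mpi_leave_pinned 0 -hostfile ./Serverlist -np 2 ./test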
>>
>>
>>        regards
>>        M
>>
>>
>>        2010/12/6 <xjun.m...@gmail.com>
>>
>>
>>
>>           hi,
>>
>>           On my computers (x86-64), sizeof(int) = 4, but
>>           sizeof(long) = sizeof(double) = sizeof(size_t) = 8. When I checked my
>>           mpi.h file, I found that the definition related to sizeof(int) is
>>           correct; meanwhile, I think the mpi.h file was generated according
>>           to my computing environment when I compiled Open MPI. So my code
>>           still doesn't work. :(
>>
>>           Further, I found that the collective routines (such as
>>           MPI_Allgatherv(...)), which are implemented on top of point-to-point,
>>           don't work either when the data is > 2GB.
>>
>>           Thanks
>>           Xianjun
>>
>>           2010/12/6 Tim Prince <n...@aol.com>
>>
>>
>>
>>               On 12/5/2010 7:13 PM, Xianjun wrote:
>>
>>                   hi,
>>
>>                   I ran into a problem recently when I tested the MPI_Send
>>                   and MPI_Recv functions. When I run the following code, the
>>                   processes hang, and I found there was no data transmission
>>                   on my network at all.
>>
>>                   BTW: I ran this test on two x86-64 computers with 16GB of
>>                   memory, running Linux.
>>
>>                   #include <stdio.h>
>>                   #include <mpi.h>
>>                   #include <stdlib.h>
>>                   #include <unistd.h>
>>
>>                   int main(int argc, char** argv)
>>                   {
>>                       int localID;
>>                       int numOfPros;
>>                       size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;
>>
>>                       char* g = (char*)malloc(Gsize);
>>
>>                       MPI_Init(&argc, &argv);
>>                       MPI_Comm_size(MPI_COMM_WORLD, &numOfPros);
>>                       MPI_Comm_rank(MPI_COMM_WORLD, &localID);
>>
>>                       MPI_Datatype MPI_Type_lkchar;
>>                       MPI_Type_contiguous(2048, MPI_BYTE, &MPI_Type_lkchar);
>>                       MPI_Type_commit(&MPI_Type_lkchar);
>>
>>                       if (localID == 0)
>>                       {
>>                           MPI_Send(g, 1024*1024, MPI_Type_lkchar, 1, 1, MPI_COMM_WORLD);
>>                       }
>>
>>                       if (localID != 0)
>>                       {
>>                           MPI_Status status;
>>                           MPI_Recv(g, 1024*1024, MPI_Type_lkchar, 0, 1, MPI_COMM_WORLD, &status);
>>                       }
>>
>>                       MPI_Finalize();
>>
>>                       return 0;
>>                   }
>>
>>               You supplied all your constants as 32-bit signed data, so even
>>               if the count for MPI_Send() and MPI_Recv() were a larger data
>>               type, you would see this limit. Did you look at your <mpi.h>?
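A standalone sketch of the limit Tim describes (my own illustration, not from
the original post): 1024*1024 elements of a 2048-byte type is exactly 2^31
bytes, one more than INT_MAX.

    /* Illustration: the total message size just crosses the signed 32-bit limit. */
    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        int count = 1024 * 1024;        /* count passed to MPI_Send/MPI_Recv   */
        long long extent = 2048;        /* bytes per MPI_Type_lkchar element   */
        long long total = (long long)count * extent;

        printf("total bytes = %lld\n", total);   /* 2147483648 */
        printf("INT_MAX     = %d\n", INT_MAX);   /* 2147483647 */
        return 0;
    }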
>>
>>               --         Tim Prince
>>