[OMPI users] Still bothered / cannot run an application
(cross-posted to the 'users' and 'devel' mailing lists)

Dear Open MPI developers,

a long time ago I reported an error in Open MPI:
http://www.open-mpi.org/community/lists/users/2012/02/18565.php

In 1.6 the behaviour has changed: the test case no longer hangs forever and blocks an InfiniBand interface, but seems to run through, and now this error message is printed:

--
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason. The failure occured here:

  Local host:
  Device:       mlx4_0
  Function:     openib_reg_mr()
  Errno says:   Cannot allocate memory

You may need to consult with your system administrator to get this
problem fixed.
--

Looking into the FAQ
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
gives us no hint about what is wrong. The locked memory is unlimited:

--
pk224850@linuxbdc02:~[502]$ cat /etc/security/limits.conf | grep memlock
#        - memlock - max locked-in-memory address space (KB)
*        hard    memlock    unlimited
*        soft    memlock    unlimited
--

Could it still be an Open MPI issue? Are you interested in reproducing this?

Best,
Paul Kapinos

P.S.: The same test with Intel MPI cannot run using DAPL, but runs fine over 'ofa' (= native verbs, as Open MPI uses it). So I believe the problem is rooted in the communication pattern of the program: it sends very LARGE messages to a lot of / all other processes. (The program performs a matrix transposition of a distributed matrix.)

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
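[Editor's note: on mlx4 hardware, the memory the driver can register is bounded not only by the memlock limit but also by the mlx4_core module parameters log_num_mtt and log_mtts_per_seg; a registration beyond that bound fails with ENOMEM even when memlock is unlimited, which matches the symptom above. A rough sketch of the arithmetic follows; the parameter values are made-up examples, on a real node read them from /sys/module/mlx4_core/parameters/.]

```shell
# Made-up example values; on a real node use:
#   cat /sys/module/mlx4_core/parameters/log_num_mtt
#   cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
log_num_mtt=20
log_mtts_per_seg=3
page_size=4096

# Max registerable memory = num_mtt * mtts_per_seg * page_size
max_reg_mem=$(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
echo "max registerable memory: ${max_reg_mem} bytes"
```

With these example values the bound is 32 GiB; a node with more RAM than that (or an application registering very large buffers) can hit openib_reg_mr failures despite an unlimited memlock setting.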
[OMPI users] Can't read more than 2^31 bytes with MPI_File_read, regardless of type?
Hi:

One of our users is reporting trouble reading large files with MPI_File_read() (or read_all). We tried several different type sizes to keep the count below 2^31, but the problem persists. A simple C program to test this is attached; we see it with both OpenMPI 1.4.4 and OpenMPI 1.6, the only difference being the error code returned.

We can read the required amount of data by looping over MPI_File_read()s, but in more complicated scenarios this gets awkward. I always thought that the 32-bit signed count limitation wasn't so bad because you could create larger datatypes to get around it, but that appears not to be the case here. Is this a known problem that we should just work around?

Output from ompi_info --all for the 1.4.4 build is also attached.

OpenMPI 1.4.4:
Trying 268435457 of float, 1073741828 bytes: successfully read 268435457
Trying 536870913 of float, 2147483652 bytes: failed: err=35, MPI_ERR_IO: input/output error
Trying 134217729 of double, 1073741832 bytes: successfully read 134217729
Trying 268435457 of double, 2147483656 bytes: failed: err=35, MPI_ERR_IO: input/output error
Trying 67108865 of 2xdouble, 1073741840 bytes: successfully read 67108865
Trying 134217729 of 2xdouble, 2147483664 bytes: failed: err=35, MPI_ERR_IO: input/output error
Trying 524289 of 256xdouble, 1073743872 bytes: successfully read 524289
Trying 1048577 of 256xdouble, 2147485696 bytes: failed: err=35, MPI_ERR_IO: input/output error
Chunk 1/2: Trying 524288 of 256xdouble, chunked, 1073741824 bytes: successfully read 524288
Chunk 2/2: Trying 524289 of 256xdouble, chunked, 1073743872 bytes: successfully read 524289

OpenMPI 1.6:
Trying 268435457 of float, 1073741828 bytes: successfully read 268435457
Trying 536870913 of float, 2147483652 bytes: failed: err=13, MPI_ERR_ARG: invalid argument of some other kind
Trying 134217729 of double, 1073741832 bytes: successfully read 134217729
Trying 268435457 of double, 2147483656 bytes: failed: err=13, MPI_ERR_ARG: invalid argument of some other kind
Trying 67108865 of 2xdouble, 1073741840 bytes: successfully read 67108865
Trying 134217729 of 2xdouble, 2147483664 bytes: failed: err=13, MPI_ERR_ARG: invalid argument of some other kind
Trying 524289 of 256xdouble, 1073743872 bytes: successfully read 524289
Trying 1048577 of 256xdouble, 2147485696 bytes: failed: err=13, MPI_ERR_ARG: invalid argument of some other kind
Chunk 1/2: Trying 524288 of 256xdouble, chunked, 1073741824 bytes: successfully read 524288
Chunk 2/2: Trying 524289 of 256xdouble, chunked, 1073743872 bytes: successfully read 524289

- Jonathan

--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int tryToRead(const MPI_File file, const MPI_Datatype type, const int count,
              const size_t size, const char *typename, void *buf)
{
    int ierr;
    MPI_Status status;
    size_t bufsize = (size_t)count * size;

    printf("Trying %d of %s, %zu bytes: ", count, typename, bufsize);
    ierr = MPI_File_read(file, buf, count, type, &status);
    if (!ierr) {
        int gotcount;
        MPI_Get_count(&status, type, &gotcount);
        printf("successfully read %d\n", gotcount);
    } else {
        char err[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(ierr, err, &len);
        printf("failed: err=%d, %s\n", ierr, err);
    }
    return ierr;
}

int tryToReadInChunks(const MPI_File file, const MPI_Datatype type, const int count,
                      const size_t size, const char *typename, void *buf, int nchunks)
{
    int ierr;
    int nsofar = 0;
    int chunksize = count / nchunks;
    char *cbuf = buf;

    for (int chunk = 0; chunk < nchunks; chunk++) {
        int thischunk = chunksize;
        if (chunk == nchunks - 1)
            thischunk = count - nsofar;
        printf("Chunk %d/%d: ", chunk + 1, nchunks);
        ierr = tryToRead(file, type, thischunk, size, typename,
                         &(cbuf[nsofar * size]));
        if (ierr) break;
        nsofar += thischunk;
    }
    return ierr;
}

int main(int argc, char *argv[])
{
    int count;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "/dev/zero", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* headroom for the largest (256xdouble) read below */
    char *buf = malloc(((size_t)1 << 31) + 4096);
    if (buf == NULL) {
        printf("Malloc failed.\n");
        exit(-1);
    }

    /* floats */
    count = (1 << 28) + 1;
    tryToRead(fh, MPI_FLOAT, count, sizeof(float), "float", buf);
    count = (1 << 29) + 1;
    tryToRead(fh, MPI_FLOAT, count, sizeof(float), "float", buf);

    /* doubles */
    count = (1 << 27) + 1;
    tryToRead(fh, MPI_DOUBLE, count, sizeof(double), "double", buf);
    count = (1 << 28) + 1;
    tryToRead(fh, MPI_DOUBLE, count, sizeof(double), "double", buf);

    /* 2 x doubles */
    MPI_Datatype TwoDoubles;
    MPI_Type_contiguous(2, MPI_DOUBLE, &TwoDoubles);
    MPI_Type_commit(&TwoDoubles);
    count = (1 << 26) + 1;
    tryToRead(fh, TwoDoubles, count, 2*sizeof(double), "2xdouble", buf);
    count = (1 << 27) + 1;
    /* (the archived attachment is truncated at this point; the remainder
       is reconstructed from the program output quoted above) */
    tryToRead(fh, TwoDoubles, count, 2*sizeof(double), "2xdouble", buf);

    /* 256 x doubles */
    MPI_Datatype BigDoubles;
    MPI_Type_contiguous(256, MPI_DOUBLE, &BigDoubles);
    MPI_Type_commit(&BigDoubles);
    count = (1 << 19) + 1;
    tryToRead(fh, BigDoubles, count, 256*sizeof(double), "256xdouble", buf);
    count = (1 << 20) + 1;
    tryToRead(fh, BigDoubles, count, 256*sizeof(double), "256xdouble", buf);
    /* the same read again, in two chunks that each fit below 2^31 bytes */
    tryToReadInChunks(fh, BigDoubles, count, 256*sizeof(double), "256xdouble",
                      buf, 2);

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
Re: [OMPI users] Connection refused with openmpi-1.6.0
Are you setting any MCA parameters, such as btl_tcp_if_include or btl_tcp_if_exclude, perchance? They could be in your environment or in a file, too.

I ask because we should be skipping the loopback device by default (i.e., it should be covered by the default value of btl_tcp_if_exclude). What is the output from ifconfig?

On Jul 11, 2012, at 12:12 PM, Siegmar Gross wrote:

> Hello Reuti,
>
> thank you for your reply.
>
>>> I get the following error when I try to run my programs with
>>> openmpi-1.6.0.
>>>
>>> tyr hello_1 52 which mpiexec
>>> /usr/local/openmpi-1.6_32_cc/bin/mpiexec
>>> tyr hello_1 53
>>>
>>> tyr hello_1 51 mpiexec --host tyr,sunpc1 -np 3 hello_1_mpi
>>> Process 0 of 3 running on tyr.informatik.hs-fulda.de
>>> Process 2 of 3 running on tyr.informatik.hs-fulda.de
>>> [[4154,1],0][../../../../../openmpi-1.6/ompi/mca/btl/tcp/btl_tcp_endpoint.c:586:mca_btl_tcp_endpoint_start_connect]
>>> from tyr.informatik.hs-fulda.de to: sunpc1
>>> Unable to connect to the peer 127.0.0.1 on port 1024: Connection refused
>>>
>>> Process 1 of 3 running on sunpc1.informatik.hs-fulda.de
>>> [[4154,1],1][../../../../../openmpi-1.6/ompi/mca/btl/tcp/btl_tcp_endpoint.c:586:mca_btl_tcp_endpoint_start_connect]
>>> from sunpc1.informatik.hs-fulda.de to: tyr
>>> Unable to connect to the peer 127.0.0.1 on port 516: Connection refused
>>
>> Some distributions give the loopback interface also the name of the host.
>> Is there an additional line:
>>
>> 127.0.0.1 tyr.informatik.hs-fulda.de
>>
>> in /etc/hosts besides the localhost and interface entry?
>
> No, there isn't.
>
> tyr etc 16 more hosts
> #
> # Internet host table
> #
> ::1 localhost
> 127.0.0.1 localhost
> ...
>
> tyr etc 20 ssh sunpc1 head /etc/hosts
> 127.0.0.1 localhost
> ...
>
> Kind regards
>
> Siegmar
>
>
>>> [sunpc1.informatik.hs-fulda.de:24555] *** An error occurred in MPI_Barrier
>>> [sunpc1.informatik.hs-fulda.de:24555] *** on communicator MPI_COMM_WORLD
>>> [sunpc1.informatik.hs-fulda.de:24555] *** MPI_ERR_INTERN: internal error
>>> [sunpc1.informatik.hs-fulda.de:24555] *** MPI_ERRORS_ARE_FATAL: your MPI
>>> job will now abort
>>> ...
>>>
>>> I have no problems with just one host (in this case "127.0.0.1" should
>>> work). Why didn't mpiexec use the IP addresses of the hosts in the
>>> above example?
>>>
>>> tyr hello_1 53 mpiexec --host tyr -np 2 hello_1_mpi
>>> Process 0 of 2 running on tyr.informatik.hs-fulda.de
>>> Now 1 slave tasks are sending greetings.
>>> Greetings from task 1:
>>> ...
>>>
>>> tyr hello_1 54 mpiexec --host sunpc1 -np 2 hello_1_mpi
>>> Process 1 of 2 running on sunpc1.informatik.hs-fulda.de
>>> Process 0 of 2 running on sunpc1.informatik.hs-fulda.de
>>> Now 1 slave tasks are sending greetings.
>>> Greetings from task 1:
>>> ...
>>>
>>> The problem doesn't result from the heterogeneity of the two hosts,
>>> because I get the same error with two Sparc systems or two PCs.
>>> I didn't have any problems with openmpi-1.2.4.
>>>
>>> tyr hello_1 18 mpiexec -mca btl ^udapl --host tyr,sunpc1,linpc1 \
>>>   -np 4 hello_1_mpi
>>> Process 0 of 4 running on tyr.informatik.hs-fulda.de
>>> Process 2 of 4 running on linpc1
>>> Process 1 of 4 running on sunpc1.informatik.hs-fulda.de
>>> Process 3 of 4 running on tyr.informatik.hs-fulda.de
>>> Now 3 slave tasks are sending greetings.
>>> Greetings from task 2:
>>> ...
>>>
>>> tyr hello_1 19 which mpiexec
>>> /usr/local/openmpi-1.2.4/bin/mpiexec
>>>
>>> Do you have any ideas why it doesn't work with openmpi-1.6.0?
>>> I configured the package with
>>>
>>> ../openmpi-1.6/configure --prefix=/usr/local/openmpi-1.6_32_cc \
>>>   LDFLAGS="-m32" \
>>>   CC="cc" CXX="CC" F77="f77" FC="f95" \
>>>   CFLAGS="-m32" CXXFLAGS="-m32 -library=stlport4" FFLAGS="-m32" \
>>>   FCFLAGS="-m32" \
>>>   CPP="cpp" CXXCPP="cpp" \
>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>   C_INCL_PATH="" C_INCLUDE_PATH="" CPLUS_INCLUDE_PATH="" \
>>>   OBJC_INCLUDE_PATH="" MPIHOME="" \
>>>   --without-udapl --without-openib \
>>>   --enable-mpi-f90 --with-mpi-f90-size=small \
>>>   --enable-heterogeneous --enable-cxx-exceptions \
>>>   --enable-orterun-prefix-by-default \
>>>   --with-threads=posix --enable-mpi-thread-multiple \
>>>   --enable-opal-multi-threads \
>>>   --with-hwloc=internal --with-ft=LAM --enable-sparse-groups \
>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.32_cc
>>>
>>> Thank you very much for any help in advance.
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
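[Editor's note: a sketch of how one might check for a stray btl_tcp_if_exclude override, per Jeff's question above. The commands use Open MPI's own tooling, but exact output format varies between versions, and the interface names in the workaround line are examples only.]

```shell
# Was btl_tcp_if_exclude overridden? Check the usual places.
env | grep OMPI_MCA_ || true                       # environment overrides
grep -s btl_tcp ~/.openmpi/mca-params.conf || true  # per-user config file
ompi_info --param btl tcp | grep -i if_             # effective values

# Workaround: exclude the loopback interface explicitly.
# "lo" is the Linux name; on Solaris it is typically "lo0".
mpiexec --mca btl_tcp_if_exclude lo,lo0 --host tyr,sunpc1 -np 3 hello_1_mpi
```

If /etc/hosts is clean (as shown above) and no override is found, comparing ifconfig output on both hosts is the next step, since the TCP BTL advertises every kernel-visible interface that survives the include/exclude filter.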