[OMPI users] Exit code 65
Hi,

I am running an application (OpenFOAM) on 20 processors under Ubuntu 16.04, and occasionally mpirun exits with an exit code of 65. I looked at the documentation and it says:

MPI_T_ERR_PVAR_NO_STARTSTOP 65 Variable cannot be started or stopped.

The mpi.h file on my machine has the same error code listed. I have no idea what this means. It does not happen all the time, and if I restart the job it usually runs fine to the end. This is a new (to me) rebuilt computer, so I am wondering whether it indicates a hardware problem.

Thanks,
Bill

William C. Lasher
Professor Emeritus of Mechanical Engineering
Penn State Behrend
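One way to check what a given error code means in a particular MPI installation is MPI_Error_string. A minimal Fortran sketch, assuming (as the mpi.h lookup above suggests) that 65 is a valid error code in this build; note that mpirun's exit status is not guaranteed to correspond to an MPI error code, so the match with MPI_T_ERR_PVAR_NO_STARTSTOP may be coincidental:

program errstr
   use mpi
   implicit none
   character(len=MPI_MAX_ERROR_STRING) :: msg
   integer :: ierr, msglen
   call MPI_INIT(ierr)
   ! Ask this MPI library what error code 65 means to it
   call MPI_ERROR_STRING(65, msg, msglen, ierr)
   print *, msg(1:msglen)
   call MPI_FINALIZE(ierr)
end program errstr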
Re: [OMPI users] openMPI and ifort debugging flags, is it possible?
Diego,

Yes, that would clearly be an issue.

Cheers,

Gilles

On Friday, August 3, 2018, Diego Avesani wrote:
> Dear Gilles, dear all,
>
> I do not remember. I use -r8 when I compile.
>
> What do you think? Could it be a problem?
>
> Thanks a lot
>
> Diego
>
> On 27 July 2018 at 16:05, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> Diego,
>>
>> Did you build OpenMPI with FCFLAGS=-r8 ?
>> [...]
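To spell out why that is an issue (a hedged sketch, with illustrative names): compiling application code with -r8 promotes default REAL to 8 bytes, so a default-real buffer no longer matches what MPI_REAL describes unless Open MPI's Fortran bindings were built with the same flag:

program r8_mismatch
   use mpi
   implicit none
   integer :: ierr
   real :: x   ! 8 bytes under -r8, but MPI_REAL may still describe a 4-byte real
   call MPI_INIT(ierr)
   x = 1.0
   ! If the library was built without -r8, this transfers the wrong number
   ! of bytes. A safer pattern is an explicit kind with a matching type:
   !    real(kind=8) :: y
   !    call MPI_BCAST(y, 1, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
   call MPI_BCAST(x, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
   call MPI_FINALIZE(ierr)
end program r8_mismatch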
Re: [OMPI users] local communicator and crash of the code
If you are trying to create a communicator containing all node-local processes, then use MPI_Comm_split_type.

> On Aug 3, 2018, at 12:24 PM, Diego Avesani wrote:
> [...]
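A minimal sketch of that suggestion: MPI_COMM_SPLIT_TYPE with MPI_COMM_TYPE_SHARED groups the ranks that share memory, i.e. one communicator per node, with no color bookkeeping needed (variable names here are illustrative):

program node_local
   use mpi
   implicit none
   integer :: ierr, rank, local_comm, local_rank, local_size
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   ! key = rank keeps the ordering of MPI_COMM_WORLD within each node
   call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, &
                            MPI_INFO_NULL, local_comm, ierr)
   call MPI_COMM_RANK(local_comm, local_rank, ierr)
   call MPI_COMM_SIZE(local_comm, local_size, ierr)
   print *, 'world rank', rank, '-> node-local rank', local_rank, 'of', local_size
   call MPI_COMM_FREE(local_comm, ierr)
   call MPI_FINALIZE(ierr)
end program node_local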
Re: [OMPI users] local communicator and crash of the code
Dear all,

Probably I have found the error. Let me check. Probably I have not set up the colors properly.

Thanks a lot. I hope you have not lost too much time on me; I will let you know if that was the problem.

Thanks again

Diego

On 3 August 2018 at 19:57, Diego Avesani wrote:
> [...]
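For reference, a minimal sketch of color-based splitting with MPI_COMM_SPLIT, using an illustrative grouping rule: every rank must supply a non-negative color (or MPI_UNDEFINED for ranks to leave out, which then receive MPI_COMM_NULL and must not query the result):

program split_colors
   use mpi
   implicit none
   integer :: ierr, rank, colorl, local_comm, local_rank
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   ! Illustrative rule: group ranks in pairs. Ranks with the same color
   ! end up in the same sub-communicator; the key (rank) orders them.
   colorl = rank / 2
   call MPI_COMM_SPLIT(MPI_COMM_WORLD, colorl, rank, local_comm, ierr)
   if (local_comm /= MPI_COMM_NULL) then
      call MPI_COMM_RANK(local_comm, local_rank, ierr)
      print *, 'world rank', rank, '-> local rank', local_rank
      call MPI_COMM_FREE(local_comm, ierr)
   end if
   call MPI_FINALIZE(ierr)
end program split_colors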
Re: [OMPI users] local communicator and crash of the code
Dear R, dear all,

I do not know. I have isolated the issue. It seems that I have some problem with:

CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, colorl, MPIworld%rank, MPI_LOCAL_COMM, MPIworld%iErr)
CALL MPI_COMM_RANK(MPI_LOCAL_COMM, MPIlocal%rank, MPIlocal%iErr)
CALL MPI_COMM_SIZE(MPI_LOCAL_COMM, MPIlocal%nCPU, MPIlocal%iErr)

Open MPI seems unable to set MPIlocal%rank properly.

What could it be? A bug?

thanks again

Diego

On 3 August 2018 at 19:47, Ralph H Castain wrote:
> Those two command lines look exactly the same to me - what am I missing?
> [...]
Re: [OMPI users] local communicator and crash of the code
Those two command lines look exactly the same to me - what am I missing?

> On Aug 3, 2018, at 10:23 AM, Diego Avesani wrote:
>
> when I run my code as
>
> mpirun -np 4 --oversubscribe ./MPIHyperStrem
>
> I have no problem, while when I run it as
>
> mpirun -np 4 --oversubscribe ./MPIHyperStrem
>
> sometimes it crashes and sometimes not.
> [...]
[OMPI users] local communicator and crash of the code
Dear all,

I am experiencing a strange error.

In my code I use three communicators:

MPI_COMM_WORLD
MPI_MASTERS_COMM
LOCAL_COMM

which have some CPUs in common.

When I run my code as

mpirun -np 4 --oversubscribe ./MPIHyperStrem

I have no problem, while when I run it as

mpirun -np 4 --oversubscribe ./MPIHyperStrem

sometimes it crashes and sometimes not.

It seems that it is all linked to

CALL MPI_REDUCE(QTS(tstep,:), QTS(tstep,:), nNode, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_LOCAL_COMM, iErr)

which works within the local communicator.

What do you think? Can you please suggest some debugging tests?
Is the problem related to local communicators?

Thanks

Diego
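One property of the MPI_REDUCE call above worth knowing, offered as a hedged aside: the MPI standard forbids passing the same buffer as both the send and the receive argument (QTS(tstep,:) appears in both slots), and aliased buffers can fail intermittently. A self-contained sketch of the standard MPI_IN_PLACE form, with illustrative names:

program reduce_in_place
   use mpi
   implicit none
   integer, parameter :: nNode = 8
   double precision :: qts(nNode), dummy(1)
   integer :: ierr, rank
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   qts = 1.0d0
   ! At the root, MPI_IN_PLACE makes the receive buffer double as the send
   ! buffer, avoiding the aliasing. On non-root ranks the receive buffer is
   ! ignored, so a small dummy is enough.
   if (rank == 0) then
      call MPI_REDUCE(MPI_IN_PLACE, qts, nNode, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, 0, MPI_COMM_WORLD, ierr)
   else
      call MPI_REDUCE(qts, dummy, nNode, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, 0, MPI_COMM_WORLD, ierr)
   end if
   if (rank == 0) print *, 'sum of first element:', qts(1)
   call MPI_FINALIZE(ierr)
end program reduce_in_place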
Re: [OMPI users] openMPI and ifort debugging flags, is it possible?
Dear Gilles, dear all,

I do not remember. I use -r8 when I compile.

What do you think? Could it be a problem?

Thanks a lot

Diego

On 27 July 2018 at 16:05, Gilles Gouaillardet wrote:
> Diego,
>
> Did you build OpenMPI with FCFLAGS=-r8 ?
>
> Cheers,
>
> Gilles
>
> On Friday, July 27, 2018, Diego Avesani wrote:
>
>> Dear all,
>>
>> I am developing a code for hydrological applications. It is written in
>> FORTRAN, and I am using ifort combined with Open MPI.
>>
>> At the moment I am debugging my code because I have some NaN errors. As a
>> consequence, I have introduced some flags for the ifort compiler in my
>> Makefile. In particular:
>>
>> -c -r8 -align -CB -traceback -check all -check uninit -ftrapuv -debug all -fpp
>>
>> However, this produces some unexpected errors/warnings with mpirun. This
>> is the error/warning:
>>
>> Image          PC            Routine            Line     Source
>> MPIHyperStrem  005AA3F0      Unknown            Unknown  Unknown
>> MPIHyperStrem  00591A5C      mod_lathyp_mp_lat  219      LATHYP.f90
>> MPIHyperStrem  005A0C2A      mod_optimizer_mp_  279      OPTIMIZER.f90
>> MPIHyperStrem  005986F2      mod_optimizer_mp_  34       OPTIMIZER.f90
>> MPIHyperStrem  005A1F84      MAIN__             114      MAIN.f90
>> MPIHyperStrem  0040A46E      Unknown            Unknown  Unknown
>> libc-2.23.so   7FEA758B8830  __libc_start_main  Unknown  Unknown
>> MPIHyperStrem  0040A369      Unknown            Unknown  Unknown
>> forrtl: warning (406): fort: (1): In call to SHUFFLE, an array temporary
>> was created for argument #1
>>
>> My question is:
>> Is it possible to use ifort debugging flags with Open MPI?
>>
>> thanks a lot
>>
>> Diego
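On the forrtl warning (406): ifort reports it (with -check all) whenever a non-contiguous array section is passed to a dummy argument that requires contiguous storage, forcing the compiler to create a temporary copy. It signals a performance cost, not an error. A minimal sketch of the pattern; SHUFFLE here is a stand-in with an assumed signature, not the thread's actual routine:

program array_temp
   implicit none
   real :: qts(10, 5)
   qts = 1.0
   ! qts(3,:) is a strided (non-contiguous) section; because SHUFFLE's dummy
   ! argument is explicit-shape, the compiler copies the section into a
   ! contiguous temporary before the call, which is what warning (406) reports.
   call shuffle(qts(3, :), 5)

contains

   subroutine shuffle(a, n)
      integer, intent(in) :: n
      real, intent(inout) :: a(n)  ! explicit-shape dummy requires contiguous storage
      a = a * 2.0
   end subroutine shuffle

end program array_temp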
Re: [OMPI users] Settings oversubscribe as default?
The equivalent MCA param is rmaps_base_oversubscribe=1. You can add OMPI_MCA_rmaps_base_oversubscribe to your environment, or set rmaps_base_oversubscribe in your default MCA param file.

> On Aug 3, 2018, at 1:24 AM, Florian Lindner wrote:
>
> I can use --oversubscribe to enable oversubscribing. What is the Open MPI
> way to set this as a default, e.g. through a config file option or an
> environment variable?
> [...]
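A sketch of both options; the per-user parameter file path below is the commonly documented default, so adjust it to your installation:

# Option 1: environment variable, e.g. in your shell startup file
export OMPI_MCA_rmaps_base_oversubscribe=1

# Option 2: default MCA parameter file
echo "rmaps_base_oversubscribe = 1" >> $HOME/.openmpi/mca-params.conf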
Re: [OMPI users] Comm_connect: Data unpack would read past end of buffer
The buffer being overrun isn't anything to do with you - it's an internal buffer used as part of creating the connections. It indicates a problem in OMPI.

The 1.10 series is out of the support window, but if you want to stick with it you should at least update to the last release in that series - I believe that is 1.10.7. The OMPI v2.x series had problems supporting dynamics, so you should skip that one. If you want to come all the way forward, you should take the OMPI v3.x series.

Ralph

> On Aug 3, 2018, at 3:40 AM, Florian Lindner wrote:
> [...]
[OMPI users] Comm_connect: Data unpack would read past end of buffer
Hello,

I have this piece of code:

MPI_Comm icomm;
INFO << "Accepting connection on " << portName;
MPI_Comm_accept(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &icomm);

and sometimes (in roughly 1 of 5 runs) I get:

[helium:33883] [[32673,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm_orte.c at line 406
[helium:33883] *** An error occurred in MPI_Comm_accept
[helium:33883] *** reported by process [2141257729,0]
[helium:33883] *** on communicator MPI_COMM_SELF
[helium:33883] *** MPI_ERR_UNKNOWN: unknown error
[helium:33883] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[helium:33883] ***    and potentially your MPI job)
[helium:33883] [0] func:/usr/lib/libopen-pal.so.13(opal_backtrace_buffer+0x33) [0x7fc1ad0ac6e3]
[helium:33883] [1] func:/usr/lib/libmpi.so.12(ompi_mpi_abort+0x365) [0x7fc1af4955e5]
[helium:33883] [2] func:/usr/lib/libmpi.so.12(ompi_mpi_errors_are_fatal_comm_handler+0xe2) [0x7fc1af487e72]
[helium:33883] [3] func:/usr/lib/libmpi.so.12(ompi_errhandler_invoke+0x145) [0x7fc1af4874b5]
[helium:33883] [4] func:/usr/lib/libmpi.so.12(MPI_Comm_accept+0x262) [0x7fc1af4a90e2]
[helium:33883] [5] func:./mpiports() [0x41e43d]
[helium:33883] [6] func:/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fc1ad7a1830]
[helium:33883] [7] func:./mpiports() [0x41b249]

Before that, I check the length of portName:

DEBUG << "COMM ACCEPT portName.size() = " << portName.size();
DEBUG << "MPI_MAX_PORT_NAME = " << MPI_MAX_PORT_NAME;

which both return 1024.

I am completely puzzled as to how I can get a buffer issue, unless something is faulty with std::string portName.

Any clues?

Launch command: mpirun -n 4 -mca opal_abort_print_stack 1
Open MPI 1.10.2 on Ubuntu 16.

Thanks,
Florian
[OMPI users] Settings oversubscribe as default?
Hello,

I can use --oversubscribe to enable oversubscribing. What is the Open MPI way to set this as a default, e.g. through a config file option or an environment variable?

Thanks,
Florian