Joseph,

Thanks for your response. I'm no expert on Linux, so please bear with me. If I understand correctly, using malloc instead of resize should allow me to handle the out-of-memory error properly, but I still see abnormal termination (code is attached).
I have more questions:

1. If the problem is overcommit (i.e., not related to MPI, I suppose), why don't I see it when only rank 0 calls resize? Rank 0 still overcommits, and bad_alloc is caught.
2. In resize, if the returned pointer is null, shouldn't it throw some kind of error so the caller could catch and handle that? I don't know the implementation, but simply exiting doesn't seem like a good idea.

Thanks.

Best regards,
Zhen

On Wed, Apr 3, 2019 at 10:02 AM Joseph Schuchart <schuch...@hlrs.de> wrote:
> Zhen,
>
> The "problem" you're running into is memory overcommit [1]. The system
> will happily hand you a pointer to memory upon calling malloc without
> actually allocating the pages (that's the first step in
> std::vector::resize) and then terminate your application as soon as it
> tries to actually allocate them if the system runs out of memory. This
> happens in std::vector::resize too, which sets each entry in the vector
> to its initial value. There is no way you can catch that. You might
> want to try to disable overcommit in the kernel and see if
> std::vector::resize throws an exception because malloc fails.
>
> HTH,
> Joseph
>
> [1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
>
> On 4/3/19 3:26 PM, Zhen Wang wrote:
> > Hi,
> >
> > I have difficulty catching std::bad_alloc in an MPI environment. The
> > code is attached. I'm using gcc 6.3 on SUSE Linux Enterprise Server 11
> > (x86_64). Open MPI is built from source. The commands are as follows:
> >
> > *Build*
> > g++ -I<openmpi-4.0.0-opt/include> -L<openmpi-4.0.0-opt/lib> -lmpi memory.cpp
> >
> > *Run*
> > <openmpi-4.0.0-opt/bin/mpiexec> -n 2 a.out
> >
> > *Output*
> > 0
> > 0
> > 1
> > 1
> > --------------------------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec noticed that process rank 0 with PID 0 on node cdcebus114qa05
> > exited on signal 9 (Killed).
> > --------------------------------------------------------------------------
> >
> > If I uncomment the line //if (rank == 0), i.e., only rank 0 allocates
> > memory, I'm able to catch bad_alloc as I expected. It seems that I am
> > misunderstanding something. Could you please help? Thanks a lot.
> >
> > Best regards,
> > Zhen
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
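For reference, the overcommit mode Joseph mentions can be inspected and changed via sysctl. This is a sketch, not a recommendation for production systems: changing the mode requires root, and in mode 2 the commit limit also depends on vm.overcommit_ratio (or vm.overcommit_kbytes), so too strict a setting can make unrelated allocations fail.

```shell
# Show the current overcommit mode:
# 0 = heuristic (default), 1 = always overcommit, 2 = don't overcommit
cat /proc/sys/vm/overcommit_memory

# Disable overcommit so malloc fails up front (and bad_alloc is thrown)
# instead of the OOM killer sending SIGKILL later. Requires root.
sysctl -w vm.overcommit_memory=2
```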
#include "mpi.h"
#include <iostream>
#include <vector>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>  /* malloc */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        double *a[100];
        for (long long i = 0; i < 100; i++)
        {
            std::cout << i << std::endl;
            a[i] = (double *)malloc(100000000 * sizeof(double));
            if (!a[i])
            {
                /* malloc failed: report it and try the next iteration */
                std::cout << "out" << std::endl;
                continue;
            }
            /* Touch every page so the kernel actually backs the allocation */
            memset(a[i], 0, 100000000 * sizeof(double));
            usleep(1000000);
        }
    }

    MPI_Finalize();
    return 0;
}