Joseph,

Thanks for your response. I'm no expert on Linux, so please bear with me. If
I understand correctly, using malloc instead of resize should allow me to
handle out-of-memory errors properly, but I still see abnormal termination
(the code is attached).

I have more questions.

1. If the problem is overcommit (meaning it's not related to MPI, I suppose),
why don't I see it when only rank 0 calls resize? Rank 0 still overcommits,
and yet bad_alloc is caught.

2. In resize, if the underlying allocation returns a null pointer, shouldn't
it throw some kind of error so the caller can catch and handle it? I don't
know the implementation, but simply exiting doesn't seem like a good idea.
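
For reference, this is the pattern I would expect to work; a minimal sketch
along the lines of my attached test, assuming resize reports an allocation
failure by throwing std::bad_alloc:

#include <iostream>
#include <new>
#include <vector>

int main()
{
  std::vector<std::vector<double> > a;
  try
  {
    for (int i = 0; i < 100; i++)
    {
      std::cout << i << std::endl;
      // resize value-initializes the new elements, so the pages are
      // actually written here, not just reserved.
      a.push_back(std::vector<double>());
      a.back().resize(100000000);
    }
  }
  catch (const std::bad_alloc &)
  {
    // Reached if the allocation itself fails (e.g. with overcommit
    // disabled); with overcommit the OOM killer may kill the process first.
    std::cout << "caught bad_alloc" << std::endl;
  }
  return 0;
}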

Thanks.

Best regards,
Zhen


On Wed, Apr 3, 2019 at 10:02 AM Joseph Schuchart <schuch...@hlrs.de> wrote:

> Zhen,
>
> The "problem" you're running into is memory overcommit [1]. The system
> will happily hand you a pointer to memory upon calling malloc without
> actually allocating the pages (that's the first step in
> std::vector::resize) and then terminate your application as soon as it
> tries to actually allocate them if the system runs out of memory. This
> happens in std::vector::resize too, which sets each entry in the vector
> to its initial value. There is no way you can catch that. You might
> want to try to disable overcommit in the kernel and see if
> std::vector::resize throws an exception because malloc fails.
>
> HTH,
> Joseph
>
> [1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
>
> On 4/3/19 3:26 PM, Zhen Wang wrote:
> > Hi,
> >
> > I have difficulty catching std::bad_alloc in an MPI environment. The
> > code is attached. I'm using gcc 6.3 on SUSE Linux Enterprise Server 11
> > (x86_64). OpenMPI is built from source. The commands are as follows:
> >
> > *Build*
> > g++ -I<openmpi-4.0.0-opt/include> -L<openmpi-4.0.0-opt/lib> -lmpi memory.cpp
> >
> > *Run*
> > <openmpi-4.0.0-opt/bin/mpiexec> -n 2 a.out
> >
> > *Output*
> > 0
> > 0
> > 1
> > 1
> > --------------------------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec noticed that process rank 0 with PID 0 on node cdcebus114qa05
> > exited on signal 9 (Killed).
> > --------------------------------------------------------------------------
> >
> > If I uncomment the line //if (rank == 0), i.e., so that only rank 0
> > allocates memory, I'm able to catch bad_alloc as I expected. It seems
> > that I am misunderstanding something. Could you please help? Thanks a lot.
> >
> >
> >
> > Best regards,
> > Zhen
> >
#include "mpi.h"
#include <iostream>
#include <vector>
#include <unistd.h>
#include <string.h>

int main( int argc, char *argv[] )
{
  MPI_Init( &argc, &argv );

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
  {
    double * a[100];
    for (long long i = 0; i < 100; i++)
    {
      std::cout << i << std::endl;
      a[i] = (double *)malloc(100000000*sizeof(double));
      if (!a[i])
      {
        std::cout << "out" << std::endl;
        continue;
      }
      memset(a[i], 0, 100000000*sizeof(double));
      usleep(1000000);
    }
  }

  MPI_Finalize();
  return 0;
}
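
As a side note, regarding Joseph's pointer to the kernel overcommit
documentation: a small, separate sketch (not part of the attached test above)
that only prints the current overcommit mode, assuming the standard
/proc/sys/vm/overcommit_memory interface:

#include <fstream>
#include <iostream>

int main()
{
  // 0 = heuristic overcommit (the default), 1 = always overcommit,
  // 2 = strict accounting (allocations should fail instead of overcommitting)
  std::ifstream f("/proc/sys/vm/overcommit_memory");
  int mode;
  if (f >> mode)
    std::cout << "vm.overcommit_memory = " << mode << std::endl;
  else
    std::cout << "could not read /proc/sys/vm/overcommit_memory" << std::endl;
  return 0;
}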