Re: [OMPI users] Cannot catch std::bad_alloc?

2019-04-03 Thread Zhen Wang
OK. After help from several forums, I think I understand the cause of the
problem. As Jeff said, it has nothing to do with MPI.

Linux allows memory overcommit (thanks, Joseph). In my case, say the machine
has 32GB of RAM, 2GB is already in use, and each of the two MPI processes
tries to allocate 8GB at a time. Then this happens:

Initial memory usage: 2GB
After the first memory allocation: 18GB
In the second memory allocation, each MPI process thinks it has enough
space for another 8GB because of overcommit, and writes to it. But that
requires 34GB, which exceeds the RAM size. So the Linux out-of-memory (OOM)
killer kills one of the MPI processes (sends it a SIGKILL signal), and the
other MPI process receives a SIGTERM signal.
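
The arithmetic above can be written out as a short sketch. The function
names are hypothetical, and the process count of 2 is taken from the
"-n 2" run in this thread. The model only counts touched memory and
ignores swap and kernel reserves, so it is an illustration rather than a
prediction of when the OOM killer fires.

```cpp
#include <iostream>

// Memory in use (GB) after `rounds` allocation rounds, when each of
// `procs` processes allocates and touches `per_alloc_gb` per round.
// Hypothetical helper for illustration; ignores swap and kernel reserves.
long touched_gb(long initial_gb, int procs, long per_alloc_gb, int rounds)
{
  return initial_gb + static_cast<long>(procs) * per_alloc_gb * rounds;
}

// True when touching that much memory would exceed physical RAM, which is
// the point where the OOM killer can step in.
bool exceeds_ram(long ram_gb, long initial_gb, int procs, long per_alloc_gb,
                 int rounds)
{
  return touched_gb(initial_gb, procs, per_alloc_gb, rounds) > ram_gb;
}

// Prints the per-round totals for the scenario described in this message:
// 32GB RAM, 2GB in use, 2 processes, 8GB per allocation.
void print_scenario()
{
  for (int round = 0; round <= 2; ++round)
  {
    std::cout << "round " << round << ": " << touched_gb(2, 2, 8, round)
              << "GB"
              << (exceeds_ram(32, 2, 2, 8, round) ? " (exceeds 32GB)" : "")
              << std::endl;
  }
}
```

Calling print_scenario() from main walks through rounds 0, 1 and 2 of the
example; only the second round crosses the 32GB limit.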

This also explains my earlier questions.

The reason this problem doesn't happen on Windows is that Windows doesn't
allow overcommit.

Thanks again everyone.

Best regards,
Zhen



Re: [OMPI users] Cannot catch std::bad_alloc?

2019-04-03 Thread Jeff Hammond
This is not an MPI problem.  You will likely find StackOverflow to be a
more effective way to get support on C++ issues.

Jeff




-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Cannot catch std::bad_alloc?

2019-04-03 Thread Zhen Wang
Joseph,

Thanks for your response. I'm no expert on Linux, so please bear with me. If
I understand correctly, using malloc instead of resize should allow me to
handle an out-of-memory error properly, but I still see abnormal termination
(code is attached).

I have more questions.

1. If the problem is overcommit (meaning it is not related to MPI, I suppose),
why don't I see it if only MPI rank 0 calls resize? Rank 0 still overcommits,
and bad_alloc is caught.

2. In resize, if the returned pointer is null, should it throw some kind of
error so the caller could catch and handle that? I don't know the
implementation, but simply exiting doesn't seem like a good idea.
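
On question 2, here is a minimal sketch of what the standard library does
when the allocation itself fails: std::vector::resize reports it by
throwing (std::bad_alloc from the allocator, or std::length_error for a
size above max_size()) rather than exiting. try_resize is a hypothetical
helper, not code from this thread. Under overcommit the catch branches may
never be reached, because the allocation appears to succeed and the
process is killed later, when the pages are touched.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns true if a vector of n doubles could be
// created and resized, false if the request failed with an exception.
bool try_resize(std::size_t n)
{
  try
  {
    std::vector<double> v;
    v.resize(n);  // allocates, then value-initializes every element
    return true;
  }
  catch (const std::bad_alloc &)
  {
    return false;  // the allocator could not obtain the memory
  }
  catch (const std::length_error &)
  {
    return false;  // n is larger than v.max_size()
  }
}
```

A caller can use the return value to fall back to a smaller size. What the
helper cannot do is protect against the OOM killer, since SIGKILL is
delivered outside the C++ exception mechanism.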

Thanks.

Best regards,
Zhen


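#include "mpi.h"
#include <iostream>
#include <stdlib.h>  // malloc
#include <string.h>  // memset
#include <unistd.h>  // usleep

int main( int argc, char *argv[] )
{
  MPI_Init( &argc, &argv );

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
  {
    double * a[100];
    for (long long i = 0; i < 100; i++)
    {
      std::cout << i << std::endl;
      // The element count of 1 is as it appears in the archived attachment;
      // the thread describes 8GB per allocation.
      a[i] = (double *)malloc(1*sizeof(double));
      if (!a[i])
      {
        // Only reached if malloc itself reports failure.
        std::cout << "out" << std::endl;
        continue;
      }
      // Writing to the block forces the kernel to back it with real pages.
      memset(a[i], 0, 1*sizeof(double));
      usleep(100);
    }
  }

  MPI_Finalize();
  return 0;
}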

Re: [OMPI users] Cannot catch std::bad_alloc?

2019-04-03 Thread Joseph Schuchart

Zhen,

The "problem" you're running into is memory overcommit [1]. The system
will happily hand you a pointer to memory upon calling malloc (that's the
first step in std::vector::resize) without actually allocating the pages,
and then terminate your application as soon as it tries to actually use
them if the system runs out of memory. This happens in
std::vector::resize too, which sets each entry in the vector to its
initial value. There is no way you can catch that. You might want to try
to disable overcommit in the kernel and see if std::vector::resize throws
an exception because malloc fails.
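
To see which policy a machine is using before trying this, the mode can be
read from /proc/sys/vm/overcommit_memory, which the kernel documentation
[1] describes as 0 (heuristic), 1 (always overcommit), or 2 (never
overcommit). A minimal sketch follows; both function names are
hypothetical.

```cpp
#include <fstream>
#include <string>

// Maps the integer found in /proc/sys/vm/overcommit_memory to a short
// description of the policy.
std::string describe_overcommit_mode(int mode)
{
  switch (mode)
  {
    case 0: return "heuristic overcommit (default)";
    case 1: return "always overcommit";
    case 2: return "no overcommit (strict accounting)";
    default: return "unknown";
  }
}

// Reads the current mode. Returns -1 if the file is missing or cannot be
// parsed, for example on a non-Linux system.
int read_overcommit_mode(const char *path = "/proc/sys/vm/overcommit_memory")
{
  std::ifstream in(path);
  int mode = -1;
  if (!(in >> mode))
    return -1;
  return mode;
}
```

Printing describe_overcommit_mode(read_overcommit_mode()) shows the active
policy. In mode 2 an oversized malloc is expected to fail up front, which
is the case where std::vector::resize can throw. Changing the mode needs
root (for example "sysctl vm.overcommit_memory=2") and affects every
process on the machine.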


HTH,
Joseph

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

On 4/3/19 3:26 PM, Zhen Wang wrote:

Hi,

I have difficulty catching std::bad_alloc in an MPI environment. The
code is attached. I'm using gcc 6.3 on SUSE Linux Enterprise Server 11
(x86_64). OpenMPI is built from source. The commands are as follows:


*Build*
g++ -I -L -lmpi memory.cpp

*Run*
 -n 2 a.out

*Output*
0
0
1
1
--
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpiexec noticed that process rank 0 with PID 0 on node cdcebus114qa05 
exited on signal 9 (Killed).

--


If I uncomment the line //if (rank == 0), i.e., only rank 0 allocates
memory, I'm able to catch bad_alloc as I expected. It seems that I am
misunderstanding something. Could you please help? Thanks a lot.




Best regards,
Zhen
