Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Jeff Hammond
If you’re reporting a bug and have a reproducer, I recommend creating a
GitHub issue and only posting on the user list if you don’t get the
attention you want there.

Best,

Jeff

On Sat, Mar 16, 2019 at 1:16 PM Thomas Pak 
wrote:

> Dear Jeff,
>
> I did find a way to circumvent this issue for my specific application by
> spawning less frequently. However, I wanted to at least bring attention to
> this issue for the OpenMPI community, as it can be reproduced with an
> alarmingly simple program.
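
One concrete way to spawn less frequently is to batch the children: request
several of them per MPI_Comm_spawn call via maxprocs instead of one per
iteration, so the spawn/disconnect machinery runs far less often for the same
number of children. A minimal sketch of that pattern (BATCH_SIZE is just an
illustrative constant, and it assumes enough slots are available for a whole
batch):

"""
/* Minimal sketch: batched spawning. BATCH_SIZE is an illustrative knob,
 * not an Open MPI setting; the point is only to call MPI_Comm_spawn less often. */
#include <mpi.h>
#include <stdio.h>

#define BATCH_SIZE 8   /* children requested per spawn call */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: one spawn call launches BATCH_SIZE children at once. */
        for (int batch = 0; batch < 100; batch++) {
            MPI_Comm intercomm;
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, BATCH_SIZE, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&intercomm);
        }
    } else {
        /* Child: do its work, then disconnect from the parent. */
        puts("I was spawned!");
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
"""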
>
> Perhaps the user's mailing list is not the ideal place for this. Would you
> recommend that I report this issue on the developer's mailing list or open
> a GitHub issue?
>
> Best wishes,
> Thomas Pak
>
> On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
>
> Is there perhaps a different way to solve your problem that doesn’t spawn
> so much as to hit this issue?
>
> I’m not denying there’s an issue here, but in a world of finite human
> effort and fallible software, sometimes it’s easiest to just avoid the bugs
> altogether.
>
> Jeff
>
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak 
> wrote:
>
> Dear all,
>
> Does anyone have any clue on what the problem could be here? This seems to
> be a persistent problem present in all currently supported OpenMPI releases
> and indicates that there is a fundamental flaw in how OpenMPI handles
> dynamic process creation.
>
> Best wishes,
> Thomas Pak
>
> *From: *"Thomas Pak" 
> *To: *users@lists.open-mpi.org
> *Sent: *Friday, 7 December, 2018 17:51:29
> *Subject: *[OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>
> Dear all,
>
> My MPI application spawns a large number of MPI processes using
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced
> that this results in problems for all currently supported OpenMPI versions
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in
> C (included below) that spawns child processes using MPI_Comm_spawn in an
> infinite loop, where each child process exits after writing a message to
> stdout. This short program leads to the following issues:
>
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the
> program leads to a pipe leak where pipes keep accumulating over time until
> my MPI application crashes because the maximum number of pipes has been
> reached.
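
To quantify that leak while the loop runs, here is a minimal, Linux-only
sketch (it assumes a /proc filesystem) that counts the descriptors currently
open in the calling process; printing this every few iterations from the
parent shows whether pipes are accumulating:

"""
/* Minimal sketch: count open file descriptors of the calling process by
 * scanning /proc/self/fd (Linux-specific; the scan itself briefly holds
 * one extra descriptor). Returns -1 if the directory cannot be opened. */
#include <dirent.h>

static int count_open_fds(void) {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip the "." and ".." entries. */
        if (entry->d_name[0] != '.')
            count++;
    }
    closedir(dir);
    return count;
}

/* Example use inside the spawn loop (parent side):
 *     if (iteration % 100 == 0)
 *         printf("iteration %d: %d open fds\n", iteration, count_open_fds());
 */
"""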
>
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to
> be no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>
> In version 4.0.0 (compiled from source), I have not been able to test this
> issue very thoroughly because mpiexec ignores the --oversubscribe
> command-line flag (as detailed in this GitHub issue
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the
> oversubscription of processor cores, which means that spawning additional
> processes immediately results in an error because "not enough slots" are
> available. A fix for this was proposed recently (
> https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x
> developer branch is being actively developed right now, I decided not to go
> into it.
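
Independent of the slots question, the MPI standard also allows a program to
observe a failed spawn instead of aborting, by installing MPI_ERRORS_RETURN on
the spawning communicator and checking the return code and error-code array.
A minimal sketch of that pattern (try_spawn and the fixed-size errcodes array
are illustrative, not part of the reproducer above):

"""
/* Minimal sketch: report a failed MPI_Comm_spawn (e.g. "not enough slots")
 * instead of aborting, by using MPI_ERRORS_RETURN on the spawning comm. */
#include <mpi.h>
#include <stdio.h>

static int try_spawn(const char *cmd, int nchildren, MPI_Comm *intercomm) {
    int errcodes[16];                      /* assumes nchildren <= 16 */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int rc = MPI_Comm_spawn((char *)cmd, MPI_ARGV_NULL, nchildren,
                            MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            intercomm, errcodes);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
        return -1;
    }

    /* Even on success, individual children may have failed to start. */
    for (int i = 0; i < nchildren; i++) {
        if (errcodes[i] != MPI_SUCCESS)
            fprintf(stderr, "child %d failed to start (error %d)\n", i, errcodes[i]);
    }
    return 0;
}
"""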
>
> I have found one e-mail thread on this mailing list about a similar
> problem (
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In
> this thread, Ralph Castain states that this is a known issue and suggests
> that it would be fixed in the then-upcoming v1.3.x release. However, version
> 1.3 is no longer supported and the issue has reappeared, so this did not
> resolve the issue.
>
> I have created a GitHub gist that contains the output from "ompi_info
> --all" of all the OpenMPI installations mentioned here, as well as the
> config.log files for the OpenMPI installations that I compiled from source:
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
>
> I have also attached the code for the short program that demonstrates
> these issues. For good measure, I have included it directly here as well:
>
> """
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>
>     // Initialize MPI
>     MPI_Init(NULL, NULL);
>
>     // Get parent
>     MPI_Comm parent;
>     MPI_Comm_get_parent(&parent);
>
>     // If the process was not spawned
>     if (parent == MPI_COMM_NULL) {
>
>         puts("I was not spawned!");
>
>         // Spawn child process in loop
>         char *cmd = argv[0];
>         char **cmd_argv = MPI_ARGV_NULL;
>         int maxprocs = 1;
>         MPI_Info info = MPI_INFO_NULL;
>         int root = 0;
>         MPI_Comm comm = MPI_COMM_SELF;
>         MPI_Comm intercomm;
>         int *array_of_errcodes = MPI_ERRCODES_IGNORE;
>
>         for (;;) {
>             MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
>                            &intercomm, array_of_errcodes);
>
>             MPI_Comm_disconnect(&intercomm);
>         }
>
>     // If process was spawned
>     } else {
>
>         puts("I was spawned!");
>
>         MPI_Comm_disconnect(&parent);
>     }
>
>     // Finalize
>     MPI_Finalize();
>
> }
> """
>
> Thanks in advance and best wishes,
> Thomas Pak
>

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Gilles Gouaillardet
FWIW I could observe some memory leaks on both mpirun and MPI task 0 with the 
latest master branch.

So I guess mileage varies depending on available RAM and number of iterations.
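
A minimal sketch for watching that growth from inside the task (assuming a
POSIX system; ru_maxrss is reported in kilobytes on Linux), called every so
many spawn iterations:

"""
/* Minimal sketch: sample this process's peak resident set size with
 * getrusage to watch for growth across spawn iterations. */
#include <stdio.h>
#include <sys/resource.h>

static void report_memory(long iteration) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0)
        printf("iteration %ld: peak RSS %ld kB\n", iteration, usage.ru_maxrss);
}
"""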

Sent from my iPod

> On Mar 17, 2019, at 20:47, Riebs, Andy  wrote:
> 
> Thomas, your test case is somewhat similar to a bash fork() bomb -- not the 
> same, but similar. After running one of your failing jobs, you might check to 
> see if the “out-of-memory” (“OOM”) killer has been invoked. If it has, that 
> can lead to unexpected consequences, such as what you’ve reported.
>  
> An easy way to check would be
> $ nodes=${ job’s node list }
> $ pdsh -w $nodes dmesg -T \| grep \"Out of memory\" 2>/dev/null
>  
> Andy
>  
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Thomas Pak
> Sent: Saturday, March 16, 2019 4:14 PM
> To: Open MPI Users 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear Jeff,
>  
> I did find a way to circumvent this issue for my specific application by 
> spawning less frequently. However, I wanted to at least bring attention to 
> this issue for the OpenMPI community, as it can be reproduced with an 
> alarmingly simple program.
>  
> Perhaps the user's mailing list is not the ideal place for this. Would you 
> recommend that I report this issue on the developer's mailing list or open a 
> GitHub issue?
>  
> Best wishes,
> Thomas Pak
>  
> On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
> Is there perhaps a different way to solve your problem that doesn’t spawn so 
> much as to hit this issue?
>  
> I’m not denying there’s an issue here, but in a world of finite human effort 
> and fallible software, sometimes it’s easiest to just avoid the bugs 
> altogether.
>  
> Jeff
>  
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak  wrote:
> Dear all,
>  
> Does anyone have any clue on what the problem could be here? This seems to be 
> a persistent problem present in all currently supported OpenMPI releases and 
> indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
> process creation.
>  
> Best wishes,
> Thomas Pak
>  
> From: "Thomas Pak" 
> To: users@lists.open-mpi.org
> Sent: Friday, 7 December, 2018 17:51:29
> Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear all,
>  
> My MPI application spawns a large number of MPI processes using 
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced 
> that this results in problems for all currently supported OpenMPI versions 
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in C 
> (included below) that spawns child processes using MPI_Comm_spawn in an 
> infinite loop, where each child process exits after writing a message to 
> stdout. This short program leads to the following issues:
>  
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
> program leads to a pipe leak where pipes keep accumulating over time until my 
> MPI application crashes because the maximum number of pipes has been reached.
>  
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be 
> no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>  
> In version 4.0.0 (compiled from source), I have not been able to test this 
> issue very thoroughly because mpiexec ignores the --oversubscribe 
> command-line flag (as detailed in this GitHub issue 
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
> oversubscription of processor cores, which means that spawning additional 
> processes immediately results in an error because "not enough slots" are 
> available. A fix for this was proposed recently 
> (https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
> branch is being actively developed right now, I decided not to go into it.
>  
> I have found one e-mail thread on this mailing list about a similar problem 
> (https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In 
> this thread, Ralph Castain states that this is a known issue and suggests 
> that it would be fixed in the then-upcoming v1.3.x release. However, version 1.3 is
> no longer supported and the issue has reappeared, so this did not resolve
> the issue.
>  
> I have created a GitHub gist that contains the output from "ompi_info --all" 
> of all the OpenMPI installations mentioned here, as well as the config.log 
> files for the OpenMPI installations that I compiled from source: 
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
>  
> I have also attached the code for the short program that demonstrates these 
> issues. For good measure, I have included it directly here as well:
>  
> """
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>
>     // Initialize MPI
>     MPI_Init(NULL, NULL);
>
>     // Get parent
>     MPI_Comm parent;
>     MPI_Comm_get_

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Riebs, Andy
Thomas, your test case is somewhat similar to a bash fork() bomb -- not the 
same, but similar. After running one of your failing jobs, you might check to 
see if the “out-of-memory” (“OOM”) killer has been invoked. If it has, that can 
lead to unexpected consequences, such as what you’ve reported.

An easy way to check would be
$ nodes=${ job’s node list }
$ pdsh -w $nodes dmesg -T \| grep \"Out of memory\" 2>/dev/null

Andy

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Thomas Pak
Sent: Saturday, March 16, 2019 4:14 PM
To: Open MPI Users 
Cc: Open MPI Users 
Subject: Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

Dear Jeff,

I did find a way to circumvent this issue for my specific application by 
spawning less frequently. However, I wanted to at least bring attention to this 
issue for the OpenMPI community, as it can be reproduced with an alarmingly 
simple program.

Perhaps the user's mailing list is not the ideal place for this. Would you 
recommend that I report this issue on the developer's mailing list or open a 
GitHub issue?

Best wishes,
Thomas Pak

On Mar 16 2019, at 7:40 pm, Jeff Hammond <jeff.scie...@gmail.com> wrote:
Is there perhaps a different way to solve your problem that doesn’t spawn so 
much as to hit this issue?

I’m not denying there’s an issue here, but in a world of finite human effort 
and fallible software, sometimes it’s easiest to just avoid the bugs altogether.

Jeff

On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak <thomas@maths.ox.ac.uk> wrote:
Dear all,

Does anyone have any clue on what the problem could be here? This seems to be a 
persistent problem present in all currently supported OpenMPI releases and 
indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
process creation.

Best wishes,
Thomas Pak

From: "Thomas Pak" mailto:thomas@maths.ox.ac.uk>>
To: users@lists.open-mpi.org
Sent: Friday, 7 December, 2018 17:51:29
Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

Dear all,

My MPI application spawns a large number of MPI processes using MPI_Comm_spawn 
over its total lifetime. Unfortunately, I have experienced that this results in 
problems for all currently supported OpenMPI versions (2.1, 3.0, 3.1 and 4.0). 
I have written a short, self-contained program in C (included below) that 
spawns child processes using MPI_Comm_spawn in an infinite loop, where each 
child process exits after writing a message to stdout. This short program leads 
to the following issues:

In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
program leads to a pipe leak where pipes keep accumulating over time until my 
MPI application crashes because the maximum number of pipes has been reached.

In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be no 
pipe leak, but the program crashes with the following error message:
PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257

In version 4.0.0 (compiled from source), I have not been able to test this 
issue very thoroughly because mpiexec ignores the --oversubscribe command-line 
flag (as detailed in this GitHub issue 
https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
oversubscription of processor cores, which means that spawning additional 
processes immediately results in an error because "not enough slots" are 
available. A fix for this was proposed recently 
(https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
branch is being actively developed right now, I decided not to go into it.

I have found one e-mail thread on this mailing list about a similar problem 
(https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In this 
thread, Ralph Castain states that this is a known issue and suggests that it would be
fixed in the then-upcoming v1.3.x release. However, version 1.3 is no longer
supported and the issue has reappeared, so this did not resolve the issue.

I have created a GitHub gist that contains the output from "ompi_info --all" of 
all the OpenMPI installations mentioned here, as well as the config.log files 
for the OpenMPI installations that I compiled from source: 
https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.

I have also attached the code for the short program that demonstrates these 
issues. For good measure, I have included it directly here as well:

"""
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

    // Initialize MPI
    MPI_Init(NULL, NULL);

    // Get parent
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    // If the process was not spawned
    if (parent == MPI_COMM_NULL) {

        puts("I was not spawned!");

        // Spawn child process in loop
        char *cmd = argv[0];
        char **cmd_argv = MPI_ARGV_NULL;
        int maxprocs = 1;
        MPI_Info info = MPI_INFO_NULL;
        int root = 0;
        MPI_Comm comm = MPI_COMM_SELF;
        MPI_Comm intercomm;
        int *array_of_errcodes = MPI_ERRCODES_IGNORE;

        for (;;) {
M