[OMPI users] MPI_Comm_spawn issues

2023-04-11 Thread Sergio Iserte via users
Hi,
I am evaluating OpenMPI 5.0.0 and I am experiencing a race condition when
spawning a different number of processes on different nodes.

With:


$cat hostfile
node00
node01
node02
node03


If I run this code:


#include <stdio.h>
#include <stdlib.h>  /* header names were stripped in the archive; stdio.h and mpi.h are required, stdlib.h is a guess */
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm intercomm;
    int final_nranks = 4, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(name, &len);
    MPI_Comm_get_parent(&intercomm);
    if (intercomm == MPI_COMM_NULL) {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hostfile");
        MPI_Info_set(info, "map_by", "ppr:1:node");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, final_nranks, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("PARENT %s\n", name);
    } else {
        printf("CHILD %s\n", name);
    }
    MPI_Finalize();
    return 0;
}


With the command:


$ mpirun  -np 2 --hostfile hostfile --map-by node ./a.out


Sometimes I get this (which is what I wanted, apart from the PMIX errors):


[node00:281361] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 
1034
[node00:281361] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at 
line 1839
PARENT node00
CHILD node00
PARENT node01
CHILD node01
CHILD node02
CHILD node03

However, in other executions I get the following output:

[node00:281468] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 
1034
[node00:281468] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at 
line 1839
--
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-node00-281468@0,0] on node node00
  Remote daemon: [prterun-node00-281468@0,2] on node node01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--
[node00:0] *** An error occurred in Socket closed
[node00:0] *** reported by process [3933011970,0]
[node00:0] *** on a NULL communicator
[node00:0] *** Unknown error
[node00:0] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[node00:0] ***and MPI will try to terminate your MPI job as well)


I also submitted the issue in Github: 
https://github.com/open-mpi/ompi/issues/11421

Any help is appreciated, even in the form of hints for hacking on the parts of
the code that may be causing this issue.

Thank you.


[OMPI users] MPI_COMM_SPAWN() cannot spawn across nodes

2021-12-07 Thread Jarunan Panyasantisuk via users

Hi there,

I have an issue in OpenMPI 4.0.2 and 4.1.1 where MPI_COMM_SPAWN() cannot
spawn across nodes, although I could successfully use this function in
OpenMPI 2.1.1. I am testing on a cluster with CentOS 7.9, the LSF batch
system, and GCC 6.3.0.


I used this code for testing (I called it "spawn_example.c"):

   |#include "mpi.h" #include  #include  #define
   NUM_SPAWNS 3 int main( int argc, char *argv[] ) { int np =
   NUM_SPAWNS; int errcodes[NUM_SPAWNS]; MPI_Comm parentcomm,
   intercomm; MPI_Init( &argc, &argv ); MPI_Comm_get_parent(
   &parentcomm ); if (parentcomm == MPI_COMM_NULL) { /* Create 3 more
   processes - this example must be called spawn_example.exe for this
   to work. */ MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, np,
   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes );
   printf("I'm the parent.\n"); } else { printf("I'm the spawned.\n");
   } fflush(stdout); MPI_Finalize(); return 0; } |

Running on one node, it looked fine:

$ bsub -n 6 -I "mpirun -n 1 spawn_example"
MPI job.
Job <195486300> is submitted to queue .
<>
<>
I'm the spawned.
I'm the spawned.
I'm the spawned.
I'm the parent.

But on 2 nodes, an error occurred:

$ bsub -n 6 -R "span[ptile=3]" -I "mpirun -n 1 spawn_example"
MPI job.
Job <195486678> is submitted to queue .
<>
<>
[eu-a2p-217:30058] pml_ucx.c:175 Error: Failed to receive UCX worker address: Not found (-13)
[eu-a2p-217:30058] [[18089,2],2] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process
is likely to abort. There are many reasons that a parallel process
can fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal
failure; here's some additional information (which may only be
relevant to an Open MPI developer):
  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
[eu-a2p-217:30058] *** An error occurred in MPI_Init
[eu-a2p-217:30058] *** reported by process [1185480706,2]
[eu-a2p-217:30058] *** on a NULL communicator
[eu-a2p-217:30058] *** Unknown error
[eu-a2p-217:30058] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[eu-a2p-217:30058] *** and potentially your MPI job)
[eu-a2p-274:107025] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2147


I would greatly appreciate your advice. I have read threads with similar
questions, but I did not find a solution there.


Best regards,
Jarunan


Re: [OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2020-03-16 Thread Ralph Castain via users
Sorry for the incredibly late reply. Hopefully, you have already managed to 
find the answer.

I'm not sure what your comm_spawn command looks like, but it appears you 
specified the host in it using the "dash_host" info-key, yes? The problem is 
that this is interpreted the same way as the "-host n001.cluster.com" option
on an mpiexec cmd line - which means that it
only allocates _one_ slot to the request. If you are asking to spawn two procs, 
then you don't have adequate resources. One way to check is to only spawn one 
proc with your comm_spawn request and see if that works.

If you want to specify the host, then you need to append the number of slots to 
allocate on that host - e.g., "n001.cluster.com:2".
Of course, you cannot allocate more than the system provided minus the number 
currently in use. There are additional modifiers you can pass to handle 
variable numbers of slots.
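
For illustration, here is a minimal sketch of that suggestion (the worker name "./MyWorker" is a placeholder and the host name is the one from this thread; treat it as an untested sketch, not a definitive recipe):

#include <mpi.h>

/* Sketch: request two slots on the named host by appending ":2" to the
 * "host" info value, instead of the single slot a bare host name would get. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "n001.cluster.com:2");   /* ":2" = two slots on that host */

    MPI_Comm intercomm;
    MPI_Comm_spawn("./MyWorker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}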

HTH
Ralph


On Oct 25, 2019, at 5:30 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

I am trying to launch a number of manager processes, one per node, and then have
each of those managers spawn, on its own node, a number of workers. For
this example,
I have 2 managers and 2 workers per manager.  I'm following the instructions at 
this link
 
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
 to force one manager process per node.
  Here is my PBS/Torque qsub command:
 $ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3  
MyManager.bash
 I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot 
for the manager and the other two for the separately spawned workers).  The 
first character of the "-l" argument
is a lower-case L, not the digit one.
   Here is my mpiexec command within the MyManager.bash script.
 mpiexec --enable-recovery --display-map --display-allocation --mca 
mpi_param_check 1 --v --x DISPLAY --np 2  --map-by ppr:1:node  MyManager.exe
 I expect "--map-by ppr:1:node" to cause OpenMpi to launch exactly one manager 
on each node. 
   When the first worker is spawned via MPI_Comm_spawn(), OpenMpi reports:
 ==   ALLOCATED NODES   ==
    n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=
--
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com  
 Verify that you have mapped the allocated resources properly for the
indicated specification.
--
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes
   In the banner above, it clearly states that node n001 has 3 slots reserved
and only one slot in use at the time of the spawn. I am not sure why it reports
that there are no resources for it.
 I've tried compiling OpenMpi 4.0 both with and without Torque support, and
I've tried using an explicit host file (or not), but the error is unchanged.
Any ideas?
 My cluster is running CentOS 7.4 and I am using the Portland Group C++ 
compiler.



[OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2019-10-25 Thread Mccall, Kurt E. (MSFC-EV41) via users
I am trying to launch a number of manager processes, one per node, and then have
each of those managers spawn, on its own node, a number of workers. For
this example,
I have 2 managers and 2 workers per manager.  I'm following the instructions at 
this link

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn

to force one manager process per node.


Here is my PBS/Torque qsub command:

$ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3  
MyManager.bash

I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot 
for the manager and the other two for the separately spawned workers).  The 
first character of the "-l" argument
is a lower-case L, not the digit one.



Here is my mpiexec command within the MyManager.bash script.

mpiexec --enable-recovery --display-map --display-allocation --mca 
mpi_param_check 1 --v --x DISPLAY --np 2  --map-by ppr:1:node  MyManager.exe

I expect "--map-by ppr:1:node" to cause OpenMpi to launch exactly one manager 
on each node.
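
For reference, a rough sketch (not the poster's actual code) of how a manager might pin its spawned workers to its own node via the "host" info key; note that the processor name has to match the node name in the allocation/hostfile for the mapping to resolve:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical manager sketch: spawn two workers on the manager's own node.
 * "./MyWorker" is a placeholder executable name. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);   /* must match the allocation's name for this node */

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);     /* place the spawned workers on this node */

    MPI_Comm workers;
    MPI_Comm_spawn("./MyWorker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}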



When the first worker is spawned via MPI_Comm_spawn(), OpenMpi reports:

==   ALLOCATED NODES   ==
n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=
--
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com

Verify that you have mapped the allocated resources properly for the
indicated specification.
--
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes



In the banner above, it clearly states that node n001 has 3 slots reserved
and only one slot in use at the time of the spawn. I am not sure why it reports
that there are no resources for it.

I've tried compiling OpenMpi 4.0 both with and without Torque support, and
I've tried using an explicit host file (or not), but the error is unchanged.
Any ideas?

My cluster is running CentOS 7.4 and I am using the Portland Group C++ compiler.


Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-07 Thread Ralph Castain via users
Yeah, we do currently require that to be true. Process mapping is distributed 
across the daemons - i.e., the daemon on each node independently computes the 
map. We have talked about picking up the hostfile on the head node and sending 
out the contents, but haven't implemented that yet.


On Aug 7, 2019, at 2:46 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

Ralph,
 I got MPI_Comm_spawn to work by making sure that the hostfiles on the head 
(where mpiexec is called) and the remote node are identical.  I had assumed 
that only the one on the head node was read by OpenMPI.   Is this correct?
 Thanks,
Kurt
 From: Ralph Castain <r...@open-mpi.org>

Subject: [EXTERNAL] Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already 
filled
 I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?
 https://www.open-mpi.org/nightly/master/
 Ralph
 

On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:
 Hi,
 MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled”.   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?
 For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:
 
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
Here is the full error message.   Note the Max Slots: 0 message therein (?):
 Data for JOB [39020,1] offset 0 Total slots allocated 22
    JOB MAP   
 Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
    Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A
 =
Data for JOB [39020,1] offset 0 Total slots allocated 22
    JOB MAP   
 Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
    Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
 =
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***    and potentially your MPI job)
Here is my mpiexec command:
 mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager
Here is my hostfile “MyNodeFile”:
 n001.cluster.com slots=2 max_slots=2
Here is my SpawnTestManager code:
 
#include <string>
#include <iostream>
#include <cstdio>  /* header names were stripped in the archive; <string> and <iostream> are required, <cstdio> (for sprintf) is a guess */

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    int rank, world_size;
    char *argv2[2];
    MPI_Comm mpi_comm;
    MPI_Info info;
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int host_name_len;

    string worker_cmd = "SpawnTestWorker";
    string host_name = "n001.cluster.com";

    argv2[0] = "dummy_arg";
    argv2[1] = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Get_processor_name(host, &host_name_len);
    cout << "Host name from MPI_Get_processor_name is " << host << endl;

    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name.c_str());
    MPI_Info_set(info, "map-by", info_str);

    MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
                   &mpi_comm, MPI_ERRCODES_IGNORE);
    MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Manager success!" << std::endl;

    MPI_Finalize();
    return 0;
}

   Here is my Spawn

[OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
Ralph,

I got MPI_Comm_spawn to work by making sure that the hostfiles on the head 
(where mpiexec is called) and the remote node are identical.  I had assumed 
that only the one on the head node was read by OpenMPI.   Is this correct?

Thanks,
Kurt

From: Ralph Castain 

Subject: [EXTERNAL] Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already 
filled

I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/

Ralph



On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Hi,

MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled”.   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=
Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile “MyNodeFile”:

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>  /* header names stripped in the archive; <cstdio> is a guess */

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
int rank, world_size;
char *argv2[2];
MPI_Comm mpi_comm;
MPI_Info info;
char host[MPI_MAX_PROCESSOR_NAME + 1];
int host_name_len;

string worker_cmd = "SpawnTestWorker";
string host_name = "n001.cluster.com";

argv2[0] = "dummy_arg";
arg

Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Ralph Castain via users
I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/

Ralph


On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Hi,
 MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled”.   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?
 For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:
 
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
Here is the full error message.   Note the Max Slots: 0 message therein (?):
 Data for JOB [39020,1] offset 0 Total slots allocated 22
    JOB MAP   
 Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
    Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A
 =
Data for JOB [39020,1] offset 0 Total slots allocated 22
    JOB MAP   
 Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
    Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
 =
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***    and potentially your MPI job)
Here is my mpiexec command:
 mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager
Here is my hostfile “MyNodeFile”:
 n001.cluster.com slots=2 max_slots=2
Here is my SpawnTestManager code:
 
#include <string>
#include <iostream>
#include <cstdio>  /* header names stripped in the archive; <cstdio> is a guess */
 #ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"
 using std::string;
using std::cout;
using std::endl;
 int main(int argc, char *argv[])
{
    int rank, world_size;
    char *argv2[2];
    MPI_Comm mpi_comm;
    MPI_Info info;
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int host_name_len;
 string worker_cmd = "SpawnTestWorker";
    string host_name = "n001.cluster.com";
 argv2[0] = "dummy_arg";
    argv2[1] = NULL;
 MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
 MPI_Get_processor_name(host, &host_name_len);
    cout << "Host name from MPI_Get_processor_name is " << host << endl;
    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name.c_str());
    MPI_Info_set(info, "map-by", info_str);
 MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
    &mpi_comm, MPI_ERRCODES_IGNORE);
    MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);
 std::cout << "Manager success!" << std::endl;
 MPI_Finalize();
    return 0;
}

   Here is my SpawnTestWorker code:
 
#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include <iostream>
 int main(int argc, char *argv[])
{
    int world_size, rank;
    MPI_Comm manager_intercom;
 MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
 MPI_Comm_get_parent(&manager_intercom);
    MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);
 std::cout << "Worker success!" << std::endl;
 MPI_Finalize();
    return 0;
}

 My config.log can be found here:  
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js
 I’ve attached the other info requested on the help page, except the output
of "ompi_info -v ompi full --parsable".   My version of ompi_info doesn’t 
accept the “ompi full” arguments, and the “-all” arg doesn’t produce much 
output.
 Thanks for your help,
Kurt

[OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Mccall, Kurt E. (MSFC-EV41) via users
Hi,

MPI_Comm_spawn() is failing with the error message "All nodes which are 
allocated for this job are already filled".   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001Num slots: 2Max slots: 2Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=
Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001Num slots: 2Max slots: 0Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile "MyNodeFile":

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>  /* header names stripped in the archive; <cstdio> is a guess */

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
int rank, world_size;
char *argv2[2];
MPI_Comm mpi_comm;
MPI_Info info;
char host[MPI_MAX_PROCESSOR_NAME + 1];
int host_name_len;

string worker_cmd = "SpawnTestWorker";
string host_name = "n001.cluster.com";

argv2[0] = "dummy_arg";
argv2[1] = NULL;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Get_processor_name(host, &host_name_len);
cout << "Host name from MPI_Get_processor_name is " << host << endl;

   char info_str[64];
sprintf(info_str, "ppr:%d:node", 1);
MPI_Info_create(&info);
MPI_Info_set(info, "host", host_name.c_str());
MPI_Info_set(info, "map-by", info_str);

MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
&mpi_comm, MPI_ERRCODES_IGNORE);
MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

std::cout << "Manager success!" << std::endl;

MPI_Finalize();
return 0;
}




Here is my SpawnTestWorker code:


#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include <iostream>

int main(int argc, char *argv[])
{
int world_size, rank;
MPI_Comm manager_intercom;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Comm_get_parent(&manager_intercom);
MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);

std::cout << "Worker success!" << std::endl;

MPI_Finalize();
return 0;
}


My config.log can be found here:  
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js

I've attached the other info requested on the help page, except the output
of "ompi_info -v ompi full --parsable".   My version of ompi_info doesn't 
accept the "ompi full" arguments, and the "-all" arg doesn't produce much 
output.

Thanks for your help,
Kurt










Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Jeff Hammond
If you’re reporting a bug and have a reproducer, I recommend creating a
GitHub issue and only posting on the user list if you don’t get the
attention you want there.

Best,

Jeff

On Sat, Mar 16, 2019 at 1:16 PM Thomas Pak 
wrote:

> Dear Jeff,
>
> I did find a way to circumvent this issue for my specific application by
> spawning less frequently. However, I wanted to at least bring attention to
> this issue for the OpenMPI community, as it can be reproduced with an
> alarmingly simple program.
>
> Perhaps the user's mailing list is not the ideal place for this. Would you
> recommend that I report this issue on the developer's mailing list or open
> a GitHub issue?
>
> Best wishes,
> Thomas Pak
>
> On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
>
> Is there perhaps a different way to solve your problem that doesn’t spawn
> so much as to hit this issue?
>
> I’m not denying there’s an issue here, but in a world of finite human
> effort and fallible software, sometimes it’s easiest to just avoid the bugs
> altogether.
>
> Jeff
>
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak 
> wrote:
>
> Dear all,
>
> Does anyone have any clue on what the problem could be here? This seems to
> be a persistent problem present in all currently supported OpenMPI releases
> and indicates that there is a fundamental flaw in how OpenMPI handles
> dynamic process creation.
>
> Best wishes,
> Thomas Pak
>
> *From: *"Thomas Pak" 
> *To: *users@lists.open-mpi.org
> *Sent: *Friday, 7 December, 2018 17:51:29
> *Subject: *[OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>
> Dear all,
>
> My MPI application spawns a large number of MPI processes using
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced
> that this results in problems for all currently supported OpenMPI versions
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in
> C (included below) that spawns child processes using MPI_Comm_spawn in an
> infinite loop, where each child process exits after writing a message to
> stdout. This short program leads to the following issues:
>
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the
> program leads to a pipe leak where pipes keep accumulating over time until
> my MPI application crashes because the maximum number of pipes has been
> reached.
>
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to
> be no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>
> In version 4.0.0 (compiled from source), I have not been able to test this
> issue very thoroughly because mpiexec ignores the --oversubscribe
> command-line flag (as detailed in this GitHub issue
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the
> oversubscription of processor cores, which means that spawning additional
> processes immediately results in an error because "not enough slots" are
> available. A fix for this was proposed recently (
> https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x
> developer branch is being actively developed right now, I decided not to go
> into it.
>
> I have found one e-mail thread on this mailing list about a similar
> problem (
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In
> this thread, Ralph Castain states that this is a known issue and suggests
> that it is fixed in the then upcoming v1.3.x release. However, version 1.3
> is no longer supported and the issue has reappeared, hence this did not
> solve the issue.
>
> I have created a GitHub gist that contains the output from "ompi_info
> --all" of all the OpenMPI installations mentioned here, as well as the
> config.log files for the OpenMPI installations that I compiled from source:
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
>
> I have also attached the code for the short program that demonstrates
> these issues. For good measure, I have included it directly here as well:
>
> """
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>
> // Initialize MPI
> MPI_Init(NULL, NULL);
>
> // Get parent
> MPI_Comm parent;
> MPI_Comm_get_parent(&parent);
>
> // If the process was not spawned
> if (parent == MPI_COMM_NULL) {
>
> puts("I was not spawned!");
>
> // Spawn child process in loop
> char *cmd = argv[0];
> char **cmd_argv = MPI_ARGV_NULL;
> int maxprocs = 1;
> MPI_Info info = MPI_INFO_NULL;
> int root = 0;
> MPI_Comm comm = MPI_COMM_SELF;
> MPI_Comm

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Gilles Gouaillardet
FWIW I could observe some memory leaks on both mpirun and MPI task 0 with the 
latest master branch.

So I guess mileage varies depending on available RAM and number of iterations.

Sent from my iPod

> On Mar 17, 2019, at 20:47, Riebs, Andy  wrote:
> 
> Thomas, your test case is somewhat similar to a bash fork() bomb -- not the 
> same, but similar. After running one of your failing jobs, you might check to 
> see if the “out-of-memory” (“OOM”) killer has been invoked. If it has, that 
> can lead to unexpected consequences, such as what you’ve reported.
>  
> An easy way to check would be
> $ nodes=${ job’s node list }
> $ pdsh -w $nodes dmesg -T \| grep \"Out of memory\" 2>/dev/null
>  
> Andy
>  
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Thomas Pak
> Sent: Saturday, March 16, 2019 4:14 PM
> To: Open MPI Users 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear Jeff,
>  
> I did find a way to circumvent this issue for my specific application by 
> spawning less frequently. However, I wanted to at least bring attention to 
> this issue for the OpenMPI community, as it can be reproduced with an 
> alarmingly simple program.
>  
> Perhaps the user's mailing list is not the ideal place for this. Would you 
> recommend that I report this issue on the developer's mailing list or open a 
> GitHub issue?
>  
> Best wishes,
> Thomas Pak
>  
> On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
> Is there perhaps a different way to solve your problem that doesn’t spawn so 
> much as to hit this issue?
>  
> I’m not denying there’s an issue here, but in a world of finite human effort 
> and fallible software, sometimes it’s easiest to just avoid the bugs 
> altogether.
>  
> Jeff
>  
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak  wrote:
> Dear all,
>  
> Does anyone have any clue on what the problem could be here? This seems to be 
> a persistent problem present in all currently supported OpenMPI releases and 
> indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
> process creation.
>  
> Best wishes,
> Thomas Pak
>  
> From: "Thomas Pak" 
> To: users@lists.open-mpi.org
> Sent: Friday, 7 December, 2018 17:51:29
> Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>  
> Dear all,
>  
> My MPI application spawns a large number of MPI processes using 
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced 
> that this results in problems for all currently supported OpenMPI versions 
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in C 
> (included below) that spawns child processes using MPI_Comm_spawn in an 
> infinite loop, where each child process exits after writing a message to 
> stdout. This short program leads to the following issues:
>  
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
> program leads to a pipe leak where pipes keep accumulating over time until my 
> MPI application crashes because the maximum number of pipes has been reached.
>  
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be 
> no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>  
> In version 4.0.0 (compiled from source), I have not been able to test this 
> issue very thoroughly because mpiexec ignores the --oversubscribe 
> command-line flag (as detailed in this GitHub issue 
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
> oversubscription of processor cores, which means that spawning additional 
> processes immediately results in an error because "not enough slots" are 
> available. A fix for this was proposed recently 
> (https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
> branch is being actively developed right now, I decided not to go into it.
>  
> I have found one e-mail thread on this mailing list about a similar problem 
> (https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In 
> this thread, Ralph Castain states that this is a known issue and suggests 
> that it is fixed in the then upcoming v1.3.x release. However, version 1.3 is 
> no longer supported and the issue has reappeared, hence this did not solve 
> the issue.
>  
> I have created a GitHub gist that contains the output from "ompi_info --all" 
> of all the OpenMPI installations mentioned here, as well as the config.log 
> files for the OpenMPI installations that I compiled from source: 
> https://gist.github.com/ThomasPak/

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-17 Thread Riebs, Andy
Thomas, your test case is somewhat similar to a bash fork() bomb -- not the 
same, but similar. After running one of your failing jobs, you might check to 
see if the “out-of-memory” (“OOM”) killer has been invoked. If it has, that can 
lead to unexpected consequences, such as what you’ve reported.

An easy way to check would be
$ nodes=${ job’s node list }
$ pdsh -w $nodes dmesg -T \| grep \"Out of memory\" 2>/dev/null

Andy

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Thomas Pak
Sent: Saturday, March 16, 2019 4:14 PM
To: Open MPI Users 
Cc: Open MPI Users 
Subject: Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

Dear Jeff,

I did find a way to circumvent this issue for my specific application by 
spawning less frequently. However, I wanted to at least bring attention to this 
issue for the OpenMPI community, as it can be reproduced with an alarmingly 
simple program.

Perhaps the user's mailing list is not the ideal place for this. Would you 
recommend that I report this issue on the developer's mailing list or open a 
GitHub issue?

Best wishes,
Thomas Pak

On Mar 16 2019, at 7:40 pm, Jeff Hammond 
<jeff.scie...@gmail.com> wrote:
Is there perhaps a different way to solve your problem that doesn’t spawn so 
much as to hit this issue?

I’m not denying there’s an issue here, but in a world of finite human effort 
and fallible software, sometimes it’s easiest to just avoid the bugs altogether.

Jeff

On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak 
<thomas@maths.ox.ac.uk> wrote:
Dear all,

Does anyone have any clue on what the problem could be here? This seems to be a 
persistent problem present in all currently supported OpenMPI releases and 
indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
process creation.

Best wishes,
Thomas Pak

From: "Thomas Pak" mailto:thomas@maths.ox.ac.uk>>
To: users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>
Sent: Friday, 7 December, 2018 17:51:29
Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

Dear all,

My MPI application spawns a large number of MPI processes using MPI_Comm_spawn 
over its total lifetime. Unfortunately, I have experienced that this results in 
problems for all currently supported OpenMPI versions (2.1, 3.0, 3.1 and 4.0). 
I have written a short, self-contained program in C (included below) that 
spawns child processes using MPI_Comm_spawn in an infinite loop, where each 
child process exits after writing a message to stdout. This short program leads 
to the following issues:

In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
program leads to a pipe leak where pipes keep accumulating over time until my 
MPI application crashes because the maximum number of pipes has been reached.

In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be no 
pipe leak, but the program crashes with the following error message:
PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257

In version 4.0.0 (compiled from source), I have not been able to test this 
issue very thoroughly because mpiexec ignores the --oversubscribe command-line 
flag (as detailed in this GitHub issue 
https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
oversubscription of processor cores, which means that spawning additional 
processes immediately results in an error because "not enough slots" are 
available. A fix for this was proposed recently 
(https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
branch is being actively developed right now, I decided not to go into it.

I have found one e-mail thread on this mailing list about a similar problem 
(https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In this 
thread, Ralph Castain states that this is a known issue and suggests that it is 
fixed in the then upcoming v1.3.x release. However, version 1.3 is no longer 
supported and the issue has reappeared, hence this did not solve the issue.

I have created a GitHub gist that contains the output from "ompi_info --all" of 
all the OpenMPI installations mentioned here, as well as the config.log files 
for the OpenMPI installations that I compiled from source: 
https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.

I have also attached the code for the short program that demonstrates these 
issues. For good measure, I have included it directly here as well:

"""
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

// Initialize MPI
MPI_Init(NULL, NULL);

// Get parent
MPI_Comm parent;
MPI_Comm_get_parent(&parent);

// If the process was not spawned
if (parent == MPI_COMM_NULL) {

puts("I was not spawned!");

// Spawn child process in loop
char *cmd = argv[0];
char **cmd_argv = MPI_ARGV_NULL;
int maxprocs = 1;
MPI_Info info = MPI_INFO_NULL;
int root 

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Ralph H Castain
FWIW: I just ran a cycle of 10,000 spawns on my Mac without a problem using 
OMPI master, so I believe this has been resolved. I don’t know if/when the 
required updates might come into the various release branches.

Ralph


> On Mar 16, 2019, at 1:13 PM, Thomas Pak  wrote:
> 
> Dear Jeff,
> 
> I did find a way to circumvent this issue for my specific application by 
> spawning less frequently. However, I wanted to at least bring attention to 
> this issue for the OpenMPI community, as it can be reproduced with an 
> alarmingly simple program.
> 
> Perhaps the user's mailing list is not the ideal place for this. Would you 
> recommend that I report this issue on the developer's mailing list or open a 
> GitHub issue?
> 
> Best wishes,
> Thomas Pak
> 
> On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
> Is there perhaps a different way to solve your problem that doesn’t spawn so 
> much as to hit this issue?
> 
> I’m not denying there’s an issue here, but in a world of finite human effort 
> and fallible software, sometimes it’s easiest to just avoid the bugs 
> altogether.
> 
> Jeff
> 
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak <thomas@maths.ox.ac.uk> wrote:
> Dear all,
> 
> Does anyone have any clue on what the problem could be here? This seems to be 
> a persistent problem present in all currently supported OpenMPI releases and 
> indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
> process creation.
> 
> Best wishes,
> Thomas Pak
> 
> From: "Thomas Pak"  <mailto:thomas@maths.ox.ac.uk>>
> To: users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>
> Sent: Friday, 7 December, 2018 17:51:29
> Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
> 
> Dear all,
> 
> My MPI application spawns a large number of MPI processes using 
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced 
> that this results in problems for all currently supported OpenMPI versions 
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in C 
> (included below) that spawns child processes using MPI_Comm_spawn in an 
> infinite loop, where each child process exits after writing a message to 
> stdout. This short program leads to the following issues:
> 
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
> program leads to a pipe leak where pipes keep accumulating over time until my 
> MPI application crashes because the maximum number of pipes has been reached.
> 
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be 
> no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
> 
> In version 4.0.0 (compiled from source), I have not been able to test this 
> issue very thoroughly because mpiexec ignores the --oversubscribe 
> command-line flag (as detailed in this GitHub issue 
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the
> oversubscription of processor cores, which means that spawning additional 
> processes immediately results in an error because "not enough slots" are 
> available. A fix for this was proposed recently 
> (https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer
> branch is being actively developed right now, I decided not to go into it.
> 
> I have found one e-mail thread on this mailing list about a similar problem 
> (https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In
> this thread, Ralph Castain states that this is a known issue and suggests 
> that it is fixed in the then upcoming v1.3.x release. However, version 1.3 is 
> no longer supported and the issue has reappeared, hence this did not solve 
> the issue.
> 
> I have created a GitHub gist that contains the output from "ompi_info --all" 
> of all the OpenMPI installations mentioned here, as well as the config.log 
> files for the OpenMPI installations that I compiled from source: 
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
> 
> I have also attached the code for the short program that demonstrates these 
> issues. For good measure, I have included it directly here as well:
> 
> """
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char *argv[]) {
> 
> // Initialize MPI
> MPI_Init(NULL, NULL);
>

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Thomas Pak
Dear Jeff,

I did find a way to circumvent this issue for my specific application by 
spawning less frequently. However, I wanted to at least bring attention to this 
issue for the OpenMPI community, as it can be reproduced with an alarmingly 
simple program.
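
(For what it's worth, one way to "spawn less frequently" is to batch several workers into a single MPI_Comm_spawn call; the sketch below is only a guess at that kind of restructuring, not the poster's actual change, and the batch and loop counts are arbitrary.)

#include <mpi.h>
#include <stdio.h>

/* Hypothetical sketch: request a batch of workers per MPI_Comm_spawn call
 * instead of one worker per call, so spawn is invoked far less often. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        const int batch = 8;              /* arbitrary batch size */
        for (int i = 0; i < 4; i++) {     /* a few batches instead of an infinite loop */
            MPI_Comm intercomm;
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, batch, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&intercomm);
        }
    } else {
        puts("I was spawned!");
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}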
Perhaps the user's mailing list is not the ideal place for this. Would you 
recommend that I report this issue on the developer's mailing list or open a 
GitHub issue?
Best wishes,
Thomas Pak

On Mar 16 2019, at 7:40 pm, Jeff Hammond  wrote:
> Is there perhaps a different way to solve your problem that doesn’t spawn so 
> much as to hit this issue?
>
>
> I’m not denying there’s an issue here, but in a world of finite human effort 
> and fallible software, sometimes it’s easiest to just avoid the bugs 
> altogether.
>
> Jeff
>
> On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak <thomas@maths.ox.ac.uk> wrote:
> > Dear all,
> >
> > Does anyone have any clue on what the problem could be here? This seems to 
> > be a persistent problem present in all currently supported OpenMPI releases 
> > and indicates that there is a fundamental flaw in how OpenMPI handles 
> > dynamic process creation.
> >
> > Best wishes,
> > Thomas Pak
> >
> >
> > From: "Thomas Pak"  > (mailto:thomas@maths.ox.ac.uk)>
> > To: users@lists.open-mpi.org (mailto:users@lists.open-mpi.org)
> > Sent: Friday, 7 December, 2018 17:51:29
> > Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
> >
> >
> >
> >
> > Dear all,
> > My MPI application spawns a large number of MPI processes using 
> > MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced 
> > that this results in problems for all currently supported OpenMPI versions 
> > (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in 
> > C (included below) that spawns child processes using MPI_Comm_spawn in an 
> > infinite loop, where each child process exits after writing a message to 
> > stdout. This short program leads to the following issues:
> > In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
> > program leads to a pipe leak where pipes keep accumulating over time until 
> > my MPI application crashes because the maximum number of pipes has been 
> > reached.
> > In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to 
> > be no pipe leak, but the program crashes with the following error message:
> > PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
> >
> > In version 4.0.0 (compiled from source), I have not been able to test this 
> > issue very thoroughly because mpiexec ignores the --oversubscribe 
> > command-line flag (as detailed in this GitHub issue 
> > https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
> > oversubscription of processor cores, which means that spawning additional 
> > processes immediately results in an error because "not enough slots" are 
> > available. A fix for this was proposed recently 
> > (https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x 
> > developer branch is being actively developed right now, I decided not to go
> > into it.
> > I have found one e-mail thread on this mailing list about a similar problem 
> > (https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In 
> > this thread, Ralph Castain states that this is a known issue and suggests 
> > that it is fixed in the then upcoming v1.3.x release. However, version 1.3 
> > is no longer supported and the issue has reappeared, hence this did not 
> > solve the issue.
> > I have created a GitHub gist that contains the output from "ompi_info 
> > --all" of all the OpenMPI installations mentioned here, as well as the 
> > config.log files for the OpenMPI installations that I compiled from source: 
> > https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
> > I have also attached the code for the short program that demonstrates these 
> > issues. For good measure, I have included it directly here as well:
> > """
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char *argv[]) {
> > // Initialize MPI
> > MPI_Init(NULL, NULL);
> >
> > // Get parent
> > MPI_Comm parent;
> > MPI_Comm_get_parent(&parent);
> >
> > // If the process was not spawned
> > if (parent == MPI_COMM_NULL) {
> >
> > puts("I was not spawned!");
> > // Spawn child process in loop
> > char *cmd = argv[0];
> > char **cmd_argv = MPI_ARG

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Jeff Hammond
Is there perhaps a different way to solve your problem that doesn’t spawn
so much as to hit this issue?

I’m not denying there’s an issue here, but in a world of finite human
effort and fallible software, sometimes it’s easiest to just avoid the bugs
altogether.

Jeff

On Sat, Mar 16, 2019 at 12:11 PM Thomas Pak 
wrote:

> Dear all,
>
> Does anyone have any clue on what the problem could be here? This seems to
> be a persistent problem present in all currently supported OpenMPI releases
> and indicates that there is a fundamental flaw in how OpenMPI handles
> dynamic process creation.
>
> Best wishes,
> Thomas Pak
>
> --
> *From: *"Thomas Pak" 
> *To: *users@lists.open-mpi.org
> *Sent: *Friday, 7 December, 2018 17:51:29
> *Subject: *[OMPI users] MPI_Comm_spawn leads to pipe leak and other errors
>
> Dear all,
>
> My MPI application spawns a large number of MPI processes using
> MPI_Comm_spawn over its total lifetime. Unfortunately, I have experienced
> that this results in problems for all currently supported OpenMPI versions
> (2.1, 3.0, 3.1 and 4.0). I have written a short, self-contained program in
> C (included below) that spawns child processes using MPI_Comm_spawn in an
> infinite loop, where each child process exits after writing a message to
> stdout. This short program leads to the following issues:
>
> In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the
> program leads to a pipe leak where pipes keep accumulating over time until
> my MPI application crashes because the maximum number of pipes has been
> reached.
>
> In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to
> be no pipe leak, but the program crashes with the following error message:
> PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257
>
> In version 4.0.0 (compiled from source), I have not been able to test this
> issue very thoroughly because mpiexec ignores the --oversubscribe
> command-line flag (as detailed in this GitHub issue
> https://github.com/open-mpi/ompi/issues/6130). This prohibits the
> oversubscription of processor cores, which means that spawning additional
> processes immediately results in an error because "not enough slots" are
> available. A fix for this was proposed recently (
> https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x
> developer branch is being actively developed right now, I decided not to go
> into it.
>
> I have found one e-mail thread on this mailing list about a similar
> problem (
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In
> this thread, Ralph Castain states that this is a known issue and suggests
> that it is fixed in the then upcoming v1.3.x release. However, version 1.3
> is no longer supported and the issue has reappeared, hence this did not
> solve the issue.
>
> I have created a GitHub gist that contains the output from "ompi_info
> --all" of all the OpenMPI installations mentioned here, as well as the
> config.log files for the OpenMPI installations that I compiled from source:
> https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
>
> I have also attached the code for the short program that demonstrates
> these issues. For good measure, I have included it directly here as well:
>
> """
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>
> // Initialize MPI
> MPI_Init(NULL, NULL);
>
> // Get parent
> MPI_Comm parent;
> MPI_Comm_get_parent(&parent);
>
> // If the process was not spawned
> if (parent == MPI_COMM_NULL) {
>
> puts("I was not spawned!");
>
> // Spawn child process in loop
> char *cmd = argv[0];
> char **cmd_argv = MPI_ARGV_NULL;
> int maxprocs = 1;
> MPI_Info info = MPI_INFO_NULL;
> int root = 0;
> MPI_Comm comm = MPI_COMM_SELF;
> MPI_Comm intercomm;
> int *array_of_errcodes = MPI_ERRCODES_IGNORE;
>
> for (;;) {
> MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
> &intercomm, array_of_errcodes);
>
> MPI_Comm_disconnect(&intercomm);
> }
>
> // If process was spawned
> } else {
>
> puts("I was spawned!");
>
> MPI_Comm_disconnect(&parent);
> }
>
> // Finalize
> MPI_Finalize();
>
> }
> """
>
> Thanks in advance and best wishes,
> Thomas Pak
>

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2019-03-16 Thread Thomas Pak
Dear all, 

Does anyone have any clue on what the problem could be here? This seems to be a 
persistent problem present in all currently supported OpenMPI releases and 
indicates that there is a fundamental flaw in how OpenMPI handles dynamic 
process creation. 

Best wishes, 
Thomas Pak 


From: "Thomas Pak"  
To: users@lists.open-mpi.org 
Sent: Friday, 7 December, 2018 17:51:29 
Subject: [OMPI users] MPI_Comm_spawn leads to pipe leak and other errors 

Dear all, 

My MPI application spawns a large number of MPI processes using MPI_Comm_spawn 
over its total lifetime. Unfortunately, I have experienced that this results in 
problems for all currently supported OpenMPI versions (2.1, 3.0, 3.1 and 4.0). 
I have written a short, self-contained program in C (included below) that 
spawns child processes using MPI_Comm_spawn in an infinite loop, where each 
child process exits after writing a message to stdout. This short program leads 
to the following issues: 

In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
program leads to a pipe leak where pipes keep accumulating over time until my 
MPI application crashes because the maximum number of pipes has been reached. 

In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be no 
pipe leak, but the program crashes with the following error message: 
PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257 

In version 4.0.0 (compiled from source), I have not been able to test this 
issue very thoroughly because mpiexec ignores the --oversubscribe command-line 
flag (as detailed in this GitHub issue: 
https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
oversubscription of processor cores, which means that spawning additional 
processes immediately results in an error because "not enough slots" are 
available. A fix for this was proposed recently 
(https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
branch is being actively developed right now, I decided not to go into it. 

I have found one e-mail thread on this mailing list about a similar problem 
(https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In 
this thread, Ralph Castain states that this is a known issue and suggests that 
it is fixed in the then upcoming v1.3.x release. However, version 1.3 is no 
longer supported and the issue has reappeared, hence this did not solve the 
issue. 

I have created a GitHub gist that contains the output from "ompi_info --all" of 
all the OpenMPI installations mentioned here, as well as the config.log files 
for the OpenMPI installations that I compiled from source: 
https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c. 

I have also attached the code for the short program that demonstrates these 
issues. For good measure, I have included it directly here as well: 

""" 
#include <mpi.h> 
#include <stdio.h> 

int main(int argc, char *argv[]) { 

// Initialize MPI 
MPI_Init(NULL, NULL); 

// Get parent 
MPI_Comm parent; 
MPI_Comm_get_parent(&parent); 

// If the process was not spawned 
if (parent == MPI_COMM_NULL) { 

puts("I was not spawned!"); 

// Spawn child process in loop 
char *cmd = argv[0]; 
char **cmd_argv = MPI_ARGV_NULL; 
int maxprocs = 1; 
MPI_Info info = MPI_INFO_NULL; 
int root = 0; 
MPI_Comm comm = MPI_COMM_SELF; 
MPI_Comm intercomm; 
int *array_of_errcodes = MPI_ERRCODES_IGNORE; 

for (;;) { 
MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm, 
&intercomm, array_of_errcodes); 

MPI_Comm_disconnect(&intercomm); 
} 

// If process was spawned 
} else { 

puts("I was spawned!"); 

MPI_Comm_disconnect(&parent); 
} 

// Finalize 
MPI_Finalize(); 

} 
""" 

Thanks in advance and best wishes, 
Thomas Pak 

___ 
users mailing list 
users@lists.open-mpi.org 
https://lists.open-mpi.org/mailman/listinfo/users 
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] MPI_Comm_spawn leads to pipe leak and other errors

2018-12-07 Thread Thomas Pak
Dear all,

My MPI application spawns a large number of MPI processes using MPI_Comm_spawn 
over its total lifetime. Unfortunately, I have experienced that this results in 
problems for all currently supported OpenMPI versions (2.1, 3.0, 3.1 and 4.0). 
I have written a short, self-contained program in C (included below) that 
spawns child processes using MPI_Comm_spawn in an infinite loop, where each 
child process exits after writing a message to stdout. This short program leads 
to the following issues:
In versions 2.1.2 (Ubuntu package) and 2.1.5 (compiled from source), the 
program leads to a pipe leak where pipes keep accumulating over time until my 
MPI application crashes because the maximum number of pipes has been reached.
In versions 3.0.3 and 3.1.3 (both compiled from source), there appears to be no 
pipe leak, but the program crashes with the following error message:
PMIX_ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1257

In version 4.0.0 (compiled from source), I have not been able to test this 
issue very thoroughly because mpiexec ignores the --oversubscribe command-line 
flag (as detailed in this GitHub issue 
https://github.com/open-mpi/ompi/issues/6130). This prohibits the 
oversubscription of processor cores, which means that spawning additional 
processes immediately results in an error because "not enough slots" are 
available. A fix for this was proposed recently 
(https://github.com/open-mpi/ompi/pull/6139), but since the v4.0.x developer 
branch is being actively developed right now, I decided not to go into it.
I have found one e-mail thread on this mailing list about a similar problem 
(https://www.mail-archive.com/users@lists.open-mpi.org/msg10543.html). In this 
thread, Ralph Castain states that this is a known issue and suggests that it is 
fixed in the then upcoming v1.3.x release. However, version 1.3 is no longer 
supported and the issue has reappeared, hence this did not solve the issue.
I have created a GitHub gist that contains the output from "ompi_info --all" of 
all the OpenMPI installations mentioned here, as well as the config.log files 
for the OpenMPI installations that I compiled from source: 
https://gist.github.com/ThomasPak/1003160e396bb88dff27e53c53121e0c.
I have also attached the code for the short program that demonstrates these 
issues. For good measure, I have included it directly here as well:
"""
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
// Initialize MPI
MPI_Init(NULL, NULL);

// Get parent
MPI_Comm parent;
MPI_Comm_get_parent(&parent);

// If the process was not spawned
if (parent == MPI_COMM_NULL) {

puts("I was not spawned!");
// Spawn child process in loop
char *cmd = argv[0];
char **cmd_argv = MPI_ARGV_NULL;
int maxprocs = 1;
MPI_Info info = MPI_INFO_NULL;
int root = 0;
MPI_Comm comm = MPI_COMM_SELF;
MPI_Comm intercomm;
int *array_of_errcodes = MPI_ERRCODES_IGNORE;

for (;;) {
MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
&intercomm, array_of_errcodes);

MPI_Comm_disconnect(&intercomm);
}

// If process was spawned
} else {

puts("I was spawned!");
MPI_Comm_disconnect(&parent);
}

// Finalize
MPI_Finalize();

}
"""

Thanks in advance and best wishes,
Thomas Pak


mpi-spawn.c
Description: Binary data
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn question

2017-02-04 Thread Gilles Gouaillardet
Andrew,

the 2-second timeout is very likely a bug that has since been fixed, so I
strongly suggest you try the latest 2.0.2, which was released earlier this
week.

Ralph is referring to another timeout, which is hard-coded (FWIW, the MPI
standard says nothing about timeouts, so we hard-coded one to prevent jobs
from hanging forever) to 600 seconds in master but is still 60 seconds in
the v2.0.x branch.
IIRC, the hard-coded timeout is in MPI_Comm_{accept,connect}, and I do not
know whether it is somehow involved in MPI_Comm_spawn.

Cheers,

Gilles

On Saturday, February 4, 2017, r...@open-mpi.org  wrote:

> We know v2.0.1 has problems with comm_spawn, and so you may be
> encountering one of those. Regardless, there is indeed a timeout mechanism
> in there. It was added because people would execute a comm_spawn, and then
> would hang and eat up their entire allocation time for nothing.
>
> In v2.0.2, I see it is still hardwired at 60 seconds. I believe we
> eventually realized we needed to make that a variable, but it didn’t get
> into the 2.0.2 release.
>
>
> > On Feb 1, 2017, at 1:00 AM, elistrato...@info.sgu.ru 
> wrote:
> >
> > I am using Open MPI version 2.0.1.
> > ___
> > users mailing list
> > users@lists.open-mpi.org 
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn question

2017-02-03 Thread r...@open-mpi.org
We know v2.0.1 has problems with comm_spawn, and so you may be encountering one 
of those. Regardless, there is indeed a timeout mechanism in there. It was 
added because people would execute a comm_spawn, and then would hang and eat up 
their entire allocation time for nothing.

In v2.0.2, I see it is still hardwired at 60 seconds. I believe we eventually 
realized we needed to make that a variable, but it didn’t get into the 2.0.2 
release.


> On Feb 1, 2017, at 1:00 AM, elistrato...@info.sgu.ru wrote:
> 
> I am using Open MPI version 2.0.1.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn question

2017-02-01 Thread elistratovaa
I am using Open MPI version 2.0.1.
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] MPI_Comm_spawn question

2017-01-31 Thread r...@open-mpi.org
What version of OMPI are you using?

> On Jan 31, 2017, at 7:33 AM, elistrato...@info.sgu.ru wrote:
> 
> Hi,
> 
> I am trying to write trivial master-slave program. Master simply creates
> slaves, sends them a string, they print it out and exit. Everything works
> just fine, however, when I add a delay (more than 2 sec) before calling
> MPI_Init on slave, MPI fails with MPI_ERR_SPAWN. I am pretty sure that
> MPI_Comm_spawn has some kind of timeout on waiting for slaves to call
> MPI_Init, and if they fail to respond in time, it returns an error.
> 
> I believe there is a way to change this behaviour, but I wasn't able to
> find any suggestions/ideas in the internet.
> I would appreciate if someone could help with this.
> 
> ---
> --- terminal command i use to run program:
> mpirun -n 1 hello 2 2 // the first argument to "hello" is number of
> slaves, the second is delay in seconds
> 
> --- Error message I get when delay is >=2 sec:
> [host:2231] *** An error occurred in MPI_Comm_spawn
> [host:2231] *** reported by process [3453419521,0]
> [host:2231] *** on communicator MPI_COMM_SELF
> [host:2231] *** MPI_ERR_SPAWN: could not spawn processes
> [host:2231] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
> now abort,
> [host:2231] ***and potentially your MPI job)
> 
> --- The program itself:
> #include "stdlib.h"
> #include "stdio.h"
> #include "mpi.h"
> #include "unistd.h"
> 
> MPI_Comm slave_comm;
> MPI_Comm new_world;
> #define MESSAGE_SIZE 40
> 
> void slave() {
>   printf("Slave initialized; ");
>   MPI_Comm_get_parent(&slave_comm);
>   MPI_Intercomm_merge(slave_comm, 1, &new_world);
> 
>   int slave_rank;
>   MPI_Comm_rank(new_world, &slave_rank);
> 
>   char message[MESSAGE_SIZE];
>   MPI_Bcast(message, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
> 
>   printf("Slave %d received message from master: %s\n", slave_rank, 
> message);
> }
> 
> void master(int slave_count, char* executable, char* delay) {
>   char* slave_argv[] = { delay, NULL };
>   MPI_Comm_spawn( executable,
>   slave_argv,
>   slave_count,
>   MPI_INFO_NULL,
>   0,
>   MPI_COMM_SELF,
>   &slave_comm,
>   MPI_ERRCODES_IGNORE);
>   MPI_Intercomm_merge(slave_comm, 0, &new_world);
>   char* helloWorld = "Hello New World!\0";
>   MPI_Bcast(helloWorld, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
>   printf("Processes spawned!\n");
> }
> 
> int main(int argc, char* argv[]) {
>   if (argc > 2) {
>   MPI_Init(&argc, &argv);
>   master(atoi(argv[1]), argv[0], argv[2]);
>   } else {
>   sleep(atoi(argv[1])); /// delay
>   MPI_Init(&argc, &argv);
>   slave();
>   }
>   MPI_Comm_free(&new_world);
>   MPI_Comm_free(&slave_comm);
>   MPI_Finalize();
> }
> 
> 
> Thank you,
> 
> Andrew Elistratov
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] MPI_Comm_spawn question

2017-01-31 Thread elistratovaa
Hi,

I am trying to write a trivial master-slave program. The master simply creates
slaves, sends them a string, and they print it out and exit. Everything works
just fine; however, when I add a delay (more than 2 sec) before calling
MPI_Init on the slave, MPI fails with MPI_ERR_SPAWN. I am pretty sure that
MPI_Comm_spawn has some kind of timeout while waiting for slaves to call
MPI_Init, and if they fail to respond in time, it returns an error.

I believe there is a way to change this behaviour, but I wasn't able to
find any suggestions/ideas on the internet.
I would appreciate it if someone could help with this.

---
--- terminal command I use to run the program:
mpirun -n 1 hello 2 2 // the first argument to "hello" is the number of
slaves, the second is the delay in seconds

--- Error message I get when delay is >=2 sec:
[host:2231] *** An error occurred in MPI_Comm_spawn
[host:2231] *** reported by process [3453419521,0]
[host:2231] *** on communicator MPI_COMM_SELF
[host:2231] *** MPI_ERR_SPAWN: could not spawn processes
[host:2231] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
[host:2231] ***and potentially your MPI job)

--- The program itself:
#include "stdlib.h"
#include "stdio.h"
#include "mpi.h"
#include "unistd.h"

MPI_Comm slave_comm;
MPI_Comm new_world;
#define MESSAGE_SIZE 40

void slave() {
printf("Slave initialized; ");
MPI_Comm_get_parent(&slave_comm);
MPI_Intercomm_merge(slave_comm, 1, &new_world);

int slave_rank;
MPI_Comm_rank(new_world, &slave_rank);

char message[MESSAGE_SIZE];
MPI_Bcast(message, MESSAGE_SIZE, MPI_CHAR, 0, new_world);

printf("Slave %d received message from master: %s\n", slave_rank, 
message);
}

void master(int slave_count, char* executable, char* delay) {
char* slave_argv[] = { delay, NULL };
MPI_Comm_spawn( executable,
slave_argv,
slave_count,
MPI_INFO_NULL,
0,
MPI_COMM_SELF,
&slave_comm,
MPI_ERRCODES_IGNORE);
MPI_Intercomm_merge(slave_comm, 0, &new_world);
char* helloWorld = "Hello New World!\0";
MPI_Bcast(helloWorld, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
printf("Processes spawned!\n");
}

int main(int argc, char* argv[]) {
if (argc > 2) {
MPI_Init(&argc, &argv);
master(atoi(argv[1]), argv[0], argv[2]);
} else {
sleep(atoi(argv[1])); /// delay
MPI_Init(&argc, &argv);
slave();
}
MPI_Comm_free(&new_world);
MPI_Comm_free(&slave_comm);
MPI_Finalize();
}


Thank you,

Andrew Elistratov


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] MPI_Comm_spawn

2016-09-29 Thread Cabral, Matias A
Hi Gilles et al.,

You are right, ptl.c is in the PSM2 code. As Ralph mentions, dynamic process 
support was/is not working in OMPI when using PSM2 because of an issue related 
to the transport keys. This was fixed in PR #1602 
(https://github.com/open-mpi/ompi/pull/1602) and should be included in v2.0.2. 
HOWEVER, this is not the error Juraj is seeing. The root of the assertion is 
that the PSM/PSM2 MTLs check where the “original” processes are running and, 
if they detect that all of them are local to the node, they will ONLY 
initialize the shared memory device (variable PSM2_DEVICES="self,shm"). This 
is to avoid “reserving” HW resources on the HFI card that wouldn’t be used 
unless you later spawn ranks on other nodes. Therefore, to allow dynamic 
processes to be spawned on other nodes, you need to tell PSM2 to instruct the 
HW to initialize all the devices by making the environment variable 
PSM2_DEVICES="self,shm,hfi" available before running the job.
Note that while setting PSM2_DEVICES (*) will resolve the assertion below, you 
will most likely still see the transport key issue if PR #1602 is not included.

Thanks,

_MAC

(*)
PSM2_DEVICES  -> Omni Path
PSM_DEVICES  -> TrueScale
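
For reference, a minimal sketch of making the variable visible to the job as
described above (mpirun's -x option exports the named environment variable to
the launched processes; the hostfile and executable names are taken from the
example earlier in this thread):

$ export PSM2_DEVICES="self,shm,hfi"
$ mpirun -x PSM2_DEVICES -np 1 -npernode 1 --hostfile my_hosts ./manager 1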

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, September 29, 2016 7:12 AM
To: Open MPI Users 
Subject: Re: [OMPI users] MPI_Comm_spawn

Ah, that may be why it wouldn’t show up in the OMPI code base itself. If that 
is the case here, then no - OMPI v2.0.1 does not support comm_spawn for PSM. It 
is fixed in the upcoming 2.0.2

On Sep 29, 2016, at 6:58 AM, Gilles Gouaillardet wrote:

Ralph,

My guess is that ptl.c comes from PSM lib ...

Cheers,

Gilles

On Thursday, September 29, 2016, r...@open-mpi.org wrote:
Spawn definitely does not work with srun. I don’t recognize the name of the 
file that segfaulted - what is “ptl.c”? Is that in your manager program?


On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet wrote:

Hi,

I do not expect spawn can work with direct launch (e.g. srun)

Do you have PSM (e.g. Infinipath) hardware ? That could be linked to the failure

Can you please try

mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1

and see if it help ?

Note if you have the possibility, I suggest you first try that without slurm, 
and then within a slurm job

Cheers,

Gilles

On Thursday, September 29, 2016, juraj2...@gmail.com wrote:
Hello,

I am using MPI_Comm_spawn to dynamically create new processes from single 
manager process. Everything works fine when all the processes are running on 
the same node. But imposing restriction to run only a single process per node 
does not work. Below are the errors produced during multinode interactive 
session and multinode sbatch job.

The system I am using is: Linux version 3.10.0-229.el7.x86_64 
(buil...@kbuilder.dev.centos.org) (gcc 
version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) )
I am using Open MPI 2.0.1
Slurm is version 15.08.9

What is preventing my jobs to spawn on multiple nodes? Does slurm requires some 
additional configuration to allow it? Is it issue on the MPI side, does it need 
to be compiled with some special flag (I have compiled it with 
--enable-mpi-fortran=all --with-pmi)?

The code I am launching is here: https://github.com/goghino/dynamicMPI

Manager tries to launch one new process (./manager 1), the error produced by 
requesting each process to be located on different node (interactive session):
$ salloc -N 2
$ cat my_hosts
icsnode37
icsnode38
$ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode37
icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
[icsnode37:12614] *** Process received signal ***
[icsnode37:12614] Signal: Aborted (6)
[icsnode37:12614] Signal code:  (-6)
[icsnode38:32443] *** Process received signal ***
[icsnode38:32443] Signal: Aborted (6)
[icsnode38:32443] Signal code:  (-6)

The same example as above via sbatch job submission:
$ cat job.sbatch
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
srun -n 1 -N 1 ./manager 1

$ cat output.o
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode39
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[icsnode39:9692] *** An error occurred in MPI_Comm_spawn
[icsnode39:9692] *** reported by process [1007812608,0]
[icsnode39:9692] *** on communicator MPI_COMM_SELF
[icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
[icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[icsnode39:9692] ***and potentially your MPI job)
In: PMI_Abort(50, N/A)
slurmstepd: *

[OMPI users] MPI_Comm_spawn

2016-09-29 Thread juraj2...@gmail.com
The solution was to use the "tcp", "sm" and "self" BTLs for the transport
of MPI messages, with TCP restricting only the eth0 interface to
communicate and using ob1 as p2p management layer:

mpirun --mca btl_tcp_if_include eth0 --mca pml ob1 --mca btl tcp,sm,self
-np 1 --hostfile my_hosts ./manager 1

​Thank you for your help!​
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn

2016-09-29 Thread r...@open-mpi.org
Ah, that may be why it wouldn’t show up in the OMPI code base itself. If that 
is the case here, then no - OMPI v2.0.1 does not support comm_spawn for PSM. It 
is fixed in the upcoming 2.0.2

> On Sep 29, 2016, at 6:58 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> My guess is that ptl.c comes from PSM lib ...
> 
> Cheers,
> 
> Gilles
> 
> On Thursday, September 29, 2016, r...@open-mpi.org  
> mailto:r...@open-mpi.org>> wrote:
> Spawn definitely does not work with srun. I don’t recognize the name of the 
> file that segfaulted - what is “ptl.c”? Is that in your manager program?
> 
> 
>> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet 
>> > > wrote:
>> 
>> Hi,
>> 
>> I do not expect spawn can work with direct launch (e.g. srun)
>> 
>> Do you have PSM (e.g. Infinipath) hardware ? That could be linked to the 
>> failure
>> 
>> Can you please try
>> 
>> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts 
>> ./manager 1
>> 
>> and see if it help ?
>> 
>> Note if you have the possibility, I suggest you first try that without 
>> slurm, and then within a slurm job
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Thursday, September 29, 2016, juraj2...@gmail.com 
>>  > > wrote:
>> Hello,
>> 
>> I am using MPI_Comm_spawn to dynamically create new processes from single 
>> manager process. Everything works fine when all the processes are running on 
>> the same node. But imposing restriction to run only a single process per 
>> node does not work. Below are the errors produced during multinode 
>> interactive session and multinode sbatch job.
>> 
>> The system I am using is: Linux version 3.10.0-229.el7.x86_64 
>> (buil...@kbuilder.dev.centos.org <>) (gcc version 4.8.2 20140120 (Red Hat 
>> 4.8.2-16) (GCC) )
>> I am using Open MPI 2.0.1
>> Slurm is version 15.08.9
>> 
>> What is preventing my jobs to spawn on multiple nodes? Does slurm requires 
>> some additional configuration to allow it? Is it issue on the MPI side, does 
>> it need to be compiled with some special flag (I have compiled it with 
>> --enable-mpi-fortran=all --with-pmi)? 
>> 
>> The code I am launching is here: https://github.com/goghino/dynamicMPI 
>> 
>> 
>> Manager tries to launch one new process (./manager 1), the error produced by 
>> requesting each process to be located on different node (interactive 
>> session):
>> $ salloc -N 2
>> $ cat my_hosts
>> icsnode37
>> icsnode38
>> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode37
>> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> [icsnode37:12614] *** Process received signal ***
>> [icsnode37:12614] Signal: Aborted (6)
>> [icsnode37:12614] Signal code:  (-6)
>> [icsnode38:32443] *** Process received signal ***
>> [icsnode38:32443] Signal: Aborted (6)
>> [icsnode38:32443] Signal code:  (-6)
>> 
>> The same example as above via sbatch job submission:
>> $ cat job.sbatch
>> #!/bin/bash
>> 
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=1
>> 
>> module load openmpi/2.0.1
>> srun -n 1 -N 1 ./manager 1
>> 
>> $ cat output.o
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode39
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
>> [icsnode39:9692] *** reported by process [1007812608,0]
>> [icsnode39:9692] *** on communicator MPI_COMM_SELF
>> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
>> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
>> will now abort,
>> [icsnode39:9692] ***and potentially your MPI job)
>> In: PMI_Abort(50, N/A)
>> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 
>> ***
>> srun: error: icsnode39: task 0: Exited with exit code 50
>> 
>> Thank for any feedback!
>> 
>> Best regards,
>> Juraj
>> ___
>> users mailing list
>> users@lists.open-mpi.org 
>> 
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users 
>> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn

2016-09-29 Thread Gilles Gouaillardet
Ralph,

My guess is that ptl.c comes from PSM lib ...

Cheers,

Gilles

On Thursday, September 29, 2016, r...@open-mpi.org  wrote:

> Spawn definitely does not work with srun. I don’t recognize the name of
> the file that segfaulted - what is “ptl.c”? Is that in your manager program?
>
>
> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> > wrote:
>
> Hi,
>
> I do not expect spawn can work with direct launch (e.g. srun)
>
> Do you have PSM (e.g. Infinipath) hardware ? That could be linked to the
> failure
>
> Can you please try
>
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts
> ./manager 1
>
> and see if it help ?
>
> Note if you have the possibility, I suggest you first try that without
> slurm, and then within a slurm job
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, juraj2...@gmail.com
>   > wrote:
>
>> Hello,
>>
>> I am using MPI_Comm_spawn to dynamically create new processes from single
>> manager process. Everything works fine when all the processes are running
>> on the same node. But imposing restriction to run only a single process per
>> node does not work. Below are the errors produced during multinode
>> interactive session and multinode sbatch job.
>>
>> The system I am using is: Linux version 3.10.0-229.el7.x86_64 (
>> buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat
>> 4.8.2-16) (GCC) )
>> I am using Open MPI 2.0.1
>> Slurm is version 15.08.9
>>
>> What is preventing my jobs to spawn on multiple nodes? Does slurm
>> requires some additional configuration to allow it? Is it issue on the MPI
>> side, does it need to be compiled with some special flag (I have compiled
>> it with --enable-mpi-fortran=all --with-pmi)?
>>
>> The code I am launching is here: https://github.com/goghino/dynamicMPI
>>
>> Manager tries to launch one new process (./manager 1), the error produced
>> by requesting each process to be located on different node (interactive
>> session):
>> $ salloc -N 2
>> $ cat my_hosts
>> icsnode37
>> icsnode38
>> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode37
>> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> [icsnode37:12614] *** Process received signal ***
>> [icsnode37:12614] Signal: Aborted (6)
>> [icsnode37:12614] Signal code:  (-6)
>> [icsnode38:32443] *** Process received signal ***
>> [icsnode38:32443] Signal: Aborted (6)
>> [icsnode38:32443] Signal code:  (-6)
>>
>> The same example as above via sbatch job submission:
>> $ cat job.sbatch
>> #!/bin/bash
>>
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=1
>>
>> module load openmpi/2.0.1
>> srun -n 1 -N 1 ./manager 1
>>
>> $ cat output.o
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode39
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
>> [icsnode39:9692] *** reported by process [1007812608,0]
>> [icsnode39:9692] *** on communicator MPI_COMM_SELF
>> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
>> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [icsnode39:9692] ***and potentially your MPI job)
>> In: PMI_Abort(50, N/A)
>> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT
>> 2016-09-26T16:48:20 ***
>> srun: error: icsnode39: task 0: Exited with exit code 50
>>
>> Thank for any feedback!
>>
>> Best regards,
>> Juraj
>>
> ___
> users mailing list
> users@lists.open-mpi.org
> 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn

2016-09-29 Thread r...@open-mpi.org
Spawn definitely does not work with srun. I don’t recognize the name of the 
file that segfaulted - what is “ptl.c”? Is that in your manager program?


> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet 
>  wrote:
> 
> Hi,
> 
> I do not expect spawn can work with direct launch (e.g. srun)
> 
> Do you have PSM (e.g. Infinipath) hardware ? That could be linked to the 
> failure
> 
> Can you please try
> 
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts 
> ./manager 1
> 
> and see if it help ?
> 
> Note if you have the possibility, I suggest you first try that without slurm, 
> and then within a slurm job
> 
> Cheers,
> 
> Gilles
> 
> On Thursday, September 29, 2016, juraj2...@gmail.com 
>   > wrote:
> Hello,
> 
> I am using MPI_Comm_spawn to dynamically create new processes from single 
> manager process. Everything works fine when all the processes are running on 
> the same node. But imposing restriction to run only a single process per node 
> does not work. Below are the errors produced during multinode interactive 
> session and multinode sbatch job.
> 
> The system I am using is: Linux version 3.10.0-229.el7.x86_64 
> (buil...@kbuilder.dev.centos.org 
> ) (gcc 
> version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) )
> I am using Open MPI 2.0.1
> Slurm is version 15.08.9
> 
> What is preventing my jobs to spawn on multiple nodes? Does slurm requires 
> some additional configuration to allow it? Is it issue on the MPI side, does 
> it need to be compiled with some special flag (I have compiled it with 
> --enable-mpi-fortran=all --with-pmi)? 
> 
> The code I am launching is here: https://github.com/goghino/dynamicMPI 
> 
> 
> Manager tries to launch one new process (./manager 1), the error produced by 
> requesting each process to be located on different node (interactive session):
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code:  (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code:  (-6)
> 
> The same example as above via sbatch job submission:
> $ cat job.sbatch
> #!/bin/bash
> 
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
> 
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
> 
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
> will now abort,
> [icsnode39:9692] ***and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
> srun: error: icsnode39: task 0: Exited with exit code 50
> 
> Thank for any feedback!
> 
> Best regards,
> Juraj
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn

2016-09-29 Thread Gilles Gouaillardet
Hi,

I do not expect spawn can work with direct launch (e.g. srun)

Do you have PSM (e.g. Infinipath) hardware ? That could be linked to the
failure

Can you please try

mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts
./manager 1

and see if it help ?

Note if you have the possibility, I suggest you first try that without
slurm, and then within a slurm job

Cheers,

Gilles

On Thursday, September 29, 2016, juraj2...@gmail.com 
wrote:

> Hello,
>
> I am using MPI_Comm_spawn to dynamically create new processes from single
> manager process. Everything works fine when all the processes are running
> on the same node. But imposing restriction to run only a single process per
> node does not work. Below are the errors produced during multinode
> interactive session and multinode sbatch job.
>
> The system I am using is: Linux version 3.10.0-229.el7.x86_64 (
> buil...@kbuilder.dev.centos.org
> ) (gcc
> version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) )
> I am using Open MPI 2.0.1
> Slurm is version 15.08.9
>
> What is preventing my jobs to spawn on multiple nodes? Does slurm requires
> some additional configuration to allow it? Is it issue on the MPI side,
> does it need to be compiled with some special flag (I have compiled it with
> --enable-mpi-fortran=all --with-pmi)?
>
> The code I am launching is here: https://github.com/goghino/dynamicMPI
>
> Manager tries to launch one new process (./manager 1), the error produced
> by requesting each process to be located on different node (interactive
> session):
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code:  (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code:  (-6)
>
> The same example as above via sbatch job submission:
> $ cat job.sbatch
> #!/bin/bash
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
>
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
>
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [icsnode39:9692] ***and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20
> ***
> srun: error: icsnode39: task 0: Exited with exit code 50
>
> Thank for any feedback!
>
> Best regards,
> Juraj
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] MPI_Comm_spawn

2016-09-29 Thread juraj2...@gmail.com
Hello,

I am using MPI_Comm_spawn to dynamically create new processes from single
manager process. Everything works fine when all the processes are running
on the same node. But imposing restriction to run only a single process per
node does not work. Below are the errors produced during multinode
interactive session and multinode sbatch job.

The system I am using is: Linux version 3.10.0-229.el7.x86_64 (
buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat
4.8.2-16) (GCC) )
I am using Open MPI 2.0.1
Slurm is version 15.08.9

What is preventing my jobs from spawning on multiple nodes? Does Slurm require
some additional configuration to allow it? Is it an issue on the MPI side;
does it need to be compiled with some special flag (I have compiled it with
--enable-mpi-fortran=all --with-pmi)?

The code I am launching is here: https://github.com/goghino/dynamicMPI

Manager tries to launch one new process (./manager 1), the error produced
by requesting each process to be located on different node (interactive
session):
$ salloc -N 2
$ cat my_hosts
icsnode37
icsnode38
$ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode37
icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
[icsnode37:12614] *** Process received signal ***
[icsnode37:12614] Signal: Aborted (6)
[icsnode37:12614] Signal code:  (-6)
[icsnode38:32443] *** Process received signal ***
[icsnode38:32443] Signal: Aborted (6)
[icsnode38:32443] Signal code:  (-6)

The same example as above via sbatch job submission:
$ cat job.sbatch
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load openmpi/2.0.1
srun -n 1 -N 1 ./manager 1

$ cat output.o
[manager]I'm running MPI 3.1
[manager]Runing on node icsnode39
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[icsnode39:9692] *** An error occurred in MPI_Comm_spawn
[icsnode39:9692] *** reported by process [1007812608,0]
[icsnode39:9692] *** on communicator MPI_COMM_SELF
[icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
[icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[icsnode39:9692] ***and potentially your MPI job)
In: PMI_Abort(50, N/A)
slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20
***
srun: error: icsnode39: task 0: Exited with exit code 50

Thanks for any feedback!

Best regards,
Juraj
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_spawn and shared memory

2015-05-14 Thread Radoslaw Martyniszyn
Hi Gilles,
Thanks for your answer.
BR,
Radek

On Thu, May 14, 2015 at 9:12 AM, Gilles Gouaillardet 
wrote:

>  This is a known limitation of the sm btl.
>
> FWIW, the vader btl (available in Open MPI 1.8) has the same limitation,
> thought i heard there are some works in progress to get rid of this
> limitation.
>
> Cheers,
>
> Gilles
>
>
> On 5/14/2015 3:52 PM, Radoslaw Martyniszyn wrote:
>
>  Dear developers of Open MPI,
>
>  I've created two applications: parent and child. Parent spawns children
> using MPI_Comm_spawn. I would like to use shared memory when they
> communicate. However, applications do not start when I try using sm. Please
> comment on that issue. If this feature is not supported, are there any
> plans to add support? Also, are there any examples showing MPI_Comm_spawn
> and shared memory?
>
> I am using Open MPI 1.6.5 on Ubuntu. Both applications are run locally on
> the same host.
>
> // Works fine
> mpirun --mca btl self,tcp ./parent
>
> // Application terminates
> mpirun --mca btl self,sm ./parent
>
> "At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL."
>
> Below are code snippets:
>
> parent.cc:
> #include <mpi.h>
> #include <iostream>
>
> int main(int argc, char** argv) {
>   MPI_Init(NULL, NULL);
>
>   std::string lProgram = "./child";
>   MPI_Comm lIntercomm;
>   int lRv;
>   lRv = MPI_Comm_spawn( const_cast< char* >(lProgram.c_str()),
> MPI_ARGV_NULL, 3,
>MPI_INFO_NULL, 0, MPI_COMM_WORLD, &lIntercomm,
>MPI_ERRCODES_IGNORE);
>
>   if ( MPI_SUCCESS == lRv) {
>   std::cout << "SPAWN SUCCESS" << std::endl;
>   sleep(10);
>   }
>   else {
>   std::cout << "SPAWN ERROR " << lRv << std::endl;
>   }
>
>   MPI_Finalize();
> }
>
>  child.cc:
> #include <mpi.h>
> #include <iostream>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
>   // Initialize the MPI environment
>   MPI_Init(NULL, NULL);
>
>   std::cout << "CHILD" << std::endl;
>   sleep(10);
>
>   MPI_Finalize();
> }
>
>  makefile (note, there are tabs not spaces preceding each target):
>  EXECS=child parent
> MPICC?=mpic++
>
> all: ${EXECS}
>
> child: child.cc
> ${MPICC} -o child child.cc
>
> parent: parent.cc
> ${MPICC} -o parent parent.cc
>
> clean:
> rm -f ${EXECS}
>
>
>  Greetings to all of you,
>  Radek Martyniszyn
>
>
>
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/05/26865.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26866.php
>


Re: [OMPI users] MPI_Comm_spawn and shared memory

2015-05-14 Thread Gilles Gouaillardet

This is a known limitation of the sm btl.

FWIW, the vader btl (available in Open MPI 1.8) has the same limitation,
though I heard there is some work in progress to get rid of this 
limitation.


Cheers,

Gilles
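
Given the limitation described above, a workaround that appears elsewhere in
this thread is to fall back to the TCP BTL (plus self) for the parent/child
communication instead of sm, for example:

$ mpirun --mca btl self,tcp ./parent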

On 5/14/2015 3:52 PM, Radoslaw Martyniszyn wrote:

Dear developers of Open MPI,

I've created two applications: parent and child. Parent spawns 
children using MPI_Comm_spawn. I would like to use shared memory when 
they communicate. However, applications do not start when I try using 
sm. Please comment on that issue. If this feature is not supported, 
are there any plans to add support? Also, are there any examples 
showing MPI_Comm_spawn and shared memory?


I am using Open MPI 1.6.5 on Ubuntu. Both applications are run locally 
on the same host.


// Works fine
mpirun --mca btl self,tcp ./parent

// Application terminates
mpirun --mca btl self,sm ./parent

"At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL."

Below are code snippets:

parent.cc:
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(NULL, NULL);

  std::string lProgram = "./child";
  MPI_Comm lIntercomm;
  int lRv;
  lRv = MPI_Comm_spawn( const_cast< char* >(lProgram.c_str()), 
MPI_ARGV_NULL, 3,

   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &lIntercomm,
   MPI_ERRCODES_IGNORE);

  if ( MPI_SUCCESS == lRv) {
  std::cout << "SPAWN SUCCESS" << std::endl;
  sleep(10);
  }
  else {
  std::cout << "SPAWN ERROR " << lRv << std::endl;
  }

  MPI_Finalize();
}

child.cc:
#include <mpi.h>
#include <iostream>
#include <unistd.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  std::cout << "CHILD" << std::endl;
  sleep(10);

  MPI_Finalize();
}

makefile (note, there are tabs not spaces preceding each target):
EXECS=child parent
MPICC?=mpic++

all: ${EXECS}

child: child.cc
${MPICC} -o child child.cc

parent: parent.cc
${MPICC} -o parent parent.cc

clean:
rm -f ${EXECS}


Greetings to all of you,
Radek Martyniszyn





___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26865.php




[OMPI users] MPI_Comm_spawn and shared memory

2015-05-14 Thread Radoslaw Martyniszyn
Dear developers of Open MPI,

I've created two applications: parent and child. Parent spawns children
using MPI_Comm_spawn. I would like to use shared memory when they
communicate. However, applications do not start when I try using sm. Please
comment on that issue. If this feature is not supported, are there any
plans to add support? Also, are there any examples showing MPI_Comm_spawn
and shared memory?

I am using Open MPI 1.6.5 on Ubuntu. Both applications are run locally on
the same host.

// Works fine
mpirun --mca btl self,tcp ./parent

// Application terminates
mpirun --mca btl self,sm ./parent

"At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL."

Below are code snippets:

parent.cc:
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(NULL, NULL);

  std::string lProgram = "./child";
  MPI_Comm lIntercomm;
  int lRv;
  lRv = MPI_Comm_spawn( const_cast< char* >(lProgram.c_str()),
MPI_ARGV_NULL, 3,
   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &lIntercomm,
   MPI_ERRCODES_IGNORE);

  if ( MPI_SUCCESS == lRv) {
  std::cout << "SPAWN SUCCESS" << std::endl;
  sleep(10);
  }
  else {
  std::cout << "SPAWN ERROR " << lRv << std::endl;
  }

  MPI_Finalize();
}

child.cc:
#include <mpi.h>
#include <iostream>
#include <unistd.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  std::cout << "CHILD" << std::endl;
  sleep(10);

  MPI_Finalize();
}

makefile (note, there are tabs not spaces preceding each target):
EXECS=child parent
MPICC?=mpic++

all: ${EXECS}

child: child.cc
${MPICC} -o child child.cc

parent: parent.cc
${MPICC} -o parent parent.cc

clean:
rm -f ${EXECS}


Greetings to all of you,
Radek Martyniszyn
#include <mpi.h>
#include <iostream>
#include <unistd.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  std::cout << "CHILD" << std::endl;
  sleep(10);

  MPI_Finalize();
}


makefile
Description: Binary data
#include <mpi.h>
#include <iostream>
#include <string>
#include <unistd.h>

int main(int argc, char** argv) {
  MPI_Init(NULL, NULL);

  std::string lProgram = "./child";
  MPI_Comm lIntercomm;
  int lRv;
  lRv = MPI_Comm_spawn( const_cast< char* >(lProgram.c_str()), MPI_ARGV_NULL, 3,
   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &lIntercomm,
   MPI_ERRCODES_IGNORE);

  if ( MPI_SUCCESS == lRv) {
  std::cout << "SPAWN SUCCESS" << std::endl;
  sleep(10);
  }
  else {
  std::cout << "SPAWN ERROR " << lRv << std::endl;
  }

  MPI_Finalize();
}



Re: [OMPI users] mpi_comm_spawn question

2014-07-03 Thread Milan Hodoscek
> "George" == George Bosilca  writes:

George> Why are you using system() the second time ? As you want
George> to spawn an MPI application calling MPI_Call_spawn would
George> make everything simpler.

Yes, this works! Very good trick... The system routine would be more
flexible, but for the method we are working on now, mpi_comm_spawn is also
OK.

Thanks -- Milan


Re: [OMPI users] mpi_comm_spawn question

2014-07-03 Thread George Bosilca
Why are you using system() the second time? As you want to spawn an MPI
application, calling MPI_Comm_spawn would make everything simpler.

George
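
For illustration, a minimal sketch in C of this suggestion (the worker
executable name and process count are placeholders, not taken from the
original code): instead of shelling out with system("mpirun -n 4 ./worker"),
the already-running MPI process spawns the workers directly:

#include <mpi.h>

/* Spawn nworkers copies of ./worker from an already-initialized MPI process
   and return the intercommunicator connecting the parent to the workers. */
static void spawn_workers(int nworkers, MPI_Comm *intercomm)
{
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF,
                   intercomm, MPI_ERRCODES_IGNORE);
}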

On Jul 3, 2014 4:34 PM, "Milan Hodoscek"  wrote:
>
> Hi,
>
> I am trying to run the following setup in fortran without much
> success:
>
> I have an MPI program, that uses mpi_comm_spawn which spawns some
> interface program that communicates with the one that spawned it. This
> spawned program then prepares some data and uses call system()
> statement in fortran. Now if the program that is called from system is
> not mpi program itself everything is running OK. But I want to run the
> program with something like mpirun -n X ... and then this is a no go.
>
> Different versions of open mpi give different messages before they
> either die or hang. I googled all the messages but all I get is just
> links to some openmpi sources, so I would appreciate if someone can
> help me explain how to run above setup. Given so many MCA options I
> hope there is one which can run the above setup ??
>
> The message for 1.6 is the following:
> ... routed:binomial: connection to lifeline lost (+ PIDs and port numbers)
>
> The message for 1.8.1 is:
> ... FORKING HNP: orted --hnp --set-sid --report-uri 18
--singleton-died-pipe 19 -mca state_novm_select 1 -mca ess_base_jobid
3378249728
>
>
> If this is not trivial to solve problem I can provide a simple test
> programs (we need 3) that show all of this.
>
> Thanks,
>
>
> Milan Hodoscek
> --
> National Institute of Chemistry  tel:+386-1-476-0278
> Hajdrihova 19fax:+386-1-476-0300
> SI-1000 Ljubljanae-mail: mi...@cmm.ki.si
> Slovenia web: http://a.cmm.ki.si
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
http://www.open-mpi.org/community/lists/users/2014/07/24744.php


Re: [OMPI users] mpi_comm_spawn question

2014-07-03 Thread Ralph Castain
Unfortunately, that has never been supported. The problem is that the embedded 
mpirun picks up all those MCA params that were provided to the original 
application process, and gets hopelessly confused. We have tried in the past to 
figure out a solution, but it has proved difficult to separate those params 
that were set during launch of the original child from ones you are trying to 
provide to the embedded mpirun.

So it remains an "unsupported" operation.


On Jul 3, 2014, at 7:34 AM, Milan Hodoscek  wrote:

> Hi,
> 
> I am trying to run the following setup in fortran without much
> success:
> 
> I have an MPI program, that uses mpi_comm_spawn which spawns some
> interface program that communicates with the one that spawned it. This
> spawned program then prepares some data and uses call system()
> statement in fortran. Now if the program that is called from system is
> not mpi program itself everything is running OK. But I want to run the
> program with something like mpirun -n X ... and then this is a no go.
> 
> Different versions of open mpi give different messages before they
> either die or hang. I googled all the messages but all I get is just
> links to some openmpi sources, so I would appreciate if someone can
> help me explain how to run above setup. Given so many MCA options I
> hope there is one which can run the above setup ??
> 
> The message for 1.6 is the following:
> ... routed:binomial: connection to lifeline lost (+ PIDs and port numbers)
> 
> The message for 1.8.1 is:
> ... FORKING HNP: orted --hnp --set-sid --report-uri 18 --singleton-died-pipe 
> 19 -mca state_novm_select 1 -mca ess_base_jobid 3378249728
> 
> 
> If this is not trivial to solve problem I can provide a simple test
> programs (we need 3) that show all of this.
> 
> Thanks,
> 
> 
> Milan Hodoscek  
> --
> National Institute of Chemistry  tel:+386-1-476-0278
> Hajdrihova 19fax:+386-1-476-0300
> SI-1000 Ljubljanae-mail: mi...@cmm.ki.si  
> Slovenia web: http://a.cmm.ki.si
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24744.php



[OMPI users] mpi_comm_spawn question

2014-07-03 Thread Milan Hodoscek
Hi,

I am trying to run the following setup in fortran without much
success:

I have an MPI program that uses mpi_comm_spawn to spawn some
interface program that communicates with the one that spawned it. This
spawned program then prepares some data and uses a call system()
statement in Fortran. Now, if the program that is called from system() is
not an MPI program itself, everything runs OK. But I want to run that
program with something like mpirun -n X ..., and then this is a no-go.

Different versions of Open MPI give different messages before they
either die or hang. I googled all the messages, but all I get is
links to some Open MPI sources, so I would appreciate it if someone could
explain how to run the above setup. Given so many MCA options, I
hope there is one which can make the above setup run.

The message for 1.6 is the following:
... routed:binomial: connection to lifeline lost (+ PIDs and port numbers)

The message for 1.8.1 is:
... FORKING HNP: orted --hnp --set-sid --report-uri 18 --singleton-died-pipe 19 
-mca state_novm_select 1 -mca ess_base_jobid 3378249728


If this is not a trivial problem to solve, I can provide simple test
programs (we need 3) that show all of this.

Thanks,


Milan Hodoscek  
--
National Institute of Chemistry  tel:+386-1-476-0278
Hajdrihova 19fax:+386-1-476-0300
SI-1000 Ljubljanae-mail: mi...@cmm.ki.si  
Slovenia web: http://a.cmm.ki.si


Re: [OMPI users] MPI_Comm_spawn and exported variables

2013-12-20 Thread Ralph Castain
Funny, but I couldn't find the code path that supported that in the latest 1.6 
series release (didn't check earlier ones) - but no matter, it seems logical 
enough. Fixed in the trunk and cmr'd to 1.7.4

Thanks!
Ralph

On Dec 19, 2013, at 8:08 PM, Tim Miller  wrote:

> Hi Ralph,
> 
> That's correct. All of the original processes see the -x values, but spawned 
> ones do not.
> 
> Regards,
> Tim
> 
> 
> On Thu, Dec 19, 2013 at 6:09 PM, Ralph Castain  wrote:
> 
> On Dec 19, 2013, at 2:57 PM, Tim Miller  wrote:
> 
> > Hi All,
> >
> > I have a question similar (but not identical to) the one asked by Tom Fogel 
> > a week or so back...
> >
> > I have a code that uses MPI_Comm_spawn to launch different processes. The 
> > executables for these use libraries in non-standard locations, so what I've 
> > done is add the directories containing them to my LD_LIBRARY_PATH 
> > environment variable, and then calling mpirun with "-x LD_LIBRARY_PATH". 
> > This works well for me on OpenMPI 1.6.3 and earlier. However, I've been 
> > playing with OpenMPI 1.7.3 and this no longer seems to work. As soon as my 
> > code MPI_Comm_spawns, all the spawned processes die complaining that they 
> > can't find the correct libraries to start the executable.
> >
> > Has there been a way that exported variables are passed to spawned 
> > processes between OpenMPI 1.6 and 1.7?
> 
> Not intentionally, though it is possible that some bug crept into the code. 
> If I understand correctly, the -x values are being seen by the original 
> procs, but not by the comm_spawned ones?
> 
> 
> > Is there something else I'm doing wrong here?
> >
> > Best Regards,
> > Tim
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Tim Miller
Hi Ralph,

That's correct. All of the original processes see the -x values, but
spawned ones do not.

Regards,
Tim


On Thu, Dec 19, 2013 at 6:09 PM, Ralph Castain  wrote:

>
> On Dec 19, 2013, at 2:57 PM, Tim Miller  wrote:
>
> > Hi All,
> >
> > I have a question similar (but not identical to) the one asked by Tom
> Fogel a week or so back...
> >
> > I have a code that uses MPI_Comm_spawn to launch different processes.
> The executables for these use libraries in non-standard locations, so what
> I've done is add the directories containing them to my LD_LIBRARY_PATH
> environment variable, and then calling mpirun with "-x LD_LIBRARY_PATH".
> This works well for me on OpenMPI 1.6.3 and earlier. However, I've been
> playing with OpenMPI 1.7.3 and this no longer seems to work. As soon as my
> code MPI_Comm_spawns, all the spawned processes die complaining that they
> can't find the correct libraries to start the executable.
> >
> > Has there been a way that exported variables are passed to spawned
> processes between OpenMPI 1.6 and 1.7?
>
> Not intentionally, though it is possible that some bug crept into the
> code. If I understand correctly, the -x values are being seen by the
> original procs, but not by the comm_spawned ones?
>
>
> > Is there something else I'm doing wrong here?
> >
> > Best Regards,
> > Tim
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Ralph Castain

On Dec 19, 2013, at 2:57 PM, Tim Miller  wrote:

> Hi All,
> 
> I have a question similar (but not identical to) the one asked by Tom Fogel a 
> week or so back...
> 
> I have a code that uses MPI_Comm_spawn to launch different processes. The 
> executables for these use libraries in non-standard locations, so what I've 
> done is add the directories containing them to my LD_LIBRARY_PATH environment 
> variable, and then call mpirun with "-x LD_LIBRARY_PATH". This works well 
> for me on OpenMPI 1.6.3 and earlier. However, I've been playing with OpenMPI 
> 1.7.3 and this no longer seems to work. As soon as my code MPI_Comm_spawns, 
> all the spawned processes die complaining that they can't find the correct 
> libraries to start the executable.
> 
> Has the way exported variables are passed to spawned processes changed 
> between OpenMPI 1.6 and 1.7?

Not intentionally, though it is possible that some bug crept into the code. If 
I understand correctly, the -x values are being seen by the original procs, but 
not by the comm_spawned ones?


> Is there something else I'm doing wrong here?
> 
> Best Regards,
> Tim
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Tim Miller
Hi All,

I have a question similar (but not identical to) the one asked by Tom Fogel
a week or so back...

I have a code that uses MPI_Comm_spawn to launch different processes. The
executables for these use libraries in non-standard locations, so what I've
done is add the directories containing them to my LD_LIBRARY_PATH
environment variable, and then call mpirun with "-x LD_LIBRARY_PATH".
This works well for me on OpenMPI 1.6.3 and earlier. However, I've been
playing with OpenMPI 1.7.3 and this no longer seems to work. As soon as my
code MPI_Comm_spawns, all the spawned processes die complaining that they
can't find the correct libraries to start the executable.

Has the way exported variables are passed to spawned processes changed
between OpenMPI 1.6 and 1.7? Is there something else I'm doing
wrong here?

Best Regards,
Tim


Re: [OMPI users] MPI_Comm_spawn and exit of parent process.

2012-06-18 Thread Ralph Castain
One further point that I missed in my earlier note: if you are starting the 
parent as a singleton, then you are fooling yourself about the "without mpirun" 
comment. A singleton immediately starts a local daemon to act as mpirun so that 
comm_spawn will work. Otherwise, there is no way to launch the child processes.

So you might as well just launch the "child" job directly with mpirun - the 
result is exactly the same. If you truly want the job to use all the cores, one 
proc per core, and don't want to tell it the number of cores, then use the OMPI 
devel trunk where we have added support for such patterns. All you would have 
to do is:

mpirun -ppr 1:core --bind-to core ./my_app

and you are done.


On Jun 18, 2012, at 4:27 AM, TERRY DONTJE wrote:

> On 6/16/2012 8:03 AM, Roland Schulz wrote:
>> 
>> Hi,
>> 
>> I would like to start a single process without mpirun and then use 
>> MPI_Comm_spawn to start up as many processes as required. I don't want the 
>> parent process to take up any resources, so I tried to disconnect the inter 
>> communicator and then finalize mpi and exit the parent. But as soon as I do 
>> that the children exit too. Why is that? Can I somehow change that behavior? 
>> Or can I wait on the children to exit without the waiting taking up CPU time?
>> 
>> The reason I don't need the parent as soon as the children are spawned, is 
>> that I need one intra-communicator over all processes. And as far as I know 
>> I cannot join the parent and children to one intra-communicator. 
> You could use MPI_Intercomm_merge to create an intra-communicator out of the 
> groups in an inter-communicator and pass the inter-communicator you get back 
> from the MPI_Comm_spawn call.
> 
> --td
>> 
>> The purpose of the whole exercise is that I want my program to use all 
>> cores of a node by default when executed without mpirun.
>> 
>> I have tested this with OpenMPI 1.4.5. A sample program is here: 
>> http://pastebin.com/g2XSZwvY . "Child finalized" is only printed with the 
>> sleep(2) in the parent not commented out.
>> 
>> Roland
>> 
>> -- 
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>> 865-241-1537, ORNL PO BOX 2008 MS6309
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> -- 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn and exit of parent process.

2012-06-18 Thread TERRY DONTJE

On 6/16/2012 8:03 AM, Roland Schulz wrote:

Hi,

I would like to start a single process without mpirun and then use 
MPI_Comm_spawn to start up as many processes as required. I don't want 
the parent process to take up any resources, so I tried to disconnect 
the inter communicator and then finalize mpi and exit the parent. But 
as soon as I do that the children exit too. Why is that? Can I somehow 
change that behavior? Or can I wait on the children to exit without 
the waiting taking up CPU time?


The reason I don't need the parent as soon as the children are 
spawned, is that I need one intra-communicator over all processes. And 
as far as I know I cannot join the parent and children to one 
intra-communicator.
You could use MPI_Intercomm_merge to create an intra-communicator out of 
the groups in an inter-communicator and pass the inter-communicator you 
get back from the MPI_Comm_spawn call.
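
A minimal C sketch of that merge pattern, assuming the parent and the
children run the same binary (the spawn count of 4 and the printed text are
just illustration):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm, everyone;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* parent: spawn 4 children, then merge; high=0 orders the parent group first */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(intercomm, 0, &everyone);
    } else {
        /* child: merge over the inter-communicator from MPI_Comm_get_parent */
        MPI_Intercomm_merge(parent, 1, &everyone);
    }

    MPI_Comm_rank(everyone, &rank);
    MPI_Comm_size(everyone, &size);
    printf("rank %d of %d in the merged intra-communicator\n", rank, size);

    MPI_Comm_free(&everyone);
    MPI_Finalize();
    return 0;
}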


--td


The purpose of the whole exercise is that I want my program to 
use all cores of a node by default when executed without mpirun.


I have tested this with OpenMPI 1.4.5. A sample program is here: 
http://pastebin.com/g2XSZwvY . "Child finalized" is only printed with 
the sleep(2) in the parent not commented out.


Roland

--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov 
865-241-1537, ORNL PO BOX 2008 MS6309


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] MPI_Comm_spawn and exit of parent process.

2012-06-16 Thread Ralph Castain
I'm afraid there is no option to keep the job alive if the parent exits. I 
could give you several reasons for that behavior, but the bottom line is that 
it can't be changed.

Why don't you have the parent loop across "sleep", waking up periodically to 
check for a "we are done" message from a child? That would take essentially no 
CPU and meet your need.
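
A rough C sketch of that parent-side loop, assuming the child sends an empty
"done" message with tag 0 on the inter-communicator before it finalizes (the
tag and the one-second interval are arbitrary choices):

#include <unistd.h>
#include <mpi.h>

/* Parent side: wait for the child's "we are done" message in a sleep/test
 * loop instead of a busy-polling blocking receive. 'intercomm' is the
 * communicator returned by MPI_Comm_spawn. */
static void wait_for_done_message(MPI_Comm intercomm)
{
    MPI_Request req;
    char dummy;
    int done = 0;

    MPI_Irecv(&dummy, 0, MPI_CHAR, MPI_ANY_SOURCE, 0, intercomm, &req);
    while (!done) {
        sleep(1);                  /* parent uses essentially no CPU here */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
}

The matching call on the child side would be something like
MPI_Send(&dummy, 0, MPI_CHAR, 0, 0, parent) just before MPI_Finalize().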


On Jun 16, 2012, at 6:03 AM, Roland Schulz wrote:

> Hi,
> 
> I would like to start a single process without mpirun and then use 
> MPI_Comm_spawn to start up as many processes as required. I don't want the 
> parent process to take up any resources, so I tried to disconnect the inter 
> communicator and then finalize mpi and exit the parent. But as soon as I do 
> that the children exit too. Why is that? Can I somehow change that behavior? 
> Or can I wait on the children to exit without the waiting taking up CPU time?
> 
> The reason I don't need the parent as soon as the children are spawned, is 
> that I need one intra-communicator over all processes. And as far as I know I 
> cannot join the parent and children to one intra-communicator. 
> 
> The purpose of the whole exercise is that I want my program to use all 
> cores of a node by default when executed without mpirun.
> 
> I have tested this with OpenMPI 1.4.5. A sample program is here: 
> http://pastebin.com/g2XSZwvY . "Child finalized" is only printed with the 
> sleep(2) in the parent not commented out.
> 
> Roland
> 
> -- 
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] MPI_Comm_spawn and exit of parent process.

2012-06-16 Thread Roland Schulz
Hi,

I would like to start a single process without mpirun and then use
MPI_Comm_spawn to start up as many processes as required. I don't want the
parent process to take up any resources, so I tried to disconnect the inter
communicator and then finalize mpi and exit the parent. But as soon as I do
that the children exit too. Why is that? Can I somehow change that
behavior? Or can I wait on the children to exit without the waiting taking
up CPU time?

The reason I don't need the parent as soon as the children are spawned, is
that I need one intra-communicator over all processes. And as far as I know
I cannot join the parent and children to one intra-communicator.

The purpose of the whole exercise is that I want my program to use all
cores of a node by default when executed without mpirun.

I have tested this with OpenMPI 1.4.5. A sample program is here:
http://pastebin.com/g2XSZwvY . "Child finalized" is only printed with the
sleep(2) in the parent not commented out.

Roland

-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


[OMPI users] MPI_Comm_spawn problem

2011-12-05 Thread Fernanda Oliveira
Hi,

I'm working with MPI_Comm_spawn and I have some error messages.

The code is relatively simple:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char ** argv){

int i;
int rank, size, child_rank;
char nomehost[20];
MPI_Comm parent, intercomm1, intercomm2;
int erro;
int level, curr_level;


MPI_Init(&argc, &argv);
level = atoi(argv[1]);

MPI_Comm_get_parent(&parent);

if(parent == MPI_COMM_NULL){
rank=0;
}
else{
MPI_Recv(&rank, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
}

curr_level = (int) log2(rank+1);

printf(" --> rank: %d and curr_level: %d\n", rank, curr_level);

// Node propagation
if(curr_level < level){

// 2^(curr_level+1) - 1 + 2*(rank - 2^curr_level - 1)
= 2*rank + 1
child_rank = 2*rank + 1;
printf("(%d) Before create rank %d\n", rank, child_rank);
MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &intercomm1, &erro);
printf("(%d) After create rank %d\n", rank, child_rank);

MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm1);

//sleep(1);

child_rank = child_rank + 1;
printf("(%d) Before create rank %d\n", rank, child_rank);
MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &intercomm2, &erro);
printf("(%d) After create rank %d\n", rank, child_rank);

MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm2);

}

gethostname(nomehost, 20);
printf("(%d) in %s\n", rank, nomehost);

MPI_Finalize();
return(0);

}

The program will create a binary tree of processes until it reaches a
specific level, determined by the variable "level". If the level is 2,
the tree will be:

          (0)
         /   \
      (1)     (2)
     /   \   /   \
   (3)  (4) (5)  (6)

The error messages are (when I use 1 host):

Compiling: mpicc test.c -o test -lm
Running: mpirun -np 1 ./test 3

 --> rank: 0 and curr_level: 0
(0) Before create rank 1
(0) After create rank 1
(0) Before create rank 2
 --> rank: 1 and curr_level: 1
(1) Before create rank 3
[cacau.ic.uff.br:17892] [[31928,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 75

When I use 2 hosts, the error is worse. The code is similar to the one shown
here (I have to set the hosts before the spawn with MPI_Info_set).
Using MPILAM, the program runs normally.
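
(For reference, the per-spawn host placement mentioned above is normally done
through the reserved "host" info key; in this sketch "node01" and "./child"
are placeholders, not part of the original code.)

#include <mpi.h>

/* Sketch only: place a single spawned child on a specific host. */
int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node01");
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}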

I think something goes wrong when I try to use 2 MPI_Comm_spawn calls
consecutively and the child processes spawn further processes too.
It seems to be a race condition because the error does not always happen
(when the level is 2, for example). With 3 levels or more, the error is
recurrent.

A similar error was previously posted in another thread:
http://www.open-mpi.org/community/lists/users/2009/12/11601.php
However, I used the stable version 1.4.4 and this problem still happens.
Do the developers plan to fix it?

Thanks,
Fernanda


Re: [OMPI users] MPI_Comm_Spawn intercommunication

2011-01-22 Thread Jeff Squyres
Try using MPI_COMM_REMOTE_SIZE to get the size of the remote group in an 
intercommunicator.  MPI_COMM_SIZE returns the size of the local group.
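
A minimal C sketch of the distinction, with the parent side spawning a single
child running the same binary (the program name is taken from argv[0]):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;
    int lsize, rsize;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
    } else {
        inter = parent;
    }

    MPI_Comm_size(inter, &lsize);          /* size of my own (local) group */
    MPI_Comm_remote_size(inter, &rsize);   /* size of the group on the other side */
    printf("local group: %d, remote group: %d\n", lsize, rsize);

    MPI_Finalize();
    return 0;
}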


On Jan 7, 2011, at 6:22 PM, Pierre Chanial wrote:

> Hello,
> 
> When I run this code:
> 
> program testcase
> 
> use mpi
> implicit none
> 
> integer :: rank, lsize, rsize, code
> integer :: intercomm
> 
> call MPI_INIT(code)
> 
> call MPI_COMM_GET_PARENT(intercomm, code)
> if (intercomm == MPI_COMM_NULL) then
> call MPI_COMM_SPAWN ("./testcase", MPI_ARGV_NULL, 1, MPI_INFO_NULL, &
>  0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE, code)
> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
> call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
> call MPI_COMM_SIZE(intercomm, rsize, code)
> if (rank == 0) then
> print *, 'from parent: local size is ', lsize
> print *, 'from parent: remote size is ', rsize
> end if
> else
> call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
> call MPI_COMM_SIZE(intercomm, rsize, code)
> print *, 'from child: local size is ', lsize
> print *, 'from child: remote size is ', rsize
> end if
> 
> call MPI_FINALIZE (code)
> 
> end program testcase
> 
> I get the following results with openmpi 1.4.1 and two processes:
>  from parent: local size is    2
>  from parent: remote size is   2
>  from child: local size is     1
>  from child: remote size is    1
> 
> I would have expected:
>  from parent: local size is    2
>  from parent: remote size is   1
>  from child: local size is     1
>  from child: remote size is    2
> 
> Could anyone tell me what's going on? It's not a Fortran issue; I can also 
> replicate it using mpi4py.
> Probably related to the universe size: I haven't found a way to hand it to 
> mpirun.
> 
> Cheers,
> Pierre
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] MPI_Comm_Spawn intercommunication

2011-01-07 Thread Pierre Chanial
Hello,

When I run this code:

program testcase

use mpi
implicit none

integer :: rank, lsize, rsize, code
integer :: intercomm

call MPI_INIT(code)

call MPI_COMM_GET_PARENT(intercomm, code)
if (intercomm == MPI_COMM_NULL) then
call MPI_COMM_SPAWN ("./testcase", MPI_ARGV_NULL, 1, MPI_INFO_NULL, &
 0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE, code)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
call MPI_COMM_SIZE(intercomm, rsize, code)
if (rank == 0) then
print *, 'from parent: local size is ', lsize
print *, 'from parent: remote size is ', rsize
end if
else
call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
call MPI_COMM_SIZE(intercomm, rsize, code)
print *, 'from child: local size is ', lsize
print *, 'from child: remote size is ', rsize
end if

call MPI_FINALIZE (code)

end program testcase

I get the following results with openmpi 1.4.1 and two processes:
 from parent: local size is    2
 from parent: remote size is   2
 from child: local size is     1
 from child: remote size is    1


I would have expected:
 from parent: local size is    2
 from parent: remote size is   1
 from child: local size is     1
 from child: remote size is    2



Could anyone tell me what's going on? It's not a Fortran issue; I can also
replicate it using mpi4py.
Probably related to the universe size: I haven't found a way to hand it to
mpirun.

Cheers,
Pierre


Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Milan Hodoscek
> "Ralph" == Ralph Castain  writes:

Ralph> On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

>>> "Ralph" == Ralph Castain  writes:
>> 
Ralph> I'm not sure why the group communicator would make a
Ralph> difference - the code area in question knows nothing about
Ralph> the mpi aspects of the job. It looks like you are hitting a
Ralph> race condition that causes a particular internal recv to
Ralph> not exist when we subsequently try to cancel it, which
Ralph> generates that error message.  How did you configure OMPI?
>> 
>> Thank you for the reply!
>> 
>> Must be some race problem, but I have no control of it, or do
>> I?

Ralph> Not really. What I don't understand is why your code would
Ralph> work fine when using comm_world, but encounter a race
Ralph> condition when using comm groups. There shouldn't be any
Ralph> timing difference between the two cases.

Fixing a race condition is sometimes easy by putting some variables into
arrays. I just did that for one of them, but it didn't help. I'll do
some more testing in this direction, but I am running out of ideas.
When you put ngrp=1 and uncomment the other mpi_comm_spawn line in the
program, you basically get only one spawn, so there is no opportunity for
a race condition. In my real project I usually work with many spawn
calls, all using mpi_comm_world but running different programs, and that
always works. This time I want to localize the mpi_comm_spawns with a
trick similar to the one in the program I sent, so this small test case
is a good model of what I would like to have. I studied the MPI-2
standard and I think I got it right, but one never knows...

Ralph> I'll have to take a look and see if I can spot something in
Ralph> the code...

Thanks a lot -- Milan


Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Ralph Castain

On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

>> "Ralph" == Ralph Castain  writes:
> 
>Ralph> I'm not sure why the group communicator would make a
>Ralph> difference - the code area in question knows nothing about
>Ralph> the mpi aspects of the job. It looks like you are hitting a
>Ralph> race condition that causes a particular internal recv to
>Ralph> not exist when we subsequently try to cancel it, which
>Ralph> generates that error message.  How did you configure OMPI?
> 
> Thank you for the reply!
> 
> Must be some race problem, but I have no control of it, or do I?

Not really. What I don't understand is why your code would work fine when using 
comm_world, but encounter a race condition when using comm groups. There 
shouldn't be any timing difference between the two cases.

> 
> These are the configure options that gentoo compiles openmpi-1.4.2 with:
> 
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu 
> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info 
> --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib 
> --libdir=/usr/lib64 --sysconfdir=/etc/openmpi --without-xgrid 
> --enable-pretty-print-stacktrace --enable-orterun-prefix-by-default 
> --without-slurm --enable-contrib-no-build=vt --enable-mpi-cxx 
> --disable-io-romio --disable-heterogeneous --without-tm --enable-ipv6
> 

This looks okay.

I'll have to take a look and see if I can spot something in the code...




Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Milan Hodoscek
> "Ralph" == Ralph Castain  writes:

Ralph> I'm not sure why the group communicator would make a
Ralph> difference - the code area in question knows nothing about
Ralph> the mpi aspects of the job. It looks like you are hitting a
Ralph> race condition that causes a particular internal recv to
Ralph> not exist when we subsequently try to cancel it, which
Ralph> generates that error message.  How did you configure OMPI?

Thank you for the reply!

Must be some race problem, but I have no control of it, or do I?

These are the configure options that gentoo compiles openmpi-1.4.2 with:

./configure --prefix=/usr --build=x86_64-pc-linux-gnu 
--host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info 
--datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib 
--libdir=/usr/lib64 --sysconfdir=/etc/openmpi --without-xgrid 
--enable-pretty-print-stacktrace --enable-orterun-prefix-by-default 
--without-slurm --enable-contrib-no-build=vt --enable-mpi-cxx 
--disable-io-romio --disable-heterogeneous --without-tm --enable-ipv6



Re: [OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-04 Thread Ralph Castain
I'm not sure why the group communicator would make a difference - the code area 
in question knows nothing about the mpi aspects of the job. It looks like you 
are hitting a race condition that causes a particular internal recv to not 
exist when we subsequently try to cancel it, which generates that error message.

How did you configure OMPI?


On Oct 3, 2010, at 6:40 PM, Milan Hodoscek wrote:

> Hi,
> 
> I am a long time happy user of mpi_comm_spawn() routine. But so far I
> used it only with the MPI_COMM_WORLD communicator. Now I want to
> execute more mpi_comm_spawn() routines, by creating and using group
> communicators. However this seems to have some problems. I can get it
> to run about 50% times on my laptop, but on some more "speedy"
> machines it just produces the following message:
> 
> $ mpirun -n 4 a.out
> [ala:31406] [[45304,0],0] ORTE_ERROR_LOG: Not found in file 
> base/plm_base_launch_support.c at line 758
> --
> mpirun was unable to start the specified application as it encountered an 
> error.
> More information may be available above.
> --
> 
> I am attaching the 2 programs needed to test the behavior. Compile:
> $ mpif90 -o sps sps.f08 # spawned program
> $ mpif90 mspbug.f08 # program with problems
> $ mpirun -n 4 a.out
> 
> The compiler is gfortran-4.4.4, and openmpi is 1.4.2.
> 
> Needless to say it runs with mpich2, but mpich2 doesn't know how to
> deal with stdin on a spawned process, so it's useless for my project :-(
> 
> Any ideas?
> 
> -
> program sps
>  use mpi
>  implicit none
>  integer :: ier,nproc,me,pcomm,meroot,mi,on
>  integer, dimension(1:10) :: num
> 
>  call mpi_init(ier)
> 
>  mi=mpi_integer
>  call mpi_comm_rank(mpi_comm_world,me,ier)
>  meroot=0
> 
>  on=1
> 
>  call mpi_comm_get_parent(pcomm,ier)
> 
>  call mpi_bcast(num,on,mi,meroot,pcomm,ier)
>  write(*,*)'sps>me,num=',me,num(on)
> 
>  call mpi_finalize(ier)
> 
> end program sps
> -
> 
> program groupspawn
> 
>  use mpi
> 
>  implicit none
>  ! in the case use mpi does not work (eg Ubuntu) use the include below
>  ! include 'mpif.h'
>  integer :: ier,intercom,nproc,meroot,info,mpierrs(1),mcw
>  integer :: i,myrepsiz,me,np,mcg,repdgrp,repdcom,on,mi,op
>  integer, dimension(1:10) :: myrepgrp
>  character(len=5) :: sarg(1),prog
>  integer, dimension(1:10) :: num,sm
>  integer :: newme,ngrp,igrp
> 
>  call mpi_init(ier)
> 
>  prog='sps'
>  sarg(1) = ''
>  nproc=2
>  on=1
>  meroot=0
>  mcw=mpi_comm_world
>  info=mpi_info_null
>  mi=mpi_integer
>  op=mpi_sum
>  mpierrs(1)=mpi_errcodes_ignore(1)
> 
>  call mpi_comm_rank(mcw,me,ier)
>  call mpi_comm_size(mcw,np,ier)
> 
>  ngrp=2  ! lets have some groups
>  myrepsiz=np/ngrp
>  igrp=me/myrepsiz
>  do i = 1, myrepsiz
>myrepgrp(i)=i+me-mod(me,myrepsiz)-1
>  enddo
> 
>  call mpi_comm_group(mcw,mcg,ier)
>  call mpi_group_incl(mcg,myrepsiz,myrepgrp,repdgrp,ier)
>  call mpi_comm_create(mcw,repdgrp,repdcom,ier)
> 
> !  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,mcw,intercom,mpierrs,ier)
>  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,repdcom,intercom,mpierrs,ier)
> 
>  ! send a number to spawned ones...
> 
>  call mpi_comm_rank(intercom,newme,ier)
>  write(*,*)'me,intercom,newme=',me,intercom,newme
>  num(1)=111*(igrp+1)
> 
>  meroot=mpi_proc_null
>  if(newme == 0) meroot=mpi_root ! to send data
> 
>  call mpi_bcast(num,on,mi,meroot,intercom,ier)
>  ! sometimes there is no output from sps programs, so we wait here: WEIRD :-(
>  !call sleep(1)
> 
>  call mpi_finalize(ier)
> 
> end program groupspawn
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] mpi_comm_spawn have problems with group communicators

2010-10-03 Thread Milan Hodoscek
Hi,

I am a long-time happy user of the mpi_comm_spawn() routine, but so far I
have used it only with the MPI_COMM_WORLD communicator. Now I want to
execute more mpi_comm_spawn() calls by creating and using group
communicators. However, this seems to have some problems. I can get it
to run about 50% of the time on my laptop, but on some more "speedy"
machines it just produces the following message:

$ mpirun -n 4 a.out
[ala:31406] [[45304,0],0] ORTE_ERROR_LOG: Not found in file 
base/plm_base_launch_support.c at line 758
--
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
--

I am attaching the 2 programs needed to test the behavior. Compile:
$ mpif90 -o sps sps.f08 # spawned program
$ mpif90 mspbug.f08 # program with problems
$ mpirun -n 4 a.out

The compiler is gfortran-4.4.4, and openmpi is 1.4.2.

Needless to say it runs with mpich2, but mpich2 doesn't know how to
deal with stdin on a spawned process, so it's useless for my project :-(

Any ideas?

-
program sps
  use mpi
  implicit none
  integer :: ier,nproc,me,pcomm,meroot,mi,on
  integer, dimension(1:10) :: num

  call mpi_init(ier)

  mi=mpi_integer
  call mpi_comm_rank(mpi_comm_world,me,ier)
  meroot=0

  on=1

  call mpi_comm_get_parent(pcomm,ier)

  call mpi_bcast(num,on,mi,meroot,pcomm,ier)
  write(*,*)'sps>me,num=',me,num(on)

  call mpi_finalize(ier)

end program sps
-

program groupspawn

  use mpi

  implicit none
  ! in the case use mpi does not work (eg Ubuntu) use the include below
  ! include 'mpif.h'
  integer :: ier,intercom,nproc,meroot,info,mpierrs(1),mcw
  integer :: i,myrepsiz,me,np,mcg,repdgrp,repdcom,on,mi,op
  integer, dimension(1:10) :: myrepgrp
  character(len=5) :: sarg(1),prog
  integer, dimension(1:10) :: num,sm
  integer :: newme,ngrp,igrp

  call mpi_init(ier)

  prog='sps'
  sarg(1) = ''
  nproc=2
  on=1
  meroot=0
  mcw=mpi_comm_world
  info=mpi_info_null
  mi=mpi_integer
  op=mpi_sum
  mpierrs(1)=mpi_errcodes_ignore(1)

  call mpi_comm_rank(mcw,me,ier)
  call mpi_comm_size(mcw,np,ier)

  ngrp=2  ! lets have some groups
  myrepsiz=np/ngrp
  igrp=me/myrepsiz
  do i = 1, myrepsiz
myrepgrp(i)=i+me-mod(me,myrepsiz)-1
  enddo

  call mpi_comm_group(mcw,mcg,ier)
  call mpi_group_incl(mcg,myrepsiz,myrepgrp,repdgrp,ier)
  call mpi_comm_create(mcw,repdgrp,repdcom,ier)

!  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,mcw,intercom,mpierrs,ier)
  call mpi_comm_spawn(prog,sarg,nproc,info,meroot,repdcom,intercom,mpierrs,ier)

  ! send a number to spawned ones...

  call mpi_comm_rank(intercom,newme,ier)
  write(*,*)'me,intercom,newme=',me,intercom,newme
  num(1)=111*(igrp+1)

  meroot=mpi_proc_null
  if(newme == 0) meroot=mpi_root ! to send data

  call mpi_bcast(num,on,mi,meroot,intercom,ier)
  ! sometimes there is no output from sps programs, so we wait here: WEIRD :-(
  !call sleep(1)

  call mpi_finalize(ier)

end program groupspawn


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-18 Thread Nicolas Bock
Hi Ralph,

I have confirmed that openmpi-1.4a1r22335 works with my master, slave
example. The temporary directories are cleaned up properly.

Thanks for the help!

nick


On Thu, Dec 17, 2009 at 13:38, Nicolas Bock  wrote:

> Ok, I'll give it a try.
>
> Thanks, nick
>
>
>
> On Thu, Dec 17, 2009 at 12:44, Ralph Castain  wrote:
>
>> In case you missed it, this patch should be in the 1.4 nightly tarballs -
>> feel free to test and let me know what you find.
>>
>> Thanks
>> Ralph
>>
>> On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote:
>>
>> That was quick. I will try the patch as soon as you release it.
>>
>> nick
>>
>>
>> On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:
>>
>>> Patch is built and under review...
>>>
>>> Thanks again
>>> Ralph
>>>
>>> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
>>>
>>> Thanks
>>>
>>> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>>>
 Yeah, that's the one all right! Definitely missing from 1.3.x.

 Thanks - I'll build a patch for the next bug-fix release


 On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:

 > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain 
 wrote:
 >> Indeed - that is very helpful! Thanks!
 >> Looks like we aren't cleaning up high enough - missing the directory
 level.
 >> I seem to recall seeing that error go by and that someone fixed it on
 our
 >> devel trunk, so this is likely a repair that didn't get moved over to
 the
 >> release branch as it should have done.
 >> I'll look into it and report back.
 >
 > You are probably referring to
 > https://svn.open-mpi.org/trac/ompi/changeset/21498
 >
 > There was an issue about orte_session_dir_finalize() not
 > cleaning up the session directories properly.
 >
 > Hope that helps.
 >
 > Abhishek
 >
 >> Thanks again
 >> Ralph
 >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
 >>
 >>
 >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain 
 wrote:
 >>>
 >>> Hmmif you are willing to keep trying, could you perhaps let it
 run for
 >>> a brief time, ctrl-z it, and then do an ls on a directory from a
 process
 >>> that has already terminated? The pids will be in order, so just look
 for an
 >>> early number (not mpirun or the parent, of course).
 >>> It would help if you could give us the contents of a directory from
 a
 >>> child process that has terminated - would tell us what subsystem is
 failing
 >>> to properly cleanup.
 >>
 >> Ok, so I Ctrl-Z the master. In
 >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only
 one
 >> directory
 >>
 >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
 >>
 >> I can't find that PID though. mpirun has PID 4230, orted does not
 exist,
 >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
 again,
 >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
 there
 >> are 70 sequentially numbered directories starting at 0. Every
 directory
 >> contains another directory called "0". There is nothing in any of
 those
 >> directories. I see for instance:
 >>
 >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
 >> total 4.0K
 >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
 >>
 >> and
 >>
 >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $
 ls -lh
 >> 70/0/
 >> total 0
 >>
 >> I hope this information helps. Did I understand your question
 correctly?
 >>
 >> nick
 >>
 >> ___
 >> users mailing list
 >> us...@open-mpi.org
 >> http://www.open-mpi.org/mailman/listinfo.cgi/users
 >>
 >> ___
 >> users mailing list
 >> us...@open-mpi.org
 >> http://www.open-mpi.org/mailman/listinfo.cgi/users
 >>
 >
 > ___
 > users mailing list
 > us...@open-mpi.org
 > http://www.open-mpi.org/mailman/listinfo.cgi/users


 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users

>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listi

Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Nicolas Bock
Ok, I'll give it a try.

Thanks, nick


On Thu, Dec 17, 2009 at 12:44, Ralph Castain  wrote:

> In case you missed it, this patch should be in the 1.4 nightly tarballs -
> feel free to test and let me know what you find.
>
> Thanks
> Ralph
>
> On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote:
>
> That was quick. I will try the patch as soon as you release it.
>
> nick
>
>
> On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:
>
>> Patch is built and under review...
>>
>> Thanks again
>> Ralph
>>
>> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
>>
>> Thanks
>>
>> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>>
>>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>>>
>>> Thanks - I'll build a patch for the next bug-fix release
>>>
>>>
>>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>>>
>>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain 
>>> wrote:
>>> >> Indeed - that is very helpful! Thanks!
>>> >> Looks like we aren't cleaning up high enough - missing the directory
>>> level.
>>> >> I seem to recall seeing that error go by and that someone fixed it on
>>> our
>>> >> devel trunk, so this is likely a repair that didn't get moved over to
>>> the
>>> >> release branch as it should have done.
>>> >> I'll look into it and report back.
>>> >
>>> > You are probably referring to
>>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>>> >
>>> > There was an issue about orte_session_dir_finalize() not
>>> > cleaning up the session directories properly.
>>> >
>>> > Hope that helps.
>>> >
>>> > Abhishek
>>> >
>>> >> Thanks again
>>> >> Ralph
>>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>>> >>
>>> >>
>>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>>> >>>
>>> >>> Hmmif you are willing to keep trying, could you perhaps let it
>>> run for
>>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a
>>> process
>>> >>> that has already terminated? The pids will be in order, so just look
>>> for an
>>> >>> early number (not mpirun or the parent, of course).
>>> >>> It would help if you could give us the contents of a directory from a
>>> >>> child process that has terminated - would tell us what subsystem is
>>> failing
>>> >>> to properly cleanup.
>>> >>
>>> >> Ok, so I Ctrl-Z the master. In
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>>> >> directory
>>> >>
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>>> >>
>>> >> I can't find that PID though. mpirun has PID 4230, orted does not
>>> exist,
>>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
>>> again,
>>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
>>> there
>>> >> are 70 sequentially numbered directories starting at 0. Every
>>> directory
>>> >> contains another directory called "0". There is nothing in any of
>>> those
>>> >> directories. I see for instance:
>>> >>
>>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>>> >> total 4.0K
>>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>>> >>
>>> >> and
>>> >>
>>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls
>>> -lh
>>> >> 70/0/
>>> >> total 0
>>> >>
>>> >> I hope this information helps. Did I understand your question
>>> correctly?
>>> >>
>>> >> nick
>>> >>
>>> >> ___
>>> >> users mailing list
>>> >> us...@open-mpi.org
>>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> >>
>>> >> ___
>>> >> users mailing list
>>> >> us...@open-mpi.org
>>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> >>
>>> >
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Ralph Castain
In case you missed it, this patch should be in the 1.4 nightly tarballs - feel 
free to test and let me know what you find.

Thanks
Ralph

On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote:

> That was quick. I will try the patch as soon as you release it.
> 
> nick
> 
> 
> On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:
> Patch is built and under review...
> 
> Thanks again
> Ralph
> 
> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
> 
>> Thanks
>> 
>> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>> 
>> Thanks - I'll build a patch for the next bug-fix release
>> 
>> 
>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>> 
>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
>> >> Indeed - that is very helpful! Thanks!
>> >> Looks like we aren't cleaning up high enough - missing the directory 
>> >> level.
>> >> I seem to recall seeing that error go by and that someone fixed it on our
>> >> devel trunk, so this is likely a repair that didn't get moved over to the
>> >> release branch as it should have done.
>> >> I'll look into it and report back.
>> >
>> > You are probably referring to
>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>> >
>> > There was an issue about orte_session_dir_finalize() not
>> > cleaning up the session directories properly.
>> >
>> > Hope that helps.
>> >
>> > Abhishek
>> >
>> >> Thanks again
>> >> Ralph
>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>> >>
>> >>
>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>> >>>
>> >>> Hmmif you are willing to keep trying, could you perhaps let it run 
>> >>> for
>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a process
>> >>> that has already terminated? The pids will be in order, so just look for 
>> >>> an
>> >>> early number (not mpirun or the parent, of course).
>> >>> It would help if you could give us the contents of a directory from a
>> >>> child process that has terminated - would tell us what subsystem is 
>> >>> failing
>> >>> to properly cleanup.
>> >>
>> >> Ok, so I Ctrl-Z the master. In
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>> >> directory
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>> >>
>> >> I can't find that PID though. mpirun has PID 4230, orted does not exist,
>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
>> >> are 70 sequentially numbered directories starting at 0. Every directory
>> >> contains another directory called "0". There is nothing in any of those
>> >> directories. I see for instance:
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>> >> total 4.0K
>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>> >>
>> >> and
>> >>
>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
>> >> 70/0/
>> >> total 0
>> >>
>> >> I hope this information helps. Did I understand your question correctly?
>> >>
>> >> nick
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
On Fri, Dec 4, 2009 at 12:10, Eugene Loh  wrote:

>  Nicolas Bock wrote:
>
> On Fri, Dec 4, 2009 at 10:29, Eugene Loh  wrote:
>
>> I think you might observe a world of difference if the master issued some
>> non-blocking call and then intermixed MPI_Test calls with sleep calls.  You
>> should see *much* more subservient behavior.  As I remember, putting such
>> passivity into OMPI is on somebody's to-do list, but just not very high.
>>
>
> could you give me more details?
>
> Nope, sorry.  I'm going to drop out here.  The basic idea was something
> like:
>
> MPI_Irecv();
> while ( !flag ) {
>   nanosleep(...);
>   MPI_Test(...&flag...);
> }
>
> but I was hoping to "leave the rest to the reader".
>
>
HI Eugene,

thanks for the help. I think I got it now:

master.c:

MPI_Irecv(buffer+buffer_index, 1, MPI_CHAR, MPI_ANY_SOURCE, 0, spawn,
request+buffer_index);

and slave.c

MPI_Send(&buffer, 1, MPI_CHAR, 0, 0, spawn);

That seems to do the trick. Since our "slave" processes are expected to have
rather long runtimes, the sleep statement in master is simply

sleep(2);

to sleep 2 seconds. The load on the master process is basically zero now.
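
Put together, the master-side polling loop looks roughly like this (array
handling simplified; the two-second interval is the one mentioned above, and
'spawn' is the inter-communicator returned by MPI_Comm_spawn):

#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

/* Post one non-blocking receive per spawned slave, then test-and-sleep
 * until all of them have reported in. */
static void wait_for_slaves(MPI_Comm spawn, int nslaves)
{
    char *buffer = malloc(nslaves);
    MPI_Request *request = malloc(nslaves * sizeof(MPI_Request));
    int i, all_done = 0;

    for (i = 0; i < nslaves; i++)
        MPI_Irecv(&buffer[i], 1, MPI_CHAR, MPI_ANY_SOURCE, 0, spawn,
                  &request[i]);

    while (!all_done) {
        sleep(2);                      /* master stays essentially idle */
        MPI_Testall(nslaves, request, &all_done, MPI_STATUSES_IGNORE);
    }

    free(request);
    free(buffer);
}

Each slave just sends its single byte with MPI_Send(&c, 1, MPI_CHAR, 0, 0,
parent) on the communicator obtained from MPI_Comm_get_parent() when its
work is finished.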

Thanks again for your help,

nick


Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Eugene Loh




Nicolas Bock wrote:
On Fri, Dec 4, 2009 at 10:29, Eugene Loh 
wrote:
  
  
I think you might observe a
world of difference if the master issued
some non-blocking call and then intermixed MPI_Test calls with sleep
calls.  You should see *much* more subservient behavior.  As I
remember, putting such passivity into OMPI is on somebody's to-do list,
but just not very high.

  
  
could you give me more details?
  

Nope, sorry.  I'm going to drop out here.  The basic idea was something
like:

MPI_Irecv();
while ( !flag ) {
  nanosleep(...);
  MPI_Test(...&flag...);
}

but I was hoping to "leave the rest to the reader".

  
  I can't figure out how to do this. I could see that one way to
implement what you are describing is:
  
in slave.c:
MPI_Send() to rank 0
  
in master.c
MPI_IRecv() from the spawned processes
while (1) {  MPI_Test(); }
  
I can't figure out how to find the ranks that MPI_Comm_spawn() used.
What's the source argument in MPI_IRecv() supposed to be?
  
  





Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
On Fri, Dec 4, 2009 at 10:29, Eugene Loh  wrote:

>  Nicolas Bock wrote:
>
> On Fri, Dec 4, 2009 at 10:10, Eugene Loh  wrote:
>
>> Yield helped, but not as effectively as one might have imagined.
>>
>
> Yes, that's the impression I get as well, the master process might be
> yielding, but it doesn't appear to be a lot. Maybe I should do this
> differently to avoid this CPU usage in master. All I really want is to
> execute another process somewhere on a free node in my MPI universe, wait
> for it to be done and go on. From my limited understanding of MPI,
> MPI_Comm_spawn() and MPI_Barrier() seemed just like what I needed, but as I
> said, maybe there are other ways to do this.
>
> I think you might observe a world of difference if the master issued some
> non-blocking call and then intermixed MPI_Test calls with sleep calls.  You
> should see *much* more subservient behavior.  As I remember, putting such
> passivity into OMPI is on somebody's to-do list, but just not very high.
>

Hi Eugene,

could you give me more details? I can't figure out how to do this. I could
see that one way to implement what you are describing is:

in slave.c:
MPI_Send() to rank 0

in master.c
MPI_IRecv() from the spawned processes
while (1) {  MPI_Test(); }

I can't figure out how to find the ranks that MPI_Comm_spawn() used. What's
the source argument in MPI_IRecv() supposed to be?

Thanks, nick


Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Eugene Loh




Nicolas Bock wrote:

  On Fri, Dec 4, 2009 at 10:10, Eugene Loh 
wrote:
  
Yield helped, but
not as effectively as one might have imagined.

  
  
Yes, that's the impression I get as well, the master process might be
yielding, but it doesn't appear to be a lot. Maybe I should do this
differently to avoid this CPU usage in master. All I really want is to
execute another process somewhere on a free node in my MPI universe,
wait for it to be done and go on. From my limited understanding of MPI,
MPI_Comm_spawn() and MPI_Barrier() seemed just like what I needed, but
as I said, maybe there are other ways to do this.
  
  

I think you might observe a world of difference if the master issued
some non-blocking call and then intermixed MPI_Test calls with sleep
calls.  You should see *much* more subservient behavior.  As I
remember, putting such passivity into OMPI is on somebody's to-do list,
but just not very high.




Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
On Fri, Dec 4, 2009 at 10:10, Eugene Loh  wrote:

>  Nicolas Bock wrote:
>
> On Fri, Dec 4, 2009 at 08:21, Ralph Castain  wrote:
>
>> You used it correctly. Remember, all that cpu number is telling you is the
>> percentage of use by that process. So bottom line is: we are releasing it as
>> much as we possibly can, but no other process wants to use the cpu, so we go
>> ahead and use it.
>>
>>  If any other process wanted it, then the percentage would drop and the
>> other proc would take some.
>>
>  When you say "the other proc would take some", how much do you expect it
> to take? In my case above, the master process does not appear to yield a
> whole lot. Can I reduce the polling frequency? I know that my slave
> processes typically run several minutes to hours.
>
> In my (limited) experience, the situation is a little of both.  OMPI is
> yielding.  Yielding makes a difference only if someone else wants the CPU.
> But even if someone else wants the CPU, OMPI in yield mode will still be
> consuming cycles.  It's like the way I drive a car.  When I approach a stop
> sign, I slow down -- really, officer, I do -- and if there is cross traffic
> I let it go by ahead of me.  But if there is no cross traffic, I, ahem,
> proceed expediently.  And, even if there is cross traffic, their progress is
> still impacted by me -- heck, I'm all for obeying stop signs and all, but
> I'm no doormat.  OMPI processes can yield, but they only check to yield
> every now and then.  Between checks, they are not timid processes, even if
> other processes are waiting to run.  I once had some numbers on this.  Yield
> helped, but not as effectively as one might have imagined.
>

Yes, that's the impression I get as well, the master process might be
yielding, but it doesn't appear to be a lot. Maybe I should do this
differently to avoid this CPU usage in master. All I really want is to
execute another process somewhere on a free node in my MPI universe, wait
for it to be done and go on. From my limited understanding of MPI,
MPI_Comm_spawn() and MPI_Barrier() seemed just like what I needed, but as I
said, maybe there are other ways to do this.

nick


Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Eugene Loh




Nicolas Bock wrote:

  On Fri, Dec 4, 2009 at 08:21, Ralph Castain 
wrote:
  
You used it correctly. Remember, all that cpu number
is telling you is the percentage of use by that process. So bottom line
is: we are releasing it as much as we possibly can, but no other
process wants to use the cpu, so we go ahead and use it.


If any other process wanted it, then the percentage would drop
and the other proc would take some.

  
  
When you say "the other proc would take some", how much do you expect
it to take? In my case above, the master process does not appear to
yield a whole lot. Can I reduce the polling frequency? I know that my
slave processes typically run several minutes to hours.

In my (limited) experience, the situation is a little of both.  OMPI is
yielding.  Yielding makes a difference only if someone else wants the
CPU.  But even if someone else wants the CPU, OMPI in yield mode will
still be consuming cycles.  It's like the way I drive a car.  When I
approach a stop sign, I slow down -- really, officer, I do -- and if
there is cross traffic I let it go by ahead of me.  But if there is no
cross traffic, I, ahem, proceed expediently.  And, even if there is
cross traffic, their progress is still impacted by me -- heck, I'm all
for obeying stop signs and all, but I'm no doormat.  OMPI processes can
yield, but they only check to yield every now and then.  Between
checks, they are not timid processes, even if other processes are
waiting to run.  I once had some numbers on this.  Yield helped, but
not as effectively as one might have imagined.




Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
On Fri, Dec 4, 2009 at 08:21, Ralph Castain  wrote:

> You used it correctly. Remember, all that cpu number is telling you is the
> percentage of use by that process. So bottom line is: we are releasing it as
> much as we possibly can, but no other process wants to use the cpu, so we go
> ahead and use it.
>
> If any other process wanted it, then the percentage would drop and the
> other proc would take some.
>
>
> I have a quadcore CPU, so when I run with "-np 4" I get this

nbock  25699  0.0  0.0  53980  2312 pts/2  S+  08:23  0:00 /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 4 --mca mpi_yield_when_idle 1 ./master
nbock  25700 71.0  0.0 158964  3876 pts/2  R+  08:23  0:45 ./master
nbock  25701  0.0  0.0 158960  3804 pts/2  S+  08:23  0:00 ./master
nbock  25702  0.0  0.0 158960  3804 pts/2  S+  08:23  0:00 ./master
nbock  25703  0.0  0.0 158960  3804 pts/2  S+  08:23  0:00 ./master
nbock  25704 76.1  0.0 158964  3900 pts/2  R+  08:23  0:47 ./slave arg1 arg2
nbock  25705 81.3  0.0 158964  3904 pts/2  R+  08:23  0:51 ./slave arg1 arg2
nbock  25706 79.2  0.0 158964  3904 pts/2  R+  08:23  0:49 ./slave arg1 arg2
nbock  25707 77.4  0.0 158964  3908 pts/2  R+  08:23  0:48 ./slave arg1 arg2

When you say "the other proc would take some", how much do you expect it to
take? In my case above, the master process does not appear to yield a whole
lot. Can I reduce the polling frequency? I know that my slave processes
typically run several minutes to hours.

nick


Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Ralph Castain
You used it correctly. Remember, all that cpu number is telling you is the 
percentage of use by that process. So bottom line is: we are releasing it as 
much as we possibly can, but no other process wants to use the cpu, so we go 
ahead and use it.

If any other process wanted it, then the percentage would drop and the other 
proc would take some.

On Dec 4, 2009, at 8:13 AM, Nicolas Bock wrote:

> On Fri, Dec 4, 2009 at 08:03, Ralph Castain  wrote:
> 
> 
> It is polling at the barrier. This is done aggressively by default for 
> performance. You can tell it to be less aggressive if you want via the 
> yield_when_idle mca param.
> 
> 
> How do I use this parameter correctly? I tried
> 
> /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 --mca mpi_yield_when_idle 
> 1 ./master
> 
> but still get
> 
> nbock20794  0.0  0.0  53980  2344 pts/2S+   08:11   0:00 
> /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 --mca mpi_yield_when_idle 
> 1 ./master
> nbock20795 96.7  0.0 159096  3896 pts/2R+   08:11   1:10 ./master
> nbock20796  0.0  0.0 158960  3804 pts/2S+   08:11   0:00 ./master
> nbock20797  0.0  0.0 158960  3804 pts/2S+   08:11   0:00 ./master
> nbock20813 88.1  0.0 158964  3892 pts/2R+   08:12   0:08 ./slave arg1 
> arg2
> nbock20814 86.9  0.0 158964  3908 pts/2R+   08:12   0:08 ./slave arg1 
> arg2
> nbock20815 87.5  0.0 158964  3900 pts/2R+   08:12   0:08 ./slave arg1 
> arg2
> 
> nick
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
On Fri, Dec 4, 2009 at 08:03, Ralph Castain  wrote:

>
>
> It is polling at the barrier. This is done aggressively by default for
> performance. You can tell it to be less aggressive if you want via the
> yield_when_idle mca param.
>
>
How do I use this parameter correctly? I tried

/usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 --mca
mpi_yield_when_idle 1 ./master

but still get

nbock  20794  0.0  0.0  53980  2344 pts/2  S+  08:11  0:00 /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 --mca mpi_yield_when_idle 1 ./master
nbock  20795 96.7  0.0 159096  3896 pts/2  R+  08:11  1:10 ./master
nbock  20796  0.0  0.0 158960  3804 pts/2  S+  08:11  0:00 ./master
nbock  20797  0.0  0.0 158960  3804 pts/2  S+  08:11  0:00 ./master
nbock  20813 88.1  0.0 158964  3892 pts/2  R+  08:12  0:08 ./slave arg1 arg2
nbock  20814 86.9  0.0 158964  3908 pts/2  R+  08:12  0:08 ./slave arg1 arg2
nbock  20815 87.5  0.0 158964  3900 pts/2  R+  08:12  0:08 ./slave arg1 arg2

nick


Re: [OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Ralph Castain

On Dec 4, 2009, at 7:46 AM, Nicolas Bock wrote:

> Hello list,
> 
> when I run the attached example, which spawns a "slave" process with 
> MPI_Comm_spawn(), I see the following:
> 
> nbock19911  0.0  0.0  53980  2288 pts/0S+   07:42   0:00 
> /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 ./master
> nbock19912 92.1  0.0 158964  3868 pts/0R+   07:42   0:23 ./master
> nbock19913  0.0  0.0 158960  3812 pts/0S+   07:42   0:00 ./master
> nbock19914  0.0  0.0 158960  3800 pts/0S+   07:42   0:00 ./master
> nbock19929 91.1  0.0 158964  3896 pts/0R+   07:42   0:20 ./slave arg1 
> arg2
> nbock19930 95.8  0.0 158964  3900 pts/0R+   07:42   0:22 ./slave arg1 
> arg2
> nbock19931 94.7  0.0 158964  3896 pts/0R+   07:42   0:21 ./slave arg1 
> arg2
> 
> The third column is the CPU usage according to top. I notice 3 master 
> processes, which I attribute to the fact that MPI_Comm_spawn really fork()s 
> and then spawns, but that's my uneducated guess.

Ummm... if you look at your cmd line

/usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 ./master

you will see that you specified that 3 copies of master be run

:-)

> What I don't understand is why PID 19912 is using any CPU resources at all. 
> It's supposed to be waiting at the MPI_Barrier() for the slaves to finish. 
> What is PID 19912 doing?

It is polling at the barrier. This is done aggressively by default for 
performance. You can tell it to be less aggressive if you want via the 
yield_when_idle mca param.

> 
> Some more information:
> 
> $ uname -a
> Linux mujo 2.6.31-gentoo-r6 #2 SMP PREEMPT Fri Dec 4 07:08:07 MST 2009 x86_64 
> Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux
> 
> openmpi version 1.3.4
> gcc version 4.4.2
> 
> nick
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] MPI_Comm_spawn, caller uses CPU while waiting for spawned processes

2009-12-04 Thread Nicolas Bock
Hello list,

when I run the attached example, which spawns a "slave" process with
MPI_Comm_spawn(), I see the following:

nbock  19911  0.0  0.0  53980  2288 pts/0  S+  07:42  0:00 /usr/local/openmpi-1.3.4-gcc-4.4.2/bin/mpirun -np 3 ./master
nbock  19912 92.1  0.0 158964  3868 pts/0  R+  07:42  0:23 ./master
nbock  19913  0.0  0.0 158960  3812 pts/0  S+  07:42  0:00 ./master
nbock  19914  0.0  0.0 158960  3800 pts/0  S+  07:42  0:00 ./master
nbock  19929 91.1  0.0 158964  3896 pts/0  R+  07:42  0:20 ./slave arg1 arg2
nbock  19930 95.8  0.0 158964  3900 pts/0  R+  07:42  0:22 ./slave arg1 arg2
nbock  19931 94.7  0.0 158964  3896 pts/0  R+  07:42  0:21 ./slave arg1 arg2

The third column is the CPU usage according to top. I notice 3 master
processes, which I attribute to the fact that MPI_Comm_spawn really fork()s
and then spawns, but that's my uneducated guess. What I don't understand is
why PID 19912 is using any CPU resources at all. It's supposed to be waiting
at the MPI_Barrier() for the slaves to finish. What is PID 19912 doing?

Some more information:

$ uname -a
Linux mujo 2.6.31-gentoo-r6 #2 SMP PREEMPT Fri Dec 4 07:08:07 MST 2009
x86_64 Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux

openmpi version 1.3.4
gcc version 4.4.2

nick
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main (int argc, char **argv)
{
  int rank;
  int size;
  int *error_codes;
  int spawn_counter = 0;
  char *slave_argv[] = { "arg1", "arg2", 0 };
  MPI_Comm spawn;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0)
  {
printf("[master] running on %i processors\n", size);

while (1)
{
  printf("[master] (%i) forking processes\n", spawn_counter++);
  error_codes = (int*) malloc(sizeof(int)*size);
  MPI_Comm_spawn("./slave", slave_argv, size, MPI_INFO_NULL, 0, MPI_COMM_SELF, &spawn, error_codes);
  printf("[master] waiting at barrier\n");
  MPI_Barrier(spawn);
  free(error_codes);
}
  }

  MPI_Finalize();
}
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
  int rank;
  int size;
  int i, j;
  double temp;
  MPI_Comm spawn;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("[slave %i] working\n", rank);
  for (i = 0; i < 1; ++i) {
for (j = 0; j < 50; ++j)
{
  temp = rand();
}
  }

  printf("[slave %i] waiting at barrier\n", rank);
  MPI_Comm_get_parent(&spawn);
  MPI_Barrier(spawn);

  MPI_Finalize();
}


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-03 Thread Nicolas Bock
That was quick. I will try the patch as soon as you release it.

nick


On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:

> Patch is built and under review...
>
> Thanks again
> Ralph
>
> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
>
> Thanks
>
> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>
>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>>
>> Thanks - I'll build a patch for the next bug-fix release
>>
>>
>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>>
>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
>> >> Indeed - that is very helpful! Thanks!
>> >> Looks like we aren't cleaning up high enough - missing the directory
>> level.
>> >> I seem to recall seeing that error go by and that someone fixed it on
>> our
>> >> devel trunk, so this is likely a repair that didn't get moved over to
>> the
>> >> release branch as it should have done.
>> >> I'll look into it and report back.
>> >
>> > You are probably referring to
>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>> >
>> > There was an issue about orte_session_dir_finalize() not
>> > cleaning up the session directories properly.
>> >
>> > Hope that helps.
>> >
>> > Abhishek
>> >
>> >> Thanks again
>> >> Ralph
>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>> >>
>> >>
>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>> >>>
>> >>> Hmm... if you are willing to keep trying, could you perhaps let it run
>> for
>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a
>> process
>> >>> that has already terminated? The pids will be in order, so just look
>> for an
>> >>> early number (not mpirun or the parent, of course).
>> >>> It would help if you could give us the contents of a directory from a
>> >>> child process that has terminated - would tell us what subsystem is
>> failing
>> >>> to properly cleanup.
>> >>
>> >> Ok, so I Ctrl-Z the master. In
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>> >> directory
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>> >>
>> >> I can't find that PID though. mpirun has PID 4230, orted does not
>> exist,
>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
>> again,
>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
>> there
>> >> are 70 sequentially numbered directories starting at 0. Every directory
>> >> contains another directory called "0". There is nothing in any of those
>> >> directories. I see for instance:
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>> >> total 4.0K
>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>> >>
>> >> and
>> >>
>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls
>> -lh
>> >> 70/0/
>> >> total 0
>> >>
>> >> I hope this information helps. Did I understand your question
>> correctly?
>> >>
>> >> nick
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Ralph Castain
Patch is built and under review...

Thanks again
Ralph

On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:

> Thanks
> 
> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
> Yeah, that's the one all right! Definitely missing from 1.3.x.
> 
> Thanks - I'll build a patch for the next bug-fix release
> 
> 
> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
> 
> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
> >> Indeed - that is very helpful! Thanks!
> >> Looks like we aren't cleaning up high enough - missing the directory level.
> >> I seem to recall seeing that error go by and that someone fixed it on our
> >> devel trunk, so this is likely a repair that didn't get moved over to the
> >> release branch as it should have done.
> >> I'll look into it and report back.
> >
> > You are probably referring to
> > https://svn.open-mpi.org/trac/ompi/changeset/21498
> >
> > There was an issue about orte_session_dir_finalize() not
> > cleaning up the session directories properly.
> >
> > Hope that helps.
> >
> > Abhishek
> >
> >> Thanks again
> >> Ralph
> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
> >>
> >>
> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
> >>>
> >>> Hmm... if you are willing to keep trying, could you perhaps let it run for
> >>> a brief time, ctrl-z it, and then do an ls on a directory from a process
> >>> that has already terminated? The pids will be in order, so just look for 
> >>> an
> >>> early number (not mpirun or the parent, of course).
> >>> It would help if you could give us the contents of a directory from a
> >>> child process that has terminated - would tell us what subsystem is 
> >>> failing
> >>> to properly cleanup.
> >>
> >> Ok, so I Ctrl-Z the master. In
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
> >> directory
> >>
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
> >>
> >> I can't find that PID though. mpirun has PID 4230, orted does not exist,
> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
> >> are 70 sequentially numbered directories starting at 0. Every directory
> >> contains another directory called "0". There is nothing in any of those
> >> directories. I see for instance:
> >>
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
> >> total 4.0K
> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
> >>
> >> and
> >>
> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
> >> 70/0/
> >> total 0
> >>
> >> I hope this information helps. Did I understand your question correctly?
> >>
> >> nick
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Nicolas Bock
Thanks

On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:

> Yeah, that's the one all right! Definitely missing from 1.3.x.
>
> Thanks - I'll build a patch for the next bug-fix release
>
>
> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>
> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
> >> Indeed - that is very helpful! Thanks!
> >> Looks like we aren't cleaning up high enough - missing the directory
> level.
> >> I seem to recall seeing that error go by and that someone fixed it on
> our
> >> devel trunk, so this is likely a repair that didn't get moved over to
> the
> >> release branch as it should have done.
> >> I'll look into it and report back.
> >
> > You are probably referring to
> > https://svn.open-mpi.org/trac/ompi/changeset/21498
> >
> > There was an issue about orte_session_dir_finalize() not
> > cleaning up the session directories properly.
> >
> > Hope that helps.
> >
> > Abhishek
> >
> >> Thanks again
> >> Ralph
> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
> >>
> >>
> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
> >>>
> >>> Hmm... if you are willing to keep trying, could you perhaps let it run
> for
> >>> a brief time, ctrl-z it, and then do an ls on a directory from a
> process
> >>> that has already terminated? The pids will be in order, so just look
> for an
> >>> early number (not mpirun or the parent, of course).
> >>> It would help if you could give us the contents of a directory from a
> >>> child process that has terminated - would tell us what subsystem is
> failing
> >>> to properly cleanup.
> >>
> >> Ok, so I Ctrl-Z the master. In
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
> >> directory
> >>
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
> >>
> >> I can't find that PID though. mpirun has PID 4230, orted does not exist,
> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
> again,
> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
> there
> >> are 70 sequentially numbered directories starting at 0. Every directory
> >> contains another directory called "0". There is nothing in any of those
> >> directories. I see for instance:
> >>
> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
> >> total 4.0K
> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
> >>
> >> and
> >>
> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls
> -lh
> >> 70/0/
> >> total 0
> >>
> >> I hope this information helps. Did I understand your question correctly?
> >>
> >> nick
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Ralph Castain
Yeah, that's the one all right! Definitely missing from 1.3.x.

Thanks - I'll build a patch for the next bug-fix release


On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:

> On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
>> Indeed - that is very helpful! Thanks!
>> Looks like we aren't cleaning up high enough - missing the directory level.
>> I seem to recall seeing that error go by and that someone fixed it on our
>> devel trunk, so this is likely a repair that didn't get moved over to the
>> release branch as it should have done.
>> I'll look into it and report back.
> 
> You are probably referring to
> https://svn.open-mpi.org/trac/ompi/changeset/21498
> 
> There was an issue about orte_session_dir_finalize() not
> cleaning up the session directories properly.
> 
> Hope that helps.
> 
> Abhishek
> 
>> Thanks again
>> Ralph
>> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>> 
>> 
>> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>>> 
>>> Hmm... if you are willing to keep trying, could you perhaps let it run for
>>> a brief time, ctrl-z it, and then do an ls on a directory from a process
>>> that has already terminated? The pids will be in order, so just look for an
>>> early number (not mpirun or the parent, of course).
>>> It would help if you could give us the contents of a directory from a
>>> child process that has terminated - would tell us what subsystem is failing
>>> to properly cleanup.
>> 
>> Ok, so I Ctrl-Z the master. In
>> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>> directory
>> 
>> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>> 
>> I can't find that PID though. mpirun has PID 4230, orted does not exist,
>> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
>> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
>> are 70 sequentially numbered directories starting at 0. Every directory
>> contains another directory called "0". There is nothing in any of those
>> directories. I see for instance:
>> 
>> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>> total 4.0K
>> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>> 
>> and
>> 
>> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
>> 70/0/
>> total 0
>> 
>> I hope this information helps. Did I understand your question correctly?
>> 
>> nick
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Abhishek Kulkarni
On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
> Indeed - that is very helpful! Thanks!
> Looks like we aren't cleaning up high enough - missing the directory level.
> I seem to recall seeing that error go by and that someone fixed it on our
> devel trunk, so this is likely a repair that didn't get moved over to the
> release branch as it should have done.
> I'll look into it and report back.

You are probably referring to
https://svn.open-mpi.org/trac/ompi/changeset/21498

There was an issue about orte_session_dir_finalize() not
cleaning up the session directories properly.

Hope that helps.

Abhishek

> Thanks again
> Ralph
> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>
>
> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>>
>> Hmm... if you are willing to keep trying, could you perhaps let it run for
>> a brief time, ctrl-z it, and then do an ls on a directory from a process
>> that has already terminated? The pids will be in order, so just look for an
>> early number (not mpirun or the parent, of course).
>> It would help if you could give us the contents of a directory from a
>> child process that has terminated - would tell us what subsystem is failing
>> to properly cleanup.
>
> Ok, so I Ctrl-Z the master. In
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
> directory
>
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>
> I can't find that PID though. mpirun has PID 4230, orted does not exist,
> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
> are 70 sequentially numbered directories starting at 0. Every directory
> contains another directory called "0". There is nothing in any of those
> directories. I see for instance:
>
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
> total 4.0K
> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>
> and
>
> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
> 70/0/
> total 0
>
> I hope this information helps. Did I understand your question correctly?
>
> nick
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Ralph Castain
Indeed - that is very helpful! Thanks!

Looks like we aren't cleaning up high enough - missing the directory level. I 
seem to recall seeing that error go by and that someone fixed it on our devel 
trunk, so this is likely a repair that didn't get moved over to the release 
branch as it should have done.

I'll look into it and report back.

Thanks again
Ralph

On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:

> 
> 
> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
> Hmm... if you are willing to keep trying, could you perhaps let it run for a 
> brief time, ctrl-z it, and then do an ls on a directory from a process that 
> has already terminated? The pids will be in order, so just look for an early 
> number (not mpirun or the parent, of course).
> 
> It would help if you could give us the contents of a directory from a child 
> process that has terminated - would tell us what subsystem is failing to 
> properly cleanup.
> 
> Ok, so I Ctrl-Z the master. In  
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one 
> directory
> 
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
> 
> I can't find that PID though. mpirun has PID 4230, orted does not exist, 
> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again, 
> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there are 
> 70 sequentially numbered directories starting at 0. Every directory contains 
> another directory called "0". There is nothing in any of those directories. I 
> see for instance:
> 
> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
> total 4.0K
> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
> 
> and
> 
> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 
> 70/0/
> total 0
> 
> I hope this information helps. Did I understand your question correctly?
> 
> nick
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Nicolas Bock
On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:

> Hmm... if you are willing to keep trying, could you perhaps let it run for
> a brief time, ctrl-z it, and then do an ls on a directory from a process
> that has already terminated? The pids will be in order, so just look for an
> early number (not mpirun or the parent, of course).
>
> It would help if you could give us the contents of a directory from a child
> process that has terminated - would tell us what subsystem is failing to
> properly cleanup.
>

Ok, so I Ctrl-Z the master. In
/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
directory

/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857

I can't find that PID though. mpirun has PID 4230, orted does not exist,
master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again,
slave has a different PID as expected. I Ctrl-Z'ed in iteration 68, there
are 70 sequentially numbered directories starting at 0. Every directory
contains another directory called "0". There is nothing in any of those
directories. I see for instance:

/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
total 4.0K
drwx-- 2 nbock users 4.0K Dec  2 14:41 0

and

nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh
70/0/
total 0

I hope this information helps. Did I understand your question correctly?

nick


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Ralph Castain
Hmm... if you are willing to keep trying, could you perhaps let it run for a 
brief time, ctrl-z it, and then do an ls on a directory from a process that has 
already terminated? The pids will be in order, so just look for an early number 
(not mpirun or the parent, of course).

It would help if you could give us the contents of a directory from a child 
process that has terminated - would tell us what subsystem is failing to 
properly cleanup.

Thanks - and sorry for the problem.

On Dec 2, 2009, at 2:11 PM, Nicolas Bock wrote:

> 
> 
> On Wed, Dec 2, 2009 at 12:12, Ralph Castain  wrote:
> 
> On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:
> 
>> 
>> 
>> On Tue, Dec 1, 2009 at 20:58, Nicolas Bock  wrote:
>> 
>> 
>> On Tue, Dec 1, 2009 at 18:03, Ralph Castain  wrote:
>> You may want to check your limits as defined by the shell/system. I can also 
>> run this for as long as I'm willing to let it run, so something else appears 
>> to be going on.
>> 
>> 
>> 
>> Is that with 1.3.3? I found that with 1.3.4 I can run the example much 
>> longer until I hit this error message:
>> 
>> 
>> [master] (31996) forking processes
>> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the 
>> sub-directory 
>> (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998) of 
>> (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir 
>> failed [1]
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> util/session_dir.c at line 101
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> util/session_dir.c at line 425
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
>> base/ess_base_std_app.c at line 132
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_session_dir failed
>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>> 
>> 
>> After some googling I found that this is apparently an ext3 filesystem 
>> limitation, i.e. there can be only 31998 subdirectories in a directory. Why 
>> is openmpi creating all of these directories in the first place? Is there a 
>> way to "recycle" them?
> 
> The session directories are built to house shared memory backing files, plus 
> other potential entries depending upon options. They should be deleted upon 
> finalize of each process, so you shouldn't be running out of them.
> 
> I can check to see that the code is cleaning them out (or at least, 
> attempting to do so). Not sure if there is something about ext3 that might 
> retain the directory entries until the "parent" process terminates, even 
> though the files have been deleted.
> 
> If you do an ls on the directory tree, do you see 32k subdirectories? Or do 
> you only see the ones for the active processes?
> 
> That's a good point. As the master process is running I can see the directory 
> fill up. When I Ctrl-C the master, the directory completely disappears. When 
> I let it run all the way to 32K directories, the directory does not disappear 
> and contains 32K directories even after master gets killed by MPI.
> 
> Some process must not be closing some file in these directories which would 
> prevent them from being unlinked, if I understand ext3 correctly.
> 
> nick
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Nicolas Bock
On Wed, Dec 2, 2009 at 12:12, Ralph Castain  wrote:

>
> On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:
>
>
>
> On Tue, Dec 1, 2009 at 20:58, Nicolas Bock  wrote:
>
>>
>>
>> On Tue, Dec 1, 2009 at 18:03, Ralph Castain  wrote:
>>
>>> You may want to check your limits as defined by the shell/system. I can
>>> also run this for as long as I'm willing to let it run, so something else
>>> appears to be going on.
>>>
>>>
>>>
>> Is that with 1.3.3? I found that with 1.3.4 I can run the example much
>> longer until I hit this error message:
>>
>>
>> [master] (31996) forking processes
>> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the
>> sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998)
>> of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0),
>> mkdir failed [1]
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 101
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 425
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> base/ess_base_std_app.c at line 132
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_session_dir failed
>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>>
>>
> After some googling I found that this is apparently an ext3 filesystem
> limitation, i.e. there can be only 31998 subdirectories in a directory. Why
> is openmpi creating all of these directories in the first place? Is there a
> way to "recycle" them?
>
>
> The session directories are built to house shared memory backing files,
> plus other potential entries depending upon options. They should be deleted
> upon finalize of each process, so you shouldn't be running out of them.
>
> I can check to see that the code is cleaning them out (or at least,
> attempting to do so). Not sure if there is something about ext3 that might
> retain the directory entries until the "parent" process terminates, even
> though the files have been deleted.
>
> If you do an ls on the directory tree, do you see 32k subdirectories? Or do
> you only see the ones for the active processes?
>
> That's a good point. As the master process is running I can see the
directory fill up. When I Ctrl-C the master, the directory completely
disappears. When I let it run all the way to 32K directories, the directory
does not disappear and contains 32K directories even after master gets
killed by MPI.

Some process must not be closing some file in these directories which would
prevent them from being unlinked, if I understand ext3 correctly.
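
A hedged way to check that hypothesis, assuming lsof is available, is to list
any processes still holding files open under the session tree:

$ lsof +D /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0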

nick


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Ralph Castain

On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:

> 
> 
> On Tue, Dec 1, 2009 at 20:58, Nicolas Bock  wrote:
> 
> 
> On Tue, Dec 1, 2009 at 18:03, Ralph Castain  wrote:
> You may want to check your limits as defined by the shell/system. I can also 
> run this for as long as I'm willing to let it run, so something else appears 
> to be going on.
> 
> 
> 
> Is that with 1.3.3? I found that with 1.3.4 I can run the example much longer 
> until I hit this error message:
> 
> 
> [master] (31996) forking processes
> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the 
> sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998) 
> of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir 
> failed [1]
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
> util/session_dir.c at line 101
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
> util/session_dir.c at line 425
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_app.c at line 132
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_session_dir failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
> 
> 
> After some googling I found that this is apparently an ext3 filesystem 
> limitation, i.e. there can be only 31998 subdirectories in a directory. Why 
> is openmpi creating all of these directories in the first place? Is there a 
> way to "recycle" them?

The session directories are built to house shared memory backing files, plus 
other potential entries depending upon options. They should be deleted upon 
finalize of each process, so you shouldn't be running out of them.

I can check to see that the code is cleaning them out (or at least, attempting 
to do so). Not sure if there is something about ext3 that might retain the 
directory entries until the "parent" process terminates, even though the files 
have been deleted.

If you do an ls on the directory tree, do you see 32k subdirectories? Or do you 
only see the ones for the active processes?
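
For example, counting the entries under the session directory path shown in
the error output earlier in this thread:

$ ls /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386 | wc -l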


> 
> nick
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-02 Thread Nicolas Bock
On Tue, Dec 1, 2009 at 20:58, Nicolas Bock  wrote:

>
>
> On Tue, Dec 1, 2009 at 18:03, Ralph Castain  wrote:
>
>> You may want to check your limits as defined by the shell/system. I can
>> also run this for as long as I'm willing to let it run, so something else
>> appears to be going on.
>>
>>
>>
> Is that with 1.3.3? I found that with 1.3.4 I can run the example much
> longer until I hit this error message:
>
>
> [master] (31996) forking processes
> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the
> sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998)
> of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0),
> mkdir failed [1]
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 101
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 425
> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 132
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_session_dir failed
>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>
>
After some googling I found that this is apparently an ext3 filesystem
limitation, i.e. there can be only 31998 subdirectories in a directory. Why
is openmpi creating all of these directories in the first place? Is there a
way to "recycle" them?

nick


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
On Tue, Dec 1, 2009 at 18:03, Ralph Castain  wrote:

> You may want to check your limits as defined by the shell/system. I can
> also run this for as long as I'm willing to let it run, so something else
> appears to be going on.
>
>
>
Is that with 1.3.3? I found that with 1.3.4 I can run the example much
longer until I hit this error message:


[master] (31996) forking processes
[mujo:14273] opal_os_dirpath_create: Error: Unable to create the
sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998)
of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir
failed [1]
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 101
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 425
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
base/ess_base_std_app.c at line 132
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS





> On Dec 1, 2009, at 4:38 PM, Nicolas Bock wrote:
>
>
>
> On Tue, Dec 1, 2009 at 16:28, Abhishek Kulkarni wrote:
>
>> On Tue, Dec 1, 2009 at 6:15 PM, Nicolas Bock 
>> wrote:
>> > After reading Anthony's question again, I am not sure now that we are
>> having
>> > the same problem, but we might. In any case, the attached example
>> programs
>> > trigger the issue of running out of pipes. I don't see how orted could,
>> even
>> > if it was reused. There is only a very limited number of processes
>> running
>> > at any given time. Once slave terminates, how would it still have open
>> > pipes? Shouldn't the total number of open files, or pipes, be very
>> limited
>> > in this situation? And yet, after maybe 20 or so iterations in master.c,
>> > orted complains about running out of pipes.
>> >
>> > nick
>> >
>>
>> What version of OMPI are you trying it with? I can easily run it up to
>> more
>> than 200 iterations.
>>
>>
> openmpi-1.3.3
>
>
>
>> Also, how many nodes are you running this on?
>>
>> This is on one node with 4 cores. I am using
>
> mpirun -np 1
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Ralph Castain
You may want to check your limits as defined by the shell/system. I can also 
run this for as long as I'm willing to let it run, so something else appears to 
be going on.


On Dec 1, 2009, at 4:38 PM, Nicolas Bock wrote:

> 
> 
> On Tue, Dec 1, 2009 at 16:28, Abhishek Kulkarni  wrote:
> On Tue, Dec 1, 2009 at 6:15 PM, Nicolas Bock  wrote:
> > After reading Anthony's question again, I am not sure now that we are having
> > the same problem, but we might. In any case, the attached example programs
> > trigger the issue of running out of pipes. I don't see how orted could, even
> > if it was reused. There is only a very limited number of processes running
> > at any given time. Once slave terminates, how would it still have open
> > pipes? Shouldn't the total number of open files, or pipes, be very limited
> > in this situation? And yet, after maybe 20 or so iterations in master.c,
> > orted complains about running out of pipes.
> >
> > nick
> >
> 
> What version of OMPI are you trying it with? I can easily run it up to more
> than 200 iterations.
> 
> 
> openmpi-1.3.3
> 
>  
> Also, how many nodes are you running this on?
> 
> This is on one node with 4 cores. I am using
> 
> mpirun -np 1
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
On Tue, Dec 1, 2009 at 16:28, Abhishek Kulkarni  wrote:

> On Tue, Dec 1, 2009 at 6:15 PM, Nicolas Bock 
> wrote:
> > After reading Anthony's question again, I am not sure now that we are
> having
> > the same problem, but we might. In any case, the attached example
> programs
> > trigger the issue of running out of pipes. I don't see how orted could,
> even
> > if it was reused. There is only a very limited number of processes
> running
> > at any given time. Once slave terminates, how would it still have open
> > pipes? Shouldn't the total number of open files, or pipes, be very
> limited
> > in this situation? And yet, after maybe 20 or so iterations in master.c,
> > orted complains about running out of pipes.
> >
> > nick
> >
>
> What version of OMPI are you trying it with? I can easily run it up to more
> than 200 iterations.
>
>
openmpi-1.3.3



> Also, how many nodes are you running this on?
>
> This is on one node with 4 cores. I am using

mpirun -np 1


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Abhishek Kulkarni
On Tue, Dec 1, 2009 at 6:15 PM, Nicolas Bock  wrote:
> After reading Anthony's question again, I am not sure now that we are having
> the same problem, but we might. In any case, the attached example programs
> trigger the issue of running out of pipes. I don't see how orted could, even
> if it was reused. There is only a very limited number of processes running
> at any given time. Once slave terminates, how would it still have open
> pipes? Shouldn't the total number of open files, or pipes, be very limited
> in this situation? And yet, after maybe 20 or so iterations in master.c,
> orted complains about running out of pipes.
>
> nick
>

What version of OMPI are you trying it with? I can easily run it up to more
than 200 iterations.

Also, how many nodes are you running this on?

>
> On Tue, Dec 1, 2009 at 16:08, Nicolas Bock  wrote:
>>
>> Hello list,
>>
>> a while back in January of this year, a user (Anthony Thevenin) had the
>> problem of running out of open pipes when he tried to use MPI_Comm_spawn a
>> few times. As I found the thread he started in the mailing list archives and have
>> just joined the mailing list myself, I unfortunately can't reply to the
>> thread. "The thread was titled: Doing a lot of spawns does not work with
>> ompi 1.3 BUT works with ompi 1.2.7".
>>
>> The discussion stopped without really presenting a solution. Is the issue
>> brought up by Anthony fixed? We are running into the same problem.
>>
>> Thanks, nick
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
Linux mujo 2.6.30-gentoo-r5 #1 SMP PREEMPT Thu Sep 17 07:47:12 MDT 2009
x86_64 Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux

On Tue, Dec 1, 2009 at 16:24, Ralph Castain  wrote:

> It really does help if we have some idea what OMPI version you are talking
> about, and on what kind of platform.
>
> This issue was fixed to the best of my knowledge (not all the pipes were
> getting closed), but I would have to look and see what release might contain
> the fix...would be nice to know where to start.
>
>
> On Dec 1, 2009, at 4:15 PM, Nicolas Bock wrote:
>
> After reading Anthony's question again, I am not sure now that we are
> having the same problem, but we might. In any case, the attached example
> programs trigger the issue of running out of pipes. I don't see how orted
> could, even if it was reused. There is only a very limited number of
> processes running at any given time. Once slave terminates, how would it
> still have open pipes? Shouldn't the total number of open files, or pipes,
> be very limited in this situation? And yet, after maybe 20 or so iterations
> in master.c, orted complains about running out of pipes.
>
> nick
>
>
> On Tue, Dec 1, 2009 at 16:08, Nicolas Bock  wrote:
>
>> Hello list,
>>
>> a while back in January of this year, a user (Anthony Thevenin) had the
>> problem of running out of open pipes when he tried to use MPI_Comm_spawn a
>> few times. As I found the thread he started in the mailing list archives and have
>> just joined the mailing list myself, I unfortunately can't reply to the
>> thread. "The thread was titled: Doing a lot of spawns does not work with
>> ompi 1.3 BUT works with ompi 1.2.7".
>>
>> The discussion stopped without really presenting a solution. Is the issue
>> brought up by Anthony fixed? We are running into the same problem.
>>
>> Thanks, nick
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
Sorry,

openmpi-1.3.3 compiled with gcc-4.4.2

nick


On Tue, Dec 1, 2009 at 16:24, Ralph Castain  wrote:

> It really does help if we have some idea what OMPI version you are talking
> about, and on what kind of platform.
>
> This issue was fixed to the best of my knowledge (not all the pipes were
> getting closed), but I would have to look and see what release might contain
> the fix...would be nice to know where to start.
>
>
> On Dec 1, 2009, at 4:15 PM, Nicolas Bock wrote:
>
> After reading Anthony's question again, I am not sure now that we are
> having the same problem, but we might. In any case, the attached example
> programs trigger the issue of running out of pipes. I don't see how orted
> could, even if it was reused. There is only a very limited number of
> processes running at any given time. Once slave terminates, how would it
> still have open pipes? Shouldn't the total number of open files, or pipes,
> be very limited in this situation? And yet, after maybe 20 or so iterations
> in master.c, orted complains about running out of pipes.
>
> nick
>
>
> On Tue, Dec 1, 2009 at 16:08, Nicolas Bock  wrote:
>
>> Hello list,
>>
>> a while back in January of this year, a user (Anthony Thevenin) had the
>> problem of running out of open pipes when he tried to use MPI_Comm_spawn a
>> few times. As I found the thread he started in the mailing list archives and have
>> just joined the mailing list myself, I unfortunately can't reply to the
>> thread. "The thread was titled: Doing a lot of spawns does not work with
>> ompi 1.3 BUT works with ompi 1.2.7".
>>
>> The discussion stopped without really presenting a solution. Is the issue
>> brought up by Anthony fixed? We are running into the same problem.
>>
>> Thanks, nick
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Ralph Castain
It really does help if we have some idea what OMPI version you are talking 
about, and on what kind of platform.

This issue was fixed to the best of my knowledge (not all the pipes were 
getting closed), but I would have to look and see what release might contain 
the fix...would be nice to know where to start.


On Dec 1, 2009, at 4:15 PM, Nicolas Bock wrote:

> After reading Anthony's question again, I am not sure now that we are having 
> the same problem, but we might. In any case, the attached example programs 
> trigger the issue of running out of pipes. I don't see how orted could, even 
> if it was reused. There is only a very limited number of processes running at 
> any given time. Once slave terminates, how would it still have open pipes? 
> Shouldn't the total number of open files, or pipes, be very limited in this 
> situation? And yet, after maybe 20 or so iterations in master.c, orted 
> complains about running out of pipes.
> 
> nick
> 
> 
> On Tue, Dec 1, 2009 at 16:08, Nicolas Bock  wrote:
> Hello list,
> 
> a while back in January of this year, a user (Anthony Thevenin) had the 
> problem of running out of open pipes when he tried to use MPI_Comm_spawn a 
> few times. As I found the thread he started in the mailing list archives and have
> just joined the mailing list myself, I unfortunately can't reply to the 
> thread. "The thread was titled: Doing a lot of spawns does not work with ompi 
> 1.3 BUT works with ompi 1.2.7".
> 
> The discussion stopped without really presenting a solution. Is the issue 
> brought up by Anthony fixed? We are running into the same problem.
> 
> Thanks, nick
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
After reading Anthony's question again, I am not sure now that we are having
the same problem, but we might. In any case, the attached example programs
trigger the issue of running out of pipes. I don't see how orted could, even
if it was reused. There is only a very limited number of processes running
at any given time. Once slave terminates, how would it still have open
pipes? Shouldn't the total number of open files, or pipes, be very limited
in this situation? And yet, after maybe 20 or so iterations in master.c,
orted complains about running out of pipes.

nick


On Tue, Dec 1, 2009 at 16:08, Nicolas Bock  wrote:

> Hello list,
>
> a while back in January of this year, a user (Anthony Thevenin) had the
> problem of running out of open pipes when he tried to use MPI_Comm_spawn a
> few times. As I found the thread he started in the mailing list archives and have
> just joined the mailing list myself, I unfortunately can't reply to the
> thread. "The thread was titled: Doing a lot of spawns does not work with
> ompi 1.3 BUT works with ompi 1.2.7".
>
> The discussion stopped without really presenting a solution. Is the issue
> brought up by Anthony fixed? We are running into the same problem.
>
> Thanks, nick
>
>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
  int rank;
  int size;
  int *error_codes;
  int spawn_counter = 0;
  char *slave_argv[] = { "arg1", "arg2", 0 };
  MPI_Comm spawn;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0)
  {
printf("[master] running on %i processors\n", size);

while (1)
{
  printf("[master] (%i) forking processes\n", spawn_counter++);
  error_codes = (int*) malloc(sizeof(int)*size);
  MPI_Comm_spawn("./slave", slave_argv, size, MPI_INFO_NULL, 0, MPI_COMM_SELF, &spawn, error_codes);
  printf("[master] waiting at barrier\n");
  MPI_Barrier(spawn);
  free(error_codes);
}
  }

  MPI_Finalize();
}
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SLEEP_TIME 2

int
main (int argc, char **argv)
{
  int rank;
  int size;
  MPI_Comm spawn;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("[slave %i] sleeping for %i seconds\n", rank, SLEEP_TIME);
  sleep(SLEEP_TIME);
  printf("[slave %i] waiting at barrier\n", rank);
  MPI_Comm_get_parent(&spawn);
  MPI_Barrier(spawn);

  MPI_Finalize();
}


[OMPI users] MPI_Comm_spawn lots of times

2009-12-01 Thread Nicolas Bock
Hello list,

a while back in January of this year, a user (Anthony Thevenin) had the
problem of running out of open pipes when he tried to use MPI_Comm_spawn a
few times. As I found the thread he started in the mailing list archives and have
just joined the mailing list myself, I unfortunately can't reply to the
thread. "The thread was titled: Doing a lot of spawns does not work with
ompi 1.3 BUT works with ompi 1.2.7".

The discussion stopped without really presenting a solution. Is the issue
brought up by Anthony fixed? We are running into the same problem.

Thanks, nick


Re: [OMPI users] MPI_Comm_spawn query

2009-09-26 Thread Jeff Squyres

On Sep 22, 2009, at 8:20 AM, Blesson Varghese wrote:

I am fairly new to MPI. I have a few queries regarding spawning
processes that I am listing below:

a.   How can processes send data to a spawned process?


See the descriptions for MPI_COMM_SPAWN; you get an inter-communicator  
back from MPI_COMM_SPAWN that connects the parents and children  
processes.  Hence, you can use "normal" MPI communication calls on  
this intercommunicator to pass information between parents and children.
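
For instance, a minimal sketch (not part of the original reply) in which the
parent spawns one copy of itself and sends an integer to the child over the
intercommunicator:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm parent, inter;
  int value = 42;

  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&parent);

  if (parent == MPI_COMM_NULL) {
    /* Parent: spawn one child running this same binary. */
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
    /* Ranks on an intercommunicator address the remote group,
       so destination 0 here is child rank 0. */
    MPI_Send(&value, 1, MPI_INT, 0, 0, inter);
  } else {
    /* Child: receive from parent rank 0 over the parent intercomm. */
    MPI_Recv(&value, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
    printf("child received %d from the parent\n", value);
  }

  MPI_Finalize();
  return 0;
}

The same intercommunicator works in the other direction for child-to-parent
messages.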


b.  Can any process (that is not a parent process) send data to  
a spawned process?


Not directly, no.  But you have two options:

1. use MPI's connect/accept methodology to establish a new connection
(somewhat analogous to the socket connect/accept methodology).
2. creatively merge successive intercommunicators until you have an
overlapping set of processes that contains both of the processes that you
want to be able to communicate between.


FWIW, I imagine that #1 would likely be easier.  See OMPI's  
MPI_Comm_spawn(3) and MPI_Comm_connect(3) man pages for a list of  
limitations, though.


c.   Can MPI_Send or MPI_Recv be used to communicate with a  
spawned process?


Yes.

d.  Would it be possible in MPI to tell which processor of a  
cluster a process should be spawned?


Look at Open MPI's MPI_Comm_spawn(3) man page for the options that we  
allow passing through the MPI_Info argument.
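
For example, a hedged sketch using the standard "host" info key; the
executable name and host name below are made up, and the man page is the
authority on which keys Open MPI honors:

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Comm inter;
  MPI_Info info;

  MPI_Init(&argc, &argv);

  MPI_Info_create(&info);
  /* "host" is a reserved MPI info key; "node01" is a placeholder name. */
  MPI_Info_set(info, "host", "node01");

  /* Spawn 2 copies of a (hypothetical) ./worker on that host. */
  MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                 MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);

  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}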


--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] MPI_Comm_spawn query

2009-09-22 Thread Blesson Varghese
Hi,



I am fairly new to MPI. I have a few queries regarding spawning processes
that I am listing below:

a.   How can processes send data to a spawned process?

b.  Can any process (that is not a parent process) send data to a
spawned process?

c.   Can MPI_Send or MPI_Recv be used to communicate with a spawned
process?

d.  Would it be possible in MPI to tell which processor of a cluster a
process should be spawned?



Looking forward to your reply. Would much appreciate if you could please
include code snippets for the same.



Many thanks and best regards,

Blesson. 





Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Jerome BENOIT

Thanks for the info.

meanwhile I have set:

mpi_param_check = 0

in my system-wide configuration file on workers

and 


mpi_param_check = 1

on the master.

Jerome


Ralph Castain wrote:

Thanks! That does indeed help clarify.

You should also then configure OMPI with 
--disable-per-user-config-files. MPI procs will automatically look at 
the default MCA parameter file, which is probably on your master node 
(wherever mpirun was executed). However, they also look at the user's 
home directory for any user default param file and/or binary modules. So 
the home directory will again be automounted, this time by the MPI procs.


We created that option specifically to address the problem you describe. 
Hope it helps.



On Apr 16, 2009, at 8:57 AM, Jerome BENOIT wrote:


Hi,

thanks for the reply.

Ralph Castain wrote:
The orteds don't pass anything from MPI_Info to srun during a 
comm_spawn. What the orteds do is to chdir to the specified wdir 
before spawning the child process to ensure that the child has the 
correct working directory, then the orted changes back to its default 
working directory.
The orted working directory is set by the base environment. So your 
slurm arguments cause *all* orteds to use the specified directory as 
their "home base". They will then use any given wdir keyval when they 
launch their respective child processes, as described above.
As a side note, it isn't clear to me why you care about the orted's 
working directory. The orteds don't write anything there, or do 
anything with respect to their "home base" - so why would this 
matter? Or are you trying to specify the executable's path relative 
to where the orted is sitting?



Let me be specific. My worker nodes are homeless: the /home directory is
automounted (when needed) from the master node: the orteds don't write
anything, but they keep it mounted!

The idea is to avoid this by specifying a local working directory.

Jerome




On Apr 16, 2009, at 4:02 AM, Jerome BENOIT wrote:

Hi !

finally I got it:
passing the mca key/value `"plm_slurm_args"/"--chdir /local/folder"' 
does the trick.


As a matter of fact, my code pass the MPI_Info key/value 
`"wdir"/"/local/folder"'
to MPI_Comm_spawn as well: the working directories on the nodes of 
the spawned programs
are `nodes:/local/folder' as expected, but the working directory of 
the oreted_s
is the working directory of the parent program. My guess is that the 
MPI_Info key/vale

may also be passed to `srun'.

hth,
Jerome



Jerome BENOIT wrote:

Hello Again,
Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch 
programs around,
oreted working directory on the nodes is the working directory of 
the spawnning program:

can we ask to oreted to use an another directory ?
Changing the working the directory via chdir before spawning with 
MPI_Comm_spawn
changes nothing: the oreted working directory on the nodes seems to 
be imposed
by something else. As run OMPI on top of SLURM, I guess this is 
related to SLURM.

Jerome


Thanks in advance,
Jerome ___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Ralph Castain

Thanks! That does indeed help clarify.

You should also then configure OMPI with --disable-per-user-config- 
files. MPI procs will automatically look at the default MCA parameter  
file, which is probably on your master node (wherever mpirun was  
executed). However, they also look at the user's home directory for  
any user default param file and/or binary modules. So the home  
directory will again be automounted, this time by the MPI procs.


We created that option specifically to address the problem you  
describe. Hope it helps.



On Apr 16, 2009, at 8:57 AM, Jerome BENOIT wrote:


Hi,

thanks for the reply.

Ralph Castain wrote:
The orteds don't pass anything from MPI_Info to srun during a  
comm_spawn. What the orteds do is to chdir to the specified wdir  
before spawning the child process to ensure that the child has the  
correct working directory, then the orted changes back to its  
default working directory.
The orted working directory is set by the base environment. So your  
slurm arguments cause *all* orteds to use the specified directory  
as their "home base". They will then use any given wdir keyval when  
they launch their respective child processes, as described above.
As a side note, it isn't clear to me why you care about the orted's  
working directory. The orteds don't write anything there, or do  
anything with respect to their "home base" - so why would this  
matter? Or are you trying to specify the executable's path relative  
to where the orted is sitting?



Let me be specific. My worker nodes are homeless: the /home directory is
automounted (when needed) from the master node: the orteds don't write
anything, but they keep it mounted!

The idea is to avoid this by specifying a local working directory.

Jerome




On Apr 16, 2009, at 4:02 AM, Jerome BENOIT wrote:

Hi !

finally I got it:
passing the mca key/value `"plm_slurm_args"/"--chdir /local/ 
folder"' does the trick.


As a matter of fact, my code pass the MPI_Info key/value `"wdir"/"/ 
local/folder"'
to MPI_Comm_spawn as well: the working directories on the nodes of  
the spawned programs
are `nodes:/local/folder' as expected, but the working directory  
of the oreted_s
is the working directory of the parent program. My guess is that  
the MPI_Info key/vale

may also be passed to `srun'.

hth,
Jerome



Jerome BENOIT wrote:

Hello Again,
Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch  
programs around,
oreted working directory on the nodes is the working directory  
of the spawnning program:

can we ask to oreted to use an another directory ?
Changing the working the directory via chdir before spawning with  
MPI_Comm_spawn
changes nothing: the oreted working directory on the nodes seems  
to be imposed
by something else. As run OMPI on top of SLURM, I guess this is  
related to SLURM.

Jerome


Thanks in advance,
Jerome ___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Jerome BENOIT

Hi,

thanks for the reply.

Ralph Castain wrote:
The orteds don't pass anything from MPI_Info to srun during a 
comm_spawn. What the orteds do is to chdir to the specified wdir before 
spawning the child process to ensure that the child has the correct 
working directory, then the orted changes back to its default working 
directory.


The orted working directory is set by the base environment. So your 
slurm arguments cause *all* orteds to use the specified directory as 
their "home base". They will then use any given wdir keyval when they 
launch their respective child processes, as described above.


As a side note, it isn't clear to me why you care about the orted's 
working directory. The orteds don't write anything there, or do anything 
with respect to their "home base" - so why would this matter? Or are you 
trying to specify the executable's path relative to where the orted is 
sitting?



Let me be specific. My worker nodes are homeless: the /home directory is
automounted (when needed) from the master node. The orteds don't write anything
there, but they keep it mounted!
The idea is to avoid this by specifying a local working directory.

Jerome






On Apr 16, 2009, at 4:02 AM, Jerome BENOIT wrote:


Hi !

finally I got it:
passing the mca key/value `"plm_slurm_args"/"--chdir /local/folder"' 
does the trick.


As a matter of fact, my code passes the MPI_Info key/value `"wdir"/"/local/folder"'
to MPI_Comm_spawn as well: the working directories on the nodes of the spawned
programs are `nodes:/local/folder' as expected, but the working directory of the
orteds is the working directory of the parent program. My guess is that the
MPI_Info key/value may also be passed to `srun'.

hth,
Jerome



Jerome BENOIT wrote:

Hello Again,
Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch programs around,
the orted working directory on the nodes is the working directory of the spawning
program:

can we ask the orted to use another directory?
Changing the working directory via chdir() before spawning with MPI_Comm_spawn
changes nothing: the orted working directory on the nodes seems to be imposed
by something else. As I run OMPI on top of SLURM, I guess this is related to SLURM.

Jerome


Thanks in advance,
Jerome



Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Ralph Castain
The orteds don't pass anything from MPI_Info to srun during a  
comm_spawn. What the orteds do is to chdir to the specified wdir  
before spawning the child process to ensure that the child has the  
correct working directory, then the orted changes back to its default  
working directory.


The orted working directory is set by the base environment. So your  
slurm arguments cause *all* orteds to use the specified directory as  
their "home base". They will then use any given wdir keyval when they  
launch their respective child processes, as described above.


As a side note, it isn't clear to me why you care about the orted's  
working directory. The orteds don't write anything there, or do  
anything with respect to their "home base" - so why would this matter?  
Or are you trying to specify the executable's path relative to where  
the orted is sitting?
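
As an aside for anyone testing this behaviour: a spawned child can check where it
actually starts by printing its working directory. Below is a minimal sketch of such
a check (a hypothetical child program, not part of this thread; it assumes the parent
passed a "wdir" info key such as /local/folder to MPI_Comm_spawn):

/* child.c - print the working directory the spawned process starts in */
#include <stdio.h>
#include <unistd.h>
#include <limits.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char cwd[PATH_MAX];
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    /* only spawned processes have a parent intercommunicator */
    if (parent != MPI_COMM_NULL && getcwd(cwd, sizeof(cwd)) != NULL)
        printf("spawned child working directory: %s\n", cwd);  /* expect /local/folder */

    MPI_Finalize();
    return 0;
}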



On Apr 16, 2009, at 4:02 AM, Jerome BENOIT wrote:


Hi !

finally I got it:
passing the mca key/value `"plm_slurm_args"/"--chdir /local/folder"'  
does the trick.


As a matter of fact, my code passes the MPI_Info key/value `"wdir"/"/local/folder"'
to MPI_Comm_spawn as well: the working directories on the nodes of the spawned
programs are `nodes:/local/folder' as expected, but the working directory of the
orteds is the working directory of the parent program. My guess is that the
MPI_Info key/value may also be passed to `srun'.

hth,
Jerome



Jerome BENOIT wrote:

Hello Again,
Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch programs around,
the orted working directory on the nodes is the working directory of the spawning
program:

can we ask the orted to use another directory?
Changing the working directory via chdir() before spawning with MPI_Comm_spawn
changes nothing: the orted working directory on the nodes seems to be imposed
by something else. As I run OMPI on top of SLURM, I guess this is related to SLURM.

Jerome


Thanks in advance,
Jerome




Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Jerome BENOIT

Hi !

finally I got it:
passing the mca key/value `"plm_slurm_args"/"--chdir /local/folder"' does the 
trick.

As a matter of fact, my code passes the MPI_Info key/value `"wdir"/"/local/folder"'
to MPI_Comm_spawn as well: the working directories on the nodes of the spawned
programs are `nodes:/local/folder' as expected, but the working directory of the
orteds is the working directory of the parent program. My guess is that the
MPI_Info key/value may also be passed to `srun'.

hth,
Jerome
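
For readers who want to try this combination, here is a minimal sketch of the parent
side with the "wdir" info key (the "./worker" executable name and the process count
are placeholders, not the actual setup from this thread). The plm_slurm_args MCA
parameter mentioned above is set outside the program, for example on the mpirun
command line as --mca plm_slurm_args "--chdir /local/folder":

/* parent.c - spawn workers that run in /local/folder on each node */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* working directory for the spawned processes; assumed to exist on every node */
    MPI_Info_set(info, "wdir", "/local/folder");

    /* spawn 4 copies of ./worker; individual error codes are ignored for brevity */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, info,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}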



Jerome BENOIT wrote:

Hello Again,

Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch programs around,
the orted working directory on the nodes is the working directory of the spawning
program:

can we ask the orted to use another directory?

Changing the working directory via chdir() before spawning with MPI_Comm_spawn
changes nothing: the orted working directory on the nodes seems to be imposed
by something else. As I run OMPI on top of SLURM, I guess this is related to SLURM.


Jerome



Thanks in advance,
Jerome





Re: [OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Jerome BENOIT

Hello Again,

Jerome BENOIT wrote:

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch programs around,
the orted working directory on the nodes is the working directory of the spawning
program:

can we ask the orted to use another directory?

Changing the working directory via chdir() before spawning with MPI_Comm_spawn
changes nothing: the orted working directory on the nodes seems to be imposed
by something else. As I run OMPI on top of SLURM, I guess this is related to SLURM.

Jerome



Thanks in advance,
Jerome



[OMPI users] MPI_Comm_spawn and oreted

2009-04-16 Thread Jerome BENOIT

Hello List,

I have just noticed that, when MPI_Comm_spawn is used to launch programs around,
the orted working directory on the nodes is the working directory of the spawning
program:
can we ask the orted to use another directory?

Thanks in advance,
Jerome 


Re: [OMPI users] MPI_Comm_spawn errors

2008-02-19 Thread Tim Prins

Hi Joao,

Unfortunately, spawn is broken on the development trunk right now. We 
are working on a major revamp of the runtime system which should fix 
these problems, but it is not ready yet.


Sorry about that :(

Tim


Joao Vicente Lima wrote:

Hi all,
I'm getting errors with spawn in the following situations:

1) spawn1.c - spawning 2 processes on localhost, one by one; the error is:

spawning ...
[localhost:31390] *** Process received signal ***
[localhost:31390] Signal: Segmentation fault (11)
[localhost:31390] Signal code: Address not mapped (1)
[localhost:31390] Failing at address: 0x98
[localhost:31390] [ 0] /lib/libpthread.so.0 [0x2b1d38a17ed0]
[localhost:31390] [ 1]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_dyn_finalize+0xd2)
[0x2b1d37667cb2]
[localhost:31390] [ 2]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_finalize+0x3b)
[0x2b1d3766358b]
[localhost:31390] [ 3]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_mpi_finalize+0x248)
[0x2b1d37679598]
[localhost:31390] [ 4] ./spawn1(main+0xac) [0x400ac4]
[localhost:31390] [ 5] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b1d38c43b74]
[localhost:31390] [ 6] ./spawn1 [0x400989]
[localhost:31390] *** End of error message ***
--
mpirun has exited due to process rank 0 with PID 31390 on
node localhost calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--

With 1 process spawned, or with 2 processes spawned in one call, there is
no output from the child.

2) spawn2.c - no response; the init call is
 MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &required)

the attachments contain the programs, ompi_info and config.log.

Any suggestions?

thanks a lot.
Joao.








[OMPI users] MPI_Comm_spawn errors

2008-02-18 Thread Joao Vicente Lima
Hi all,
I'm getting errors with spawn in the following situations:

1) spawn1.c - spawning 2 processes on localhost, one by one; the error is:

spawning ...
[localhost:31390] *** Process received signal ***
[localhost:31390] Signal: Segmentation fault (11)
[localhost:31390] Signal code: Address not mapped (1)
[localhost:31390] Failing at address: 0x98
[localhost:31390] [ 0] /lib/libpthread.so.0 [0x2b1d38a17ed0]
[localhost:31390] [ 1]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_dyn_finalize+0xd2)
[0x2b1d37667cb2]
[localhost:31390] [ 2]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_finalize+0x3b)
[0x2b1d3766358b]
[localhost:31390] [ 3]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_mpi_finalize+0x248)
[0x2b1d37679598]
[localhost:31390] [ 4] ./spawn1(main+0xac) [0x400ac4]
[localhost:31390] [ 5] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b1d38c43b74]
[localhost:31390] [ 6] ./spawn1 [0x400989]
[localhost:31390] *** End of error message ***
--
mpirun has exited due to process rank 0 with PID 31390 on
node localhost calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--

With 1 process spawned, or with 2 processes spawned in one call, there is
no output from the child.

2) spawn2.c - no response; the init call is
 MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &required)

the attachments contain the programs, ompi_info and config.log.

Any suggestions?

thanks a lot.
Joao.
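
The attached spawn1.c is not reproduced in the archive. For context, here is a minimal
guess at the shape of the test being described (two spawns of one process each, issued
back to back, followed directly by MPI_Finalize); this is an assumption, not the actual
attachment:

/* spawn1-sketch.c - hypothetical reconstruction of the reported scenario */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, inter1, inter2;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        printf("spawning ...\n");
        /* spawn the same binary twice, one child per call */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter1, MPI_ERRCODES_IGNORE);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter2, MPI_ERRCODES_IGNORE);
    } else {
        printf("I am a spawned child\n");
    }

    fflush(stdout);
    MPI_Finalize();   /* the reported segfault was in ompi_comm_dyn_finalize, reached from here */
    return 0;
}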


spawn1.c.gz
Description: GNU Zip compressed data


spawn2.c.gz
Description: GNU Zip compressed data


ompi_info.txt.gz
Description: GNU Zip compressed data


config.log.gz
Description: GNU Zip compressed data


Re: [OMPI users] MPI_Comm_Spawn

2007-04-04 Thread Ralph H Castain
 () from
> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
> #14 0x080489f3 in main (argc=1, argv=0xb8a4) at spawn6.c:33
> 
> 
> 
> /**TEST 2***/
> 
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library
> "/lib/tls/libthread_db.so.1".
> 
> (gdb) run -np 1 --host myhost spawn6
> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np
> 1 --host myhost spawn6
> [Thread debugging using libthread_db enabled]
> [New Thread 1076121728 (LWP 4022)]
> main***
> main : Lancement MPI*
> Exe : Lance
> Exe: lRankExe  = 1   lRankMain  = 0
> 1 main***MPI_Comm_spawn return : 0
> 1 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 2 main***MPI_Comm_spawn return : 0
> 2 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> ...
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 30 main***MPI_Comm_spawn return : 0
> 30 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 31 main***MPI_Comm_spawn return : 0
> 31 main***Rang main : 0   Rang exe : 1
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1076121728 (LWP 4022)]
> 0x4018833b in strlen () from /lib/tls/libc.so.6
> (gdb) where
> #0  0x4018833b in strlen () from /lib/tls/libc.so.6
> #1  0x40297c5e in orte_gpr_replica_create_itag () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #2  0x4029d2df in orte_gpr_replica_put_fn () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #3  0x40297281 in orte_gpr_replica_put () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #4  0x40048287 in orte_ras_base_node_assign () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #5  0x400463e1 in orte_ras_base_allocate_nodes () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #6  0x402c2bb8 in orte_ras_hostfile_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
> #7  0x400464e0 in orte_ras_base_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #8  0x402b063f in orte_rmgr_urm_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #10 0x402b10ae in orte_rmgr_urm_recv () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #11 0x4004301e in mca_oob_recv_callback () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #12 0x4027a748 in mca_oob_tcp_msg_data () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #14 0x400703f9 in opal_event_loop () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #15 0x4006adfa in opal_progress () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at
> condition.h:81
> #17 0x0804a4c8 in orterun (argc=6, argv=0xb854) at orterun.c:427
> #18 0x08049dd6 in main (argc=6, argv=0xb854) at main.c:13
> (gdb)
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On behalf
> of Tim Prins
> Sent: Monday, 5 March 2007 22:34
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_Spawn
> 
> 
> Never mind, I was just able to replicate it. I'll look into it.
> 
> Tim
> 
> On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:
> 
>> That is possible. Threading support is VERY lightly tested, but I
>> doubt it is the problem since it always fails after 31 spawns.
>> 
>> Again, I have tried with these configure options and the same version
>> of Open MPI and have still not been able to replicate this (after
>> letting it spawn over 500 times). Have you been able to try a more
>> recent version of Open MPI? What kind of system is it?

Re: [OMPI users] MPI_Comm_Spawn

2007-03-13 Thread Ralph H Castain
 from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #10 0x40b7e5c9 in mca_pml_ob1_component_init () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_pml_ob1.so
>> #11 0x40094192 in mca_pml_base_select () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #12 0x4005742c in ompi_mpi_init () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #13 0x4007c182 in PMPI_Init_thread () from
>> /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
>> #14 0x080489f3 in main (argc=1, argv=0xb8a4) at spawn6.c:33
>> 
>> 
>> 
>> /**TEST 2***/
>> 
>> GNU gdb 6.3-debian
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "i386-linux"...Using host libthread_db library
>> "/lib/tls/libthread_db.so.1".
>> 
>> (gdb) run -np 1 --host myhost spawn6
>> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun
>> -np
>> 1 --host myhost spawn6
>> [Thread debugging using libthread_db enabled]
>> [New Thread 1076121728 (LWP 4022)]
>> main***
>> main : Lancement MPI*
>> Exe : Lance
>> Exe: lRankExe  = 1   lRankMain  = 0
>> 1 main***MPI_Comm_spawn return : 0
>> 1 main***Rang main : 0   Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>> 
>> 
>> Exe: lRankExe  = 1   lRankMain  = 0
>> 2 main***MPI_Comm_spawn return : 0
>> 2 main***Rang main : 0   Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>> 
>> ...
>> 
>> Exe: lRankExe  = 1   lRankMain  = 0
>> 30 main***MPI_Comm_spawn return : 0
>> 30 main***Rang main : 0   Rang exe : 1
>> Exe : Lance
>> Exe: Fin.
>> 
>> Exe: lRankExe  = 1   lRankMain  = 0
>> 31 main***MPI_Comm_spawn return : 0
>> 31 main***Rang main : 0   Rang exe : 1
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 1076121728 (LWP 4022)]
>> 0x4018833b in strlen () from /lib/tls/libc.so.6
>> (gdb) where
>> #0  0x4018833b in strlen () from /lib/tls/libc.so.6
>> #1  0x40297c5e in orte_gpr_replica_create_itag () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #2  0x4029d2df in orte_gpr_replica_put_fn () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #3  0x40297281 in orte_gpr_replica_put () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
>> #4  0x40048287 in orte_ras_base_node_assign () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #5  0x400463e1 in orte_ras_base_allocate_nodes () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #6  0x402c2bb8 in orte_ras_hostfile_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
>> #7  0x400464e0 in orte_ras_base_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #8  0x402b063f in orte_rmgr_urm_allocate () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
>> #9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #10 0x402b10ae in orte_rmgr_urm_recv () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
>> #11 0x4004301e in mca_oob_recv_callback () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
>> #12 0x4027a748 in mca_oob_tcp_msg_data () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
>> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
>> #14 0x400703f9 in opal_event_loop () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
>> #15 0x4006adfa in opal_progress () from
>> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
>> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at
>> condition.h:81
>> #17 0x0804a4c8 in orterun (argc=6, argv=0xb854) at orterun.c:427
>> #18 0x08049dd6 in main (argc=6, argv=0xb854) at main.c:13
>> (gdb)
>> -----Original Message-----
>> 

Re: [OMPI users] MPI_Comm_Spawn

2007-03-06 Thread Ralph Castain
; "/lib/tls/libthread_db.so.1".
> 
> (gdb) run -np 1 --host myhost spawn6
> Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np
> 1 --host myhost spawn6
> [Thread debugging using libthread_db enabled]
> [New Thread 1076121728 (LWP 4022)]
> main***
> main : Lancement MPI*
> Exe : Lance
> Exe: lRankExe  = 1   lRankMain  = 0
> 1 main***MPI_Comm_spawn return : 0
> 1 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 2 main***MPI_Comm_spawn return : 0
> 2 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> ...
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 30 main***MPI_Comm_spawn return : 0
> 30 main***Rang main : 0   Rang exe : 1
> Exe : Lance
> Exe: Fin.
> 
> Exe: lRankExe  = 1   lRankMain  = 0
> 31 main***MPI_Comm_spawn return : 0
> 31 main***Rang main : 0   Rang exe : 1
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1076121728 (LWP 4022)]
> 0x4018833b in strlen () from /lib/tls/libc.so.6
> (gdb) where
> #0  0x4018833b in strlen () from /lib/tls/libc.so.6
> #1  0x40297c5e in orte_gpr_replica_create_itag () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #2  0x4029d2df in orte_gpr_replica_put_fn () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #3  0x40297281 in orte_gpr_replica_put () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
> #4  0x40048287 in orte_ras_base_node_assign () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #5  0x400463e1 in orte_ras_base_allocate_nodes () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #6  0x402c2bb8 in orte_ras_hostfile_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
> #7  0x400464e0 in orte_ras_base_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #8  0x402b063f in orte_rmgr_urm_allocate () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #10 0x402b10ae in orte_rmgr_urm_recv () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
> #11 0x4004301e in mca_oob_recv_callback () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
> #12 0x4027a748 in mca_oob_tcp_msg_data () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
> #14 0x400703f9 in opal_event_loop () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #15 0x4006adfa in opal_progress () from
> /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
> #16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at
> condition.h:81
> #17 0x0804a4c8 in orterun (argc=6, argv=0xb854) at orterun.c:427
> #18 0x08049dd6 in main (argc=6, argv=0xb854) at main.c:13
> (gdb)
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On behalf
> of Tim Prins
> Sent: Monday, 5 March 2007 22:34
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_Spawn
> 
> 
> Never mind, I was just able to replicate it. I'll look into it.
> 
> Tim
> 
> On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:
> 
>> That is possible. Threading support is VERY lightly tested, but I
>> doubt it is the problem since it always fails after 31 spawns.
>> 
>> Again, I have tried with these configure options and the same version
>> of Open MPI and have still not been able to replicate this (after
>> letting it spawn over 500 times). Have you been able to try a more
>> recent version of Open MPI? What kind of system is it? How many nodes
>> are you running on?
>> 
>> Tim
>> 
>> On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:
>> 
>>> 
>>> Maybe the problem comes from the configuration options.
>>> The configuration options used are :
>>> ./configure  --enable-mpi-threads --enable-progress-threads --with-
>>> threads=posix --enable-smp-locks
>>> Could you give me your point of view about that please ?
>>> Thanks
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>>> On behalf of Ralp

Re: [OMPI users] MPI_Comm_Spawn

2007-03-06 Thread Rozzen . VINCONT
oThread/lib/openmpi/mca_gpr_replica.so
#2  0x4029d2df in orte_gpr_replica_put_fn () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
#3  0x40297281 in orte_gpr_replica_put () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
#4  0x40048287 in orte_ras_base_node_assign () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
#5  0x400463e1 in orte_ras_base_allocate_nodes () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
#6  0x402c2bb8 in orte_ras_hostfile_allocate () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_ras_hostfile.so
#7  0x400464e0 in orte_ras_base_allocate () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
#8  0x402b063f in orte_rmgr_urm_allocate () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
#9  0x4004f277 in orte_rmgr_base_cmd_dispatch () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
#10 0x402b10ae in orte_rmgr_urm_recv () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_rmgr_urm.so
#11 0x4004301e in mca_oob_recv_callback () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/liborte.so.0
#12 0x4027a748 in mca_oob_tcp_msg_data () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
#13 0x4027bb12 in mca_oob_tcp_peer_recv_handler () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_oob_tcp.so
#14 0x400703f9 in opal_event_loop () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
#15 0x4006adfa in opal_progress () from 
/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/libopal.so.0
#16 0x0804c7a1 in opal_condition_wait (c=0x804fbcc, m=0x804fba8) at 
condition.h:81
#17 0x0804a4c8 in orterun (argc=6, argv=0xb854) at orterun.c:427
#18 0x08049dd6 in main (argc=6, argv=0xb854) at main.c:13
(gdb)
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On behalf
of Tim Prins
Sent: Monday, 5 March 2007 22:34
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Comm_Spawn


Never mind, I was just able to replicate it. I'll look into it.

Tim

On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:

> That is possible. Threading support is VERY lightly tested, but I
> doubt it is the problem since it always fails after 31 spawns.
>
> Again, I have tried with these configure options and the same version
> of Open MPI and have still not been able to replicate this (after
> letting it spawn over 500 times). Have you been able to try a more
> recent version of Open MPI? What kind of system is it? How many nodes
> are you running on?
>
> Tim
>
> On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:
>
>>
>> Maybe the problem comes from the configuration options.
>> The configuration options used are :
>> ./configure  --enable-mpi-threads --enable-progress-threads --with-
>> threads=posix --enable-smp-locks
>> Could you give me your point of view about that please ?
>> Thanks
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>> On behalf of Ralph H Castain
>> Sent: Tuesday, 27 February 2007 16:26
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Comm_Spawn
>>
>>
>> Now that's interesting! There shouldn't be a limit, but to be
>> honest, I've
>> never tested that mode of operation - let me look into it and see.
>> It sounds
>> like there is some counter that is overflowing, but I'll look.
>>
>> Thanks
>> Ralph
>>
>>
>> On 2/27/07 8:15 AM, "rozzen.vinc...@fr.thalesgroup.com"
>>  wrote:
>>
>>> Do you know if there is a limit to the number of MPI_Comm_spawn we
>>> can use in
>>> order to launch a program?
>>> I want to start and stop a program several times (with the function
>>> MPI_Comm_spawn), but every time, after 31 MPI_Comm_spawn calls, I get a
>>> "segmentation fault".
>>> Could you give me your point of view on how to solve this problem?
>>> Thanks
>>>
>>> /*file .c : spawned  the file Exe*/
>>> #include 
>>> #include 
>>> #include 
>>> #include "mpi.h"
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #define EXE_TEST "/home/workspace/test_spaw1/src/Exe"
>>>
>>>
>>>
>>> int main( int argc, char **argv ) {
>>>
>>> long *lpBufferMpi;
>>> MPI_Comm lIntercom;
>>> int lErrcode;
>>> MPI_Comm lCommunicateur;
>>> int lRangMa
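
The quoted test program is cut off by the archive. For context, here is a minimal
self-contained reduction of the scenario being reported: spawn the same executable
repeatedly, disconnecting after each round. This is a hypothetical sketch, not the
poster's actual code; on their Open MPI 1.1.4 build, mpirun reportedly segfaulted
after 31 such iterations.

/* spawn-loop-sketch.c - repeatedly spawn one copy of this binary and disconnect */
#include <mpi.h>
#include <stdio.h>

#define N_ROUNDS 100

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int i, errcode;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* spawned side: detach from the parent so it can move on */
        MPI_Comm_disconnect(&parent);
    } else {
        for (i = 0; i < N_ROUNDS; i++) {
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &intercomm, &errcode);
            printf("%d main***MPI_Comm_spawn return : %d\n", i + 1, errcode);
            MPI_Comm_disconnect(&intercomm);   /* collective with the child's disconnect */
        }
    }

    MPI_Finalize();
    return 0;
}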
