Re: [OMPI users] Problem with running openMPI program
Hi Ankush,

You can get some example MPI programs from http://www.pdc.kth.se/training/Tutor/MPI/Templates/index-frame.html . You can compare their performance in an MPI setting (single processor, multiple processors) and in a non-MPI (serial) setting to show how the cluster can help their research.

Hope that helps,
Prakash

On Apr 20, 2009, at 12:34 PM, Ankush Kaul wrote:

Let me describe what I want to do. I took Linux clustering as my final-year engineering project because I am really interested in networking. To tell the truth, our college does not have any professor with knowledge of clustering. The aim of our project was just to build a cluster, which we did. Now we have to show and explain our project to the professors, so I want something to show them how the cluster works: some program or benchmarking software. Hope you got the problem. And thanks again, we really appreciate your patience.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] getenv issue
Hi Ralph,

Sorry that I did not come back to clean up this request. Actually, it was a stupid user (my) error: I had not included stdlib.h in my source.

Sorry again, and thanks for the effort,
Prakash

On Jan 14, 2008, at 11:12 PM, Jeff Squyres wrote:

Sorry, this mail slipped by me. The most common reason that I have seen this happen is if you are not using the TM support in Open MPI to launch the MPI processes on your allocated nodes. I do not have a TM system to test with, but I *believe* that TM will replicate your entire environment (including $PBS_JOBID) out on the back-end nodes before starting the job. Are you seeing cases where this is not happening? More below.

On Jan 5, 2008, at 3:48 AM, Prakash Velayutham wrote:

Hi,

I am trying to start a simple MPI code below using Open MPI 1.2.4 and Torque 2.2.1.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> cat pbs.c

#include <stdio.h>
#include "mpi.h"

int gdb_var;

void main(int argc, char **argv)
{
    int rank, size, ret;
    char *jobid;

    gdb_var = 0;
    ret = MPI_Init(&argc, &argv);
    if (ret != 0)
        printf("ERROR with MPI initialization\n");
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret != 0)
        printf("ERROR with MPI ranking\n");
    ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ret != 0)
        printf("ERROR with MPI sizes\n");
    if (0 == rank) {
        printf("Host %d ready to attach\n", rank);
        fflush(stdout);
        while (0 == gdb_var)
            sleep(5);
        jobid = getenv("PBS_JOBID");
        printf("Job id is %s\n", *jobid);

I don't think you should be de-referencing jobid here.

        if (!jobid)
            error("PBS_JOBID not set in environment. Code must be run from a\n"
                  " PBS script, perhaps interactively using \"qsub -I\"");
    }
    MPI_Finalize();
}

main() is supposed to return an int. ;-)

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> mpiexec -np 4 --prefix /usr/local/openmpi-1.2.4 ./pbs
prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob>

Hmm. This output doesn't seem to match the code above...?
As shown above, for some reason, PBS_JOBID is not getting set in the MPI processes' environment, even though it is available at the shell level.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> echo $PBS_JOBID
18.fructose.cchmc.org

Any ideas why?

Thanks,
Prakash

--
Jeff Squyres
Cisco Systems
[OMPI users] getenv issue
Hi,

I am trying to start a simple MPI code below using Open MPI 1.2.4 and Torque 2.2.1.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> cat pbs.c

#include <stdio.h>
#include "mpi.h"

int gdb_var;

void main(int argc, char **argv)
{
    int rank, size, ret;
    char *jobid;

    gdb_var = 0;
    ret = MPI_Init(&argc, &argv);
    if (ret != 0)
        printf("ERROR with MPI initialization\n");
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret != 0)
        printf("ERROR with MPI ranking\n");
    ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ret != 0)
        printf("ERROR with MPI sizes\n");
    if (0 == rank) {
        printf("Host %d ready to attach\n", rank);
        fflush(stdout);
        while (0 == gdb_var)
            sleep(5);
        jobid = getenv("PBS_JOBID");
        printf("Job id is %s\n", *jobid);
        if (!jobid)
            error("PBS_JOBID not set in environment. Code must be run from a\n"
                  " PBS script, perhaps interactively using \"qsub -I\"");
    }
    MPI_Finalize();
}

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> mpiexec -np 4 --prefix /usr/local/openmpi-1.2.4 ./pbs
prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob>

As shown above, for some reason, PBS_JOBID is not getting set in the MPI processes' environment, even though it is available at the shell level.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> echo $PBS_JOBID
18.fructose.cchmc.org

Any ideas why?

Thanks,
Prakash
Re: [OMPI users] Simple MPI_Comm_spawn program hangs
To add more info, here is a backtrace of the spawned (hung) program.

(gdb) bt
#0  0xe410 in __kernel_vsyscall ()
#1  0x402cdaec in sched_yield () from /lib/tls/libc.so.6
#2  0x4016360c in opal_progress () at runtime/opal_progress.c:301
#3  0x403a9b29 in mca_oob_tcp_msg_wait (msg=0x805cc70, rc=0xbfffba40) at oob_tcp_msg.c:108
#4  0x403b09a5 in mca_oob_tcp_recv (peer=0xbfffbba8, iov=0xbfffba88, count=1, tag=0, flags=4) at oob_tcp_recv.c:138
#5  0x40119420 in mca_oob_recv_packed (peer=0xbfffbba8, buf=0x821b200, tag=0) at base/oob_base_recv.c:69
#6  0x4003c28b in ompi_comm_allreduce_intra_oob (inbuf=0xbfffbb48, outbuf=0xbfffbb44, count=1, op=0x400d14a0, comm=0x8049d38, bridgecomm=0x0, lleader=0xbfffbc04, rleader=0xbfffbba8, send_first=1) at communicator/comm_cid.c:674
#7  0x4003adf2 in ompi_comm_nextcid (newcomm=0x807c4f8, comm=0x8049d38, bridgecomm=0x0, local_leader=0xbfffbc04, remote_leader=0xbfffbba8, mode=256, send_first=1) at communicator/comm_cid.c:176
#8  0x4003cc2c in ompi_comm_connect_accept (comm=0x8049d38, root=0, port=0x807a5c0, send_first=1, newcomm=0xbfffbc28, tag=2000) at communicator/comm_dyn.c:208
#9  0x4003ec97 in ompi_comm_dyn_init () at communicator/comm_dyn.c:668
#10 0x4005465a in ompi_mpi_init (argc=1, argv=0xbfffbf64, requested=0, provided=0xbfffbd14) at runtime/ompi_mpi_init.c:704
#11 0x40090367 in PMPI_Init (argc=0xbfffbee0, argv=0xbfffbee4) at pinit.c:71
#12 0x08048983 in main (argc=1, argv=0xbfffbf64) at slave.c:43
(gdb)

Prakash

On Dec 6, 2007, at 12:08 AM, Prakash Velayutham wrote:

Hi Edgar,

I changed the spawned program from /bin/hostname to a very simple MPI program as below. But now, the slave hangs right at the MPI_Init line. What could the issue be?
slave.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
#include <sys/types.h>    /* standard system types */
#include <netinet/in.h>   /* Internet address structures */
#include <sys/socket.h>   /* socket interface functions */
#include <netdb.h>        /* host to IP resolution */

int gdb_var;

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    MPI_Status status;
    MPI_Comm inter_comm;
    char hostname[64];
    FILE *f;

    gdb_var = 0;
    while (0 == gdb_var)
        sleep(5);
    gethostname(hostname, 64);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    MPI_Comm_get_parent(&inter_comm);
    MPI_Finalize();
    exit(0);
}

Thanks,
Prakash

On Dec 2, 2007, at 8:36 PM, Edgar Gabriel wrote:

MPI_Comm_spawn is tested nightly by our test suites, so it should definitely work...

Thanks
Edgar

Prakash Velayutham wrote:

Thanks, Edgar. I did not know that. Really? Anyway, are you sure an MPI job will work as a spawned process instead of "hostname"?

Thanks,
Prakash

On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:

MPI_Comm_spawn has to build an intercommunicator with the child process that it spawns. Thus, you can not spawn a non-MPI job such as /bin/hostname, since the parent process waits for some messages from the child process(es) in order to set up the intercommunicator.

Thanks
Edgar

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    printf("MASTER : spawning a slave ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    MPI_Finalize();
    exit(0);
}

##

This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Re: [OMPI users] Simple MPI_Comm_spawn program hangs
Hi Edgar,

I changed the spawned program from /bin/hostname to a very simple MPI program as below. But now, the slave hangs right at the MPI_Init line. What could the issue be?

slave.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
#include <sys/types.h>    /* standard system types */
#include <netinet/in.h>   /* Internet address structures */
#include <sys/socket.h>   /* socket interface functions */
#include <netdb.h>        /* host to IP resolution */

int gdb_var;

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    MPI_Status status;
    MPI_Comm inter_comm;
    char hostname[64];
    FILE *f;

    gdb_var = 0;
    while (0 == gdb_var)
        sleep(5);
    gethostname(hostname, 64);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    MPI_Comm_get_parent(&inter_comm);
    MPI_Finalize();
    exit(0);
}

Thanks,
Prakash

On Dec 2, 2007, at 8:36 PM, Edgar Gabriel wrote:

MPI_Comm_spawn is tested nightly by our test suites, so it should definitely work...

Thanks
Edgar

Prakash Velayutham wrote:

Thanks, Edgar. I did not know that. Really? Anyway, are you sure an MPI job will work as a spawned process instead of "hostname"?

Thanks,
Prakash

On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:

MPI_Comm_spawn has to build an intercommunicator with the child process that it spawns. Thus, you can not spawn a non-MPI job such as /bin/hostname, since the parent process waits for some messages from the child process(es) in order to set up the intercommunicator.

Thanks
Edgar

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    printf("MASTER : spawning a slave ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    MPI_Finalize();
    exit(0);
}

##

This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
Re: [OMPI users] Simple MPI_Comm_spawn program hangs
Thanks, Edgar. I did not know that. Really? Anyway, are you sure an MPI job will work as a spawned process instead of "hostname"?

Thanks,
Prakash

On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:

MPI_Comm_spawn has to build an intercommunicator with the child process that it spawns. Thus, you can not spawn a non-MPI job such as /bin/hostname, since the parent process waits for some messages from the child process(es) in order to set up the intercommunicator.

Thanks
Edgar

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    printf("MASTER : spawning a slave ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    MPI_Finalize();
    exit(0);
}

##

This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
[OMPI users] Simple MPI_Comm_spawn program hangs
Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    printf("MASTER : spawning a slave ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    MPI_Finalize();
    exit(0);
}

##

This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash
Re: [OMPI users] Issues running a basic program with spawn
Ralph,

Please do not bother about the output containing "src is (null) and orte type is 0" in my previous email. It is just some printf I added to dss_copy.c to make sense of what is going wrong.

Prakash

>>> prakash.velayut...@cchmc.org 06/05/07 6:16 AM >>>

Hi,

Sorry about that. Two lines got cut out from the program. Here is the full program and error messages again. No resource manager is involved, just ssh/rsh. The hostfile contains:

bmi-opt2-01
bmi-opt2-02
bmi-opt2-03
bmi-opt2-04

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    printf("MASTER : spawning 3 slaves ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    printf("MASTER : send a message to master of slaves ...\n");
    MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
    MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
    printf("MASTER : message received : %s\n", message_1);
    MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
    MPI_Finalize();
    exit(0);
}

#

prakash@bmi-opt2-01:~/thesis/CS/Samples/x86_64> mpirun -np 1 --pernode --prefix /usr/local/openmpi-1.2 --hostfile machinefile ./master1
MASTER : spawning 3 slaves ...
src is (null) and orte type is 0
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file dss/dss_copy.c at line 43
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file gpr_replica_put_get_fn.c at line 410
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_registry_fns.c at line 612
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 93
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_receive.c at line 139
mpirun: killing job...
mpirun noticed that job rank 0 with PID 3532 on node bmi-opt2-01 exited on signal 15 (Terminated).

Thanks,
Prakash

>>> r...@lanl.gov 06/03/07 9:31 PM >>>

Hi Prakash,

Are you sure the code you provided here is the one generating the output you attached? I don't see this message anywhere in your code:

MASTER : spawning 3 slaves ...

and it certainly isn't anything we generate. Also, your output implies you are in some kind of loop, yet your code contains only a single comm_spawn. Could you please clarify?

Thanks
Ralph

On 6/3/07 5:50 AM, "Prakash Velayutham" <prakash.velayut...@cchmc.org> wrote:

> Hello,
>
> Version - Open MPI 1.2.1.
>
> I have a simple program as below:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include "mpi.h"
>
> void main(int argc, char **argv)
> {
>     int tag = 0;
>     int my_rank;
>     int num_proc;
>     char message_0[] = "hello slave, i'm your master";
>     char message_1[50];
>     char master_data[] = "slaves to work";
>     int num;
>     MPI_Status status;
>     MPI_Comm inter_comm;
>     MPI_Info info;
>     int arr[1];
>     int rc1;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
>     rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
>         MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
>     printf("MASTER : send a message to master of slaves ...\n");
>     MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
>     MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
>     printf("MASTER : message received : %s\n", message_1);
>     MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
>     MPI_Finalize();
>     exit(0);
> }
>
> When this is run, all I get is
>
>> ~/thesis/CS/Samples/x86_64> mpirun -np 4 --pernode --hostfile
> machinefile --prefix /usr/local/openmpi-1.2 ./master1
> MASTER : spawning 3 slaves ...
> MASTER : spawning 3 slaves ...
> MASTER : spawning 3
[OMPI users] Issues running a basic program with spawn
Hello,

Version - Open MPI 1.2.1.

I have a simple program as below:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

void main(int argc, char **argv)
{
    int tag = 0;
    int my_rank;
    int num_proc;
    char message_0[] = "hello slave, i'm your master";
    char message_1[50];
    char master_data[] = "slaves to work";
    int num;
    MPI_Status status;
    MPI_Comm inter_comm;
    MPI_Info info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
    printf("MASTER : send a message to master of slaves ...\n");
    MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
    MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
    printf("MASTER : message received : %s\n", message_1);
    MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
    MPI_Finalize();
    exit(0);
}

When this is run, all I get is

>~/thesis/CS/Samples/x86_64> mpirun -np 4 --pernode --hostfile machinefile --prefix /usr/local/openmpi-1.2 ./master1
MASTER : spawning 3 slaves ...
MASTER : spawning 3 slaves ...
MASTER : spawning 3 slaves ...
MASTER : spawning 3 slaves ...
src is (null) and orte type is 0
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file dss/dss_copy.c at line 43
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file gpr_replica_put_get_fn.c at line 410
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_registry_fns.c at line 612
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 93
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_receive.c at line 139
mpirun: killing job...
mpirun noticed that job rank 0 with PID 25447 on node bmi-opt2-01 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)

Any idea what is wrong with this?

Thanks,
Prakash
Re: [OMPI users] Open MPI error when using MPI_Comm_spawn
Thanks for the info, Ralph. It is as I thought, but I was hoping it wouldn't be that way. I am requesting more nodes from the resource manager from inside my application code using the RM's API. When I know they are available (allocated by the RM), I am trying to split the application data across the newly allocated nodes from inside MPI. Any ideas?

Prakash

>>> r...@lanl.gov 04/02/07 12:11 PM >>>

The runtime underneath Open MPI (called OpenRTE) will not allow you to spawn processes on nodes outside of your allocation. This is for several reasons, but primarily because (a) we only know about the nodes that were allocated, so we have no idea how to spawn a process anywhere else, and (b) most resource managers wouldn't let us do it anyway.

I gather you have some node that you know about and have hard-coded into your application? How do you know the name of the node if it isn't in your allocation??

Ralph

On 4/2/07 10:05 AM, "Prakash Velayutham" <prakash.velayut...@cchmc.org> wrote:

> Hello,
>
> I have built Open MPI (1.2) with run-time environment enabled for the Torque
> (2.1.6) resource manager. Initially I am requesting 4 nodes (1 CPU each)
> from Torque. Then, from inside my MPI code, I am trying to spawn more
> processes to nodes outside of the Torque-assigned nodes using
> MPI_Comm_spawn, but this is failing with the error below:
>
> [wins04:13564] *** An error occurred in MPI_Comm_spawn
> [wins04:13564] *** on communicator MPI_COMM_WORLD
> [wins04:13564] *** MPI_ERR_ARG: invalid argument of some other kind
> [wins04:13564] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 1 with PID 15070 on node wins03 exited on
> signal 15 (Terminated).
> 2 additional processes aborted (not shown)
>
> #
>
> MPI_Info info;
> MPI_Comm comm, *intercomm;
> ...
> ...
> char *key, *value;
> key = "host";
> value = "wins08";
> rc1 = MPI_Info_create(&info);
> rc1 = MPI_Info_set(info, key, value);
> rc1 = MPI_Comm_spawn(slave, MPI_ARGV_NULL, 1, info, 0,
>     MPI_COMM_WORLD, intercomm, arr);
> ...
> }
>
> ###
>
> Would this work as it is, or is something wrong with my assumption? Is
> OpenRTE stopping me from spawning processes outside of the initially
> allocated nodes through Torque?
>
> Thanks,
> Prakash
[OMPI users] Open MPI error when using MPI_Comm_spawn
Hello,

I have built Open MPI (1.2) with run-time environment enabled for the Torque (2.1.6) resource manager. Initially I am requesting 4 nodes (1 CPU each) from Torque. Then, from inside my MPI code, I am trying to spawn more processes to nodes outside of the Torque-assigned nodes using MPI_Comm_spawn, but this is failing with the error below:

[wins04:13564] *** An error occurred in MPI_Comm_spawn
[wins04:13564] *** on communicator MPI_COMM_WORLD
[wins04:13564] *** MPI_ERR_ARG: invalid argument of some other kind
[wins04:13564] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 15070 on node wins03 exited on signal 15 (Terminated).
2 additional processes aborted (not shown)

#

MPI_Info info;
MPI_Comm comm, *intercomm;
...
...
char *key, *value;
key = "host";
value = "wins08";
rc1 = MPI_Info_create(&info);
rc1 = MPI_Info_set(info, key, value);
rc1 = MPI_Comm_spawn(slave, MPI_ARGV_NULL, 1, info, 0,
    MPI_COMM_WORLD, intercomm, arr);
...
}

###

Would this work as it is, or is something wrong with my assumption? Is OpenRTE stopping me from spawning processes outside of the initially allocated nodes through Torque?

Thanks,
Prakash
[OMPI users] Spawning to processors outside of the process manager assigned nodes
Hello,

I have Torque as the batch manager and Open MPI (1.0.1) as the MPI library. Initially I request 'n' processors through Torque. After the Open MPI job starts, based on certain conditions, I want to acquire more processors outside of the nodes initially assigned by Torque. Is this a problem? Is this why my MPI_Comm_spawn is failing (where I set the MPI_Info element's key to "host" and its value to the hostname of a new node outside of Torque's initial assignment)? Any ideas?

Thanks,
Prakash
Re: [OMPI users] Need help in Perl with MPI
Hello,

Yes, we do this all the time. But you should understand that the MySQL database server becomes your bottleneck in this parallel environment. In our case, we run the database servers also in parallel on the scheduler-assigned nodes, but this is very much application-specific.

Thanks,
Prakash

Abhishek Pratap wrote:
> Hello All,
>
> Can I execute a Perl program over MPI? My program has to access a MySQL
> database during runtime.
>
> Is it possible? Here in Perl I can use Parallel::MPI (or
> Parallel::MPI::Simple), but will they be able to access the MySQL database
> simultaneously from the server?
>
> Regards,
> Abhishek
>
> On 9/29/06, Prakash Velayutham <prakash.velayut...@cchmc.org> wrote:
>>
>> Use Perl's Parallel::MPI (or Parallel::MPI::Simple) module. Get it from
>> CPAN. The documentation should be good enough to start with.
>>
>> Prakash
>>
>> Abhishek Pratap wrote:
>> > Can I execute code written in Perl with MPI?
>> >
>> > My code also accesses a database present locally on the server.
>> >
>> > I am new to this field and looking for some help.
>> >
>> > Regards,
>> > Abhishek
Re: [OMPI users] Need help in Perl with MPI
Use Perl's Parallel::MPI (or Parallel::MPI::Simple) module. Get it from CPAN. The documentation should be good enough to start with.

Prakash

Abhishek Pratap wrote:
> Can I execute code written in Perl with MPI?
>
> My code also accesses a database present locally on the server.
>
> I am new to this field and looking for some help.
>
> Regards,
> Abhishek
Re: [OMPI users] Perl and MPI
AFAIK, both those modules work with the standard MPI API and not others. The MPI::Simple I mentioned is actually Parallel::MPI::Simple. Both Parallel::MPI and Parallel::MPI::Simple are available from CPAN.

Prakash

imran shaik wrote:
> Hi Prakash,
> Do I need the MPI runtime environment to use those Perl modules?
> Can't I use some other clustering software?
> Where can I get MPI::Simple?
>
> Imran
>
> >Hello,
> >
> >My users use the Parallel::MPI and MPI::Simple Perl modules consistently
> >without issues. But I am not sure of the support for the MPI-2 standard with
> >either of these modules. Is there someone here who can answer that
> >question too? Also, those modules seem to work only with MPICH now and
> >not the other MPI distributions.
>
> Prakash Velayutham <prakash.velayut...@cchmc.org> wrote:

Renato Golin wrote:
>> On 9/13/06, imran shaik wrote:
>>> I need to run parallel jobs on a cluster, typically of size 600 nodes,
>>> running SGE, but the programmers are good at Perl, not C or C++. So I
>>> thought of MPI, but I don't know whether it has Perl support?
>>
>> Hi Imran,
>>
>> SGE will dispatch processes among the nodes of your cluster, but it does
>> not support interprocess communication, which MPI does. If your
>> problem is easily splittable (like parsing a large Apache log, or reading a
>> large XML list of things) you might be able to split the data and
>> spawn as many processes as you can.
>>
>> I do it using LSF (another dispatcher) and a Makefile that controls
>> the dependencies and spawns the processes (using make's -j flag), and it
>> works quite well. But if your job needs communication (like
>> processing big matrices, collecting and distributing data among
>> processes, etc.) you'll need interprocess communication, and that's
>> what MPI is best at.
>>
>> In a nutshell, you'll need the runtime environment to run MPI programs,
>> just as you need SGE's runtime environment on every node to
>> dispatch jobs and collect information.
>>
>> About MPI bindings for Perl, there's this module:
>> http://search.cpan.org/~josh/Parallel-MPI-0.03/MPI.pm
>>
>> but it's far too young to be trustworthy, IMHO, and you'll probably
>> need the MPI runtime on all nodes as well...
>>
>> cheers,
>> --renato
Re: [OMPI users] Perl and MPI
Renato Golin wrote:
> On 9/13/06, imran shaik wrote:
>
>> I need to run parallel jobs on a cluster, typically of size 600 nodes,
>> running SGE, but the programmers are good at Perl, not C or C++. So I
>> thought of MPI, but I don't know whether it has Perl support?
>
> Hi Imran,
>
> SGE will dispatch processes among the nodes of your cluster, but it does
> not support interprocess communication, which MPI does. If your
> problem is easily splittable (like parsing a large Apache log, or reading a
> large XML list of things) you might be able to split the data and
> spawn as many processes as you can.
>
> I do it using LSF (another dispatcher) and a Makefile that controls
> the dependencies and spawns the processes (using make's -j flag), and it
> works quite well. But if your job needs communication (like
> processing big matrices, collecting and distributing data among
> processes, etc.) you'll need interprocess communication, and that's
> what MPI is best at.
>
> In a nutshell, you'll need the runtime environment to run MPI programs,
> just as you need SGE's runtime environment on every node to
> dispatch jobs and collect information.
>
> About MPI bindings for Perl, there's this module:
> http://search.cpan.org/~josh/Parallel-MPI-0.03/MPI.pm
>
> but it's far too young to be trustworthy, IMHO, and you'll probably
> need the MPI runtime on all nodes as well...
>
> cheers,
> --renato

Hello,

My users use the Parallel::MPI and MPI::Simple Perl modules consistently without issues. But I am not sure of the support for the MPI-2 standard with either of these modules. Is there someone here who can answer that question too? Also, those modules seem to work only with MPICH now and not the other MPI distributions.

Prakash
Re: [OMPI users] Open MPI error
OK. Figured it out: it was the wrong number of arguments to the code.

Thanks,
Prakash

Jeff Squyres (jsquyres) wrote:

I'm assuming that this is during the startup shortly after mpirun, right? (i.e., during MPI_INIT) It looks like MPI processes were unable to connect back to the rendezvous point (mpirun) during startup. Do you have any firewalls or port blocking running in your cluster?

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Prakash Velayutham
Sent: Friday, April 14, 2006 11:00 AM
To: us...@open-mpi.org
Cc: Prakash Velayutham
Subject: [OMPI users] Open MPI error

Hi All,

What does this error mean?

**********************************************************
socket 10: [wins02:19102] [0,0,3]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
socket 12: [wins01:19281] [0,0,4]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_msg_send_handler: writev failed with errno=104
socket 6: [wins05:00939] [0,0,1] ORTE_ERROR_LOG: Communication failure in file gpr_proxy_put_get.c at line 143
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=111) - retrying (pid=939)
**********************************************************

I am still debugging the code I am working on, but just wanted to get some insight into where I should be looking. I am running openmpi-1.0.1.

Thanks,
Prakash
Re: [OMPI users] Open MPI and Torque error
>>> prakash.velayut...@cchmc.org 04/08/06 1:42 PM >>> Hi Jeff, >>> jsquy...@cisco.com 04/08/06 7:10 AM >>> I am also curious as to why this would not work -- I was not under the impression that tm_init() would fail from a non-mother-superior node...? What others say is that it will fail this way inside an Open MPI job, as Open MPI's RTE is taking the only TM connection available. But the strange thing is that it works from Mother Superior without Garrick's patch (actually, regardless of the patch, the behaviour is the same, but I have not rigorously tested the patch itself, so I cannot comment on that), which I think should have failed according to the above contention. FWIW: It has been our experience with both Torque and the various flavors of PBS that you can repeatedly call tm_init() and tm_finalize() within a single process, so I would be surprised if that were the issue. Indeed, I'd have to double-check, but I'm pretty sure that our MPI processes do not call tm_init() (I believe that only mpirun does). But I am running my code using mpirun, so is this expected behaviour?
I am attaching my simple code below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"
#include "tm.h"

extern char **environ;

void do_check(int val, char *msg)
{
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    int size, rank, ret, err, numnodes, local_err;
    MPI_Status status;
    char *input[2];
    input[0] = "/bin/echo";
    input[1] = "Hello There";
    struct tm_roots task_root;
    tm_node_id *nodelist;
    tm_event_t event;
    tm_task_id task_id;
    char hostname[64];
    char buf[] = "11000";

    gethostname(hostname, 64);
    ret = MPI_Init(&argc, &argv);
    if (ret) { printf("Error: %d\n", ret); return 1; }
    ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ret) { printf("Error: %d\n", ret); return 1; }
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret) { printf("Error: %d\n", ret); return 1; }

    printf("First Hostname: %s node %d out of %d\n", hostname, rank, size);
    if (size % 2 && rank == size - 1)
        printf("Sitting out\n");
    else {
        if (rank % 2 == 0)
            MPI_Send(buf, strlen(buf), MPI_BYTE, rank + 1, 11, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank - 1, 11, MPI_COMM_WORLD, &status);
    }
    printf("Second Hostname: %s node %d out of %d\n", hostname, rank, size);

    if (rank == 1) {
        ret = tm_init(NULL, &task_root);
        do_check(ret, "tm_init failed");
        printf("Special Hostname: %s node %d out of %d\n", hostname, rank, size);
        task_id = 0xdeadbeef;
        event = 0xdeadbeef;
        printf("%s\t%s", input[0], input[1]);
        tm_finalize();
    }

    MPI_Finalize();
    return 0;
}

And the error I am getting is:
First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)
I am using Torque-2.0.0p7 and Open MPI-1.0.1.
Prakash: are you running an unmodified version of Torque 2.0.0p7? I will test an unmodified version of 2.0.0p8 right now and let you know, but I am positive that is not the issue. TIA, Prakash > -Original Message- > From: users-boun...@open-mpi.org > [mailto:users-boun...@open-mpi.org] On Behalf Of Prakash Velayutham > Sent: Friday, April 07, 2006 10:13 AM > To: Open MPI Users > Cc: pak@sun.com > Subject: Re: [OMPI users] Open MPI and Torque error > > Pak Lui wrote: > > Prakash, > > > > tm_poll: protocol number dis error 11 > > ret is 17002 instead of 0: tm_init failed > > 3 processes killed (possibly by Open MPI) > > > > I encountered a similar problem with OpenPBS before, which > also uses the > > TM interfaces. It returns a TM_ENOTCONNECTED (17002) when I > tried to > > call tm_init for the second time (which in turn calls tm_poll and > > returned that errno).
Re: [OMPI users] Open MPI and Torque error
Pak Lui wrote:

Prakash,

tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I encountered a similar problem with OpenPBS before, which also uses the TM interfaces. It returns a TM_ENOTCONNECTED (17002) when I tried to call tm_init for the second time (which in turn calls tm_poll and returned that errno). I think what you did was to start tm_init from another node and connect to another MOM, which I do not think is allowed. The TM module in Open MPI already called tm_init once. I am curious to know the reason that you need to call tm_init again. If you are curious about the implementation for PBS, you can download the source from openpbs.org. OpenPBS source: v2.3.16/src/lib/Libifl/tm.c

I am interested in getting this to work because I am implementing support for dynamic scheduling in Torque. I want any node in an MPI-2 job (basically the Open MPI implementation) to be able to request more nodes from the Torque/PBS server. I am doing a little study on that right now. Instead of the nodes talking directly to the server, I want them to talk to the Mother Superior, and the MS in turn will talk to the server. Could you please explain why this does not work now, and why it works when I do the tm_init from the MS but not from any other MOM? Thanks, Prakash
[OMPI users] Open MPI and Torque error
Hi Jeff, I have a minimal MPI program to test the TM interface, and strangely I seem to get errors during the tm_init call. Could you explain what could be wrong? Have you seen anything similar? Here is the MPI code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"
#include "tm.h"

extern char **environ;

void do_check(int val, char *msg)
{
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    int size, rank, ret, err, numnodes, local_err;
    MPI_Status status;
    char *input[2];
    input[0] = "/bin/echo";
    input[1] = "Hello There";
    struct tm_roots task_root;
    tm_node_id *nodelist;
    tm_event_t event;
    tm_task_id task_id;
    char hostname[64];
    char buf[] = "11000";

    gethostname(hostname, 64);
    ret = MPI_Init(&argc, &argv);
    if (ret) { printf("Error: %d\n", ret); return 1; }
    ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ret) { printf("Error: %d\n", ret); return 1; }
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret) { printf("Error: %d\n", ret); return 1; }

    printf("First Hostname: %s node %d out of %d\n", hostname, rank, size);
    if (size % 2 && rank == size - 1)
        printf("Sitting out\n");
    else {
        if (rank % 2 == 0)
            MPI_Send(buf, strlen(buf), MPI_BYTE, rank + 1, 11, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank - 1, 11, MPI_COMM_WORLD, &status);
    }
    printf("Second Hostname: %s node %d out of %d\n", hostname, rank, size);

    if (rank == 1) {
        ret = tm_init(NULL, &task_root);
        do_check(ret, "tm_init failed");
        printf("Special Hostname: %s node %d out of %d\n", hostname, rank, size);
        task_id = 0xabcdef;
        event = 0xabcdef;
        printf("%s\t%s", input[0], input[1]);
        tm_finalize();
    }

    MPI_Finalize();
    return 0;
}

The error I am getting is:
First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)
I am using Torque-2.0.0p7 and Open MPI-1.0.1. Thanks, Prakash