Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Prakash Velayutham

Hi Ankush,

You can get some example MPI programs from http://www.pdc.kth.se/training/Tutor/MPI/Templates/index-frame.html.

You can compare the performance of these in an MPI setting (single
processor, multiple processors) and a non-MPI (serial) setting to show
how it can help their research.
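For that kind of demo, a small self-contained benchmark works well. Below is a minimal sketch (not one of the PDC templates; the step count and output format are illustrative) that estimates pi and reports wall-clock time, so the same binary can be timed with -np 1 and -np N:

/* pi.c - sketch: time the same computation serially and in parallel.
 * Build and run: mpicc pi.c -o pi && mpirun -np 4 ./pi
 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    const long n = 100000000;          /* number of integration steps */
    int rank, size;
    long i;
    double h, local = 0.0, sum = 0.0, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    h = 1.0 / (double)n;
    /* each rank integrates a strided subset of [0,1] */
    for (i = rank; i < n; i += size) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("pi ~= %.12f with %d process(es) in %f s\n",
               h * sum, size, t1 - t0);
    MPI_Finalize();
    return 0;
}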


Hope that helps,
Prakash

On Apr 20, 2009, at 12:34 PM, Ankush Kaul wrote:


Let me describe what I want to do.

I took Linux clustering as my final-year engineering project, as I am
really interested in networking.

To tell the truth, our college does not have any professor with
knowledge of clustering.

The aim of our project was just to make a cluster, which we did. Now we
have to show and explain our project to the professors, so I want
something to show them how the cluster works... some program or
benchmarking software.

Hope you got the problem.
And thanks again, we really appreciate your patience.




Re: [OMPI users] getenv issue

2008-01-15 Thread Prakash Velayutham

Hi Ralph,

Sorry that I did not come back to clean up this request. Actually, it  
was a stupid user (my) error. I had not included stdlib.h in my  
source. Sorry again and thanks for the effort.


Prakash

On Jan 14, 2008, at 11:12 PM, Jeff Squyres wrote:


Sorry, this mail slipped by me.

The most common reason that I have seen this happen is if you are not
using the TM support in Open MPI to launch the MPI processes on your
allocated nodes.

I do not have a TM system to test with, but I *believe* that TM will
replicate your entire environment (including $PBS_JOBID) out on the
back-end nodes before starting the job.

Are you seeing cases where this is not happening?
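If it helps to verify, a tiny diagnostic along these lines (a sketch, not from the thread) prints the variable exactly as each rank sees it:

/* envcheck.c - sketch: print PBS_JOBID as seen by every rank */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank;
    char host[64];
    const char *jobid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    jobid = getenv("PBS_JOBID");
    printf("rank %d on %s: PBS_JOBID=%s\n", rank, host,
           jobid ? jobid : "(not set)");
    MPI_Finalize();
    return 0;
}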

More below.


On Jan 5, 2008, at 3:48 AM, Prakash Velayutham wrote:


Hi,

I am trying to start a simple MPI code below using Open MPI 1.2.4 and
Torque 2.2.1.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> cat pbs.c
#include <stdio.h>
#include "mpi.h"

int gdb_var;

void main(argc, argv)
int argc;
char **argv;
{
   int rank, size, ret;
   gdb_var = 0;
   char *jobid;
   ret = MPI_Init(&argc, &argv);
   if (ret != 0) printf("ERROR with MPI initialization\n");
   ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (ret != 0) printf("ERROR with MPI ranking\n");
   ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
   if (ret != 0) printf("ERROR with MPI sizes\n");
   if (0 == rank) {
   printf("Host %d ready to attach\n",rank);
   fflush(stdout);
   while (0 == gdb_var) sleep(5);
   jobid = getenv("PBS_JOBID");
   printf("Job id is %s\n", *jobid);



I don't think you should be de-referencing jobid here.



   if (!jobid)
       error("PBS_JOBID not set in environment.  Code must be run from a\n"
             "  PBS script, perhaps interactively using \"qsub -I\"");
   }
   MPI_Finalize();
}



main() is supposed to return an int.  ;-)
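Taken together, a corrected sketch implementing both comments (the error handling shown is illustrative):

#include <stdio.h>
#include <stdlib.h>   /* provides getenv(); this was the missing header */
#include "mpi.h"

int main(int argc, char **argv)   /* int, not void */
{
    int rank;
    char *jobid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        jobid = getenv("PBS_JOBID");
        if (jobid == NULL)                      /* test before use... */
            fprintf(stderr, "PBS_JOBID not set in environment\n");
        else
            printf("Job id is %s\n", jobid);    /* ...and pass the pointer, not *jobid */
    }
    MPI_Finalize();
    return 0;
}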



prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> mpiexec -np 4 --prefix /usr/local/openmpi-1.2.4 ./pbs
prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob>



Hmm.  This output doesn't seem to match the code above...?



As shown above, for some reason, PBS_JOBID is not getting set in the
MPI program's environment, even though it is available at the shell level.

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> echo $PBS_JOBID
18.fructose.cchmc.org

Any ideas why?

Thanks,
Prakash



--
Jeff Squyres
Cisco Systems





[OMPI users] getenv issue

2008-01-05 Thread Prakash Velayutham

Hi,

I am trying to start a simple MPI code below using Open MPI 1.2.4 and  
Torque 2.2.1.


prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> cat pbs.c
#include <stdio.h>
#include "mpi.h"

int gdb_var;

void main(argc, argv)
int argc;
char **argv;
{
int rank, size, ret;
gdb_var = 0;
char *jobid;
ret = MPI_Init(&argc, &argv);
if (ret != 0) printf("ERROR with MPI initialization\n");
ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (ret != 0) printf("ERROR with MPI ranking\n");
ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
if (ret != 0) printf("ERROR with MPI sizes\n");
if (0 == rank) {
printf("Host %d ready to attach\n",rank);
fflush(stdout);
while (0 == gdb_var) sleep(5);
jobid = getenv("PBS_JOBID");
printf("Job id is %s\n", *jobid);
if (!jobid)
    error("PBS_JOBID not set in environment.  Code must be run from a\n"
          "  PBS script, perhaps interactively using \"qsub -I\"");

}
MPI_Finalize();
}

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> mpiexec -np 4 --prefix /usr/local/openmpi-1.2.4 ./pbs

prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob>

As shown above, for some reason, PBS_JOBID is not getting set in the
MPI program's environment, even though it is available at the shell level.


prakash@bmi-opt2-04:~/thesis/CS/Samples/changejob> echo $PBS_JOBID
18.fructose.cchmc.org

Any ideas why?

Thanks,
Prakash


Re: [OMPI users] Simple MPI_Comm_spawn program hangs

2007-12-06 Thread Prakash Velayutham

To add more info, here is a backtrace of the spawned (hung) program.

(gdb) bt
#0  0xe410 in __kernel_vsyscall ()
#1  0x402cdaec in sched_yield () from /lib/tls/libc.so.6
#2  0x4016360c in opal_progress () at runtime/opal_progress.c:301
#3  0x403a9b29 in mca_oob_tcp_msg_wait (msg=0x805cc70, rc=0xbfffba40)  
at oob_tcp_msg.c:108
#4  0x403b09a5 in mca_oob_tcp_recv (peer=0xbfffbba8, iov=0xbfffba88,  
count=1, tag=0, flags=4) at oob_tcp_recv.c:138
#5  0x40119420 in mca_oob_recv_packed (peer=0xbfffbba8, buf=0x821b200,  
tag=0) at base/oob_base_recv.c:69
#6  0x4003c28b in ompi_comm_allreduce_intra_oob (inbuf=0xbfffbb48,  
outbuf=0xbfffbb44, count=1, op=0x400d14a0,
comm=0x8049d38, bridgecomm=0x0, lleader=0xbfffbc04,  
rleader=0xbfffbba8, send_first=1) at communicator/comm_cid.c:674
#7  0x4003adf2 in ompi_comm_nextcid (newcomm=0x807c4f8,  
comm=0x8049d38, bridgecomm=0x0, local_leader=0xbfffbc04,
remote_leader=0xbfffbba8, mode=256, send_first=1) at communicator/ 
comm_cid.c:176
#8  0x4003cc2c in ompi_comm_connect_accept (comm=0x8049d38, root=0,  
port=0x807a5c0, send_first=1, newcomm=0xbfffbc28,

tag=2000) at communicator/comm_dyn.c:208
#9  0x4003ec97 in ompi_comm_dyn_init () at communicator/comm_dyn.c:668
#10 0x4005465a in ompi_mpi_init (argc=1, argv=0xbfffbf64, requested=0,  
provided=0xbfffbd14)

at runtime/ompi_mpi_init.c:704
#11 0x40090367 in PMPI_Init (argc=0xbfffbee0, argv=0xbfffbee4) at  
pinit.c:71

#12 0x08048983 in main (argc=1, argv=0xbfffbf64) at slave.c:43
(gdb)


Prakash


On Dec 6, 2007, at 12:08 AM, Prakash Velayutham wrote:


Hi Edgar,

I changed the spawned program from /bin/hostname to a very simple MPI
program as below. But now, the slave hangs right at MPI_Init line.
What could the issue be?

slave.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
#include <sys/types.h>   /* standard system types   */
#include <netinet/in.h>  /* Internet address structures */
#include <sys/socket.h>  /* socket interface functions  */
#include <netdb.h>       /* host to IP resolution   */

int gdb_var;
void
main(int argc, char **argv)
{
int tag = 0;
int my_rank;
int num_proc;
MPI_Status  status;
MPI_Comm    inter_comm;

gdb_var = 0;
char hostname[64];

FILE *f;

while (0 == gdb_var) sleep(5);
gethostname(hostname, 64);

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

MPI_Comm_get_parent(&inter_comm);

MPI_Finalize();
exit(0);
}

Thanks,
Prakash


On Dec 2, 2007, at 8:36 PM, Edgar Gabriel wrote:


MPI_Comm_spawn is tested nightly by our test suites, so it should
definitely work...

Thanks
Edgar

Prakash Velayutham wrote:

Thanks Edgar. I did not know that. Really?

Anyway, are you sure an MPI job will work as a spawned process
instead of "hostname"?

Thanks,
Prakash


On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:


MPI_Comm_spawn has to build an intercommunicator with the child process
that it spawns. Thus, you can not spawn a non-MPI job such as
/bin/hostname, since the parent process waits for some messages from the
child process(es) in order to set up the intercommunicator.

Thanks
Edgar

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{

  int tag = 0;
  int my_rank;
  int num_proc;
  char        message_0[] = "hello slave, i'm your master";
  char        message_1[50];
  char        master_data[] = "slaves to work";
  int array_of_errcodes[10];
  int num;
  MPI_Status  status;
  MPI_Comm    inter_comm;
  MPI_Info    info;
  int arr[1];
  int rc1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

  printf("MASTER : spawning a slave ... \n");
  rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);

  MPI_Finalize();
  exit(0);
}

##


This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash


Re: [OMPI users] Simple MPI_Comm_spawn program hangs

2007-12-06 Thread Prakash Velayutham

Hi Edgar,

I changed the spawned program from /bin/hostname to a very simple MPI  
program as below. But now, the slave hangs right at MPI_Init line.  
What could the issue be?


slave.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
#include <sys/types.h>   /* standard system types   */
#include <netinet/in.h>  /* Internet address structures */
#include <sys/socket.h>  /* socket interface functions  */
#include <netdb.h>       /* host to IP resolution   */

int gdb_var;
void
main(int argc, char **argv)
{
int tag = 0;
int my_rank;
int num_proc;
MPI_Status  status;
MPI_Comm    inter_comm;

gdb_var = 0;
char hostname[64];

FILE *f;

while (0 == gdb_var) sleep(5);
gethostname(hostname, 64);

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

MPI_Comm_get_parent(&inter_comm);

MPI_Finalize();
exit(0);
}

Thanks,
Prakash


On Dec 2, 2007, at 8:36 PM, Edgar Gabriel wrote:


MPI_Comm_spawn is tested nightly by our test suites, so it should
definitely work...

Thanks
Edgar

Prakash Velayutham wrote:

Thanks Edgar. I did not know that. Really?

Anyway, are you sure an MPI job will work as a spawned process
instead of "hostname"?

Thanks,
Prakash


On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:


MPI_Comm_spawn has to build an intercommunicator with the child process
that it spawns. Thus, you can not spawn a non-MPI job such as
/bin/hostname, since the parent process waits for some messages from the
child process(es) in order to set up the intercommunicator.

Thanks
Edgar

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{

    int tag = 0;
    int my_rank;
    int num_proc;
    char        message_0[] = "hello slave, i'm your master";
    char        message_1[50];
    char        master_data[] = "slaves to work";
    int array_of_errcodes[10];
    int num;
    MPI_Status  status;
    MPI_Comm    inter_comm;
    MPI_Info    info;
    int arr[1];
    int rc1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

    printf("MASTER : spawning a slave ... \n");
    rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);

   MPI_Finalize();
   exit(0);
}

##


This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash









Re: [OMPI users] Simple MPI_Comm_spawn program hangs

2007-12-01 Thread Prakash Velayutham

Thanks Edgar. I did not know that. Really?

Anyway, are you sure an MPI job will work as a spawned process
instead of "hostname"?


Thanks,
Prakash


On Dec 1, 2007, at 5:56 PM, Edgar Gabriel wrote:

MPI_Comm_spawn has to build an intercommunicator with the child process
that it spawns. Thus, you can not spawn a non-MPI job such as
/bin/hostname, since the parent process waits for some messages from the
child process(es) in order to set up the intercommunicator.

Thanks
Edgar
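For reference, a minimal master/slave pair consistent with this explanation might look like the sketch below (two separate files; file names and the printf are illustrative assumptions, not from the thread). Both sides call MPI_Init, and the child picks up the intercommunicator with MPI_Comm_get_parent:

/* spawn_master.c - sketch: spawn an MPI child instead of /bin/hostname */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Comm inter_comm;
    int errcode;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./spawn_slave", MPI_ARGV_NULL, 1,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, &errcode);
    MPI_Finalize();
    return 0;
}

/* spawn_slave.c - sketch: the child must itself be an MPI program */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL)
        printf("slave: connected to parent\n");
    MPI_Finalize();
    return 0;
}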

Prakash Velayutham wrote:

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{

int tag = 0;
int my_rank;
int num_proc;
char        message_0[] = "hello slave, i'm your master";
char        message_1[50];
char        master_data[] = "slaves to work";
int array_of_errcodes[10];
int num;
MPI_Status  status;
MPI_Comm    inter_comm;
MPI_Info    info;
int arr[1];
int rc1;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

printf("MASTER : spawning a slave ... \n");
rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);

MPI_Finalize();
exit(0);
}

##


This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash


--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335




[OMPI users] Simple MPI_Comm_spawn program hangs

2007-12-01 Thread Prakash Velayutham

Hello,

Open MPI 1.2.4

I am trying to run a simple C program.

##

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{

int tag = 0;
int my_rank;
int num_proc;
char        message_0[] = "hello slave, i'm your master";
char        message_1[50];
char        master_data[] = "slaves to work";
int array_of_errcodes[10];
int num;
MPI_Status  status;
MPI_Comm    inter_comm;
MPI_Info    info;
int arr[1];
int rc1;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);

printf("MASTER : spawning a slave ... \n");
rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);


MPI_Finalize();
exit(0);
}

##


This program hangs as below:

prakash@bmi-xeon1-01:~/thesis/CS/Samples> ./master1
MASTER : spawning a slave ...
bmi-xeon1-01

Any ideas why?

Thanks,
Prakash


Re: [OMPI users] Issues running a basic program with spawn

2007-06-05 Thread Prakash Velayutham
Ralph,

Please ignore the output containing "src is (null) and orte
type is 0" in my previous email. It is just a printf I added to
dss_copy.c to make sense of what was going wrong.

Prakash

>>> prakash.velayut...@cchmc.org 06/05/07 6:16 AM >>>
Hi,

Sorry about that. Two lines got cut out from the program. Here is the
full program and error messages again. No Resource Manager involved,
just ssh/rsh.

Hostfile contains

bmi-opt2-01
bmi-opt2-02
bmi-opt2-03
bmi-opt2-04


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{
int tag = 0;
int my_rank;
int num_proc;
char        message_0[] = "hello slave, i'm your master";
char        message_1[50];
char        master_data[] = "slaves to work";
int array_of_errcodes[10];
int num;
MPI_Status  status;
MPI_Comm    inter_comm;
MPI_Info    info;
int arr[1];
int rc1;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
printf("MASTER : spawning 3 slaves ... \n");
rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
printf("MASTER : send a message to master of slaves ...\n");
MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
printf("MASTER : message received : %s\n", message_1);
MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
MPI_Finalize();
exit(0);
}
#

prakash@bmi-opt2-01:~/thesis/CS/Samples/x86_64> mpirun -np 1 --pernode
--prefix /usr/local/openmpi-1.2 --hostfile machinefile ./master1
MASTER : spawning 3 slaves ... 
src is (null) and orte type is 0
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
dss/dss_copy.c at line 43
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
gpr_replica_put_get_fn.c at line 410
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_registry_fns.c at line 612
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 93
[bmi-opt2-01:03527] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_receive.c at line 139
mpirun: killing job...

mpirun noticed that job rank 0 with PID 3532 on node bmi-opt2-01 exited
on signal 15 (Terminated). 

Thanks,
Prakash

>>> r...@lanl.gov 06/03/07 9:31 PM >>>
Hi Prakash

Are you sure the code you provided here is the one generating the output
you attached? I don't see this message anywhere in your code:

MASTER : spawning 3 slaves ...

and it certainly isn't anything we generate. Also, your output implies
you are in some kind of loop, yet your code contains only a single
comm_spawn.

Could you please clarify?

Thanks
Ralph


On 6/3/07 5:50 AM, "Prakash Velayutham" <prakash.velayut...@cchmc.org>
wrote:

> Hello,
> 
> Version - Open MPI 1.2.1.
> 
> I have a simple program as below:
> 
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include "mpi.h"
> 
> void
> main(int argc, char **argv)
> {
> 
> int tag = 0;
> int my_rank;
> int num_proc;
> char        message_0[] = "hello slave, i'm your master";
> char        message_1[50];
> char        master_data[] = "slaves to work";
> int num;
> MPI_Status  status;
> MPI_Comm    inter_comm;
> MPI_Info    info;
> int arr[1];
> int rc1;
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
> MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
> rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
> MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
> printf("MASTER : send a message to master of slaves ...\n");
> MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
> MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
> printf("MASTER : message received : %s\n", message_1);
> MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
> MPI_Finalize();
> exit(0);
> }
> 
> When this is run, all I get is
>> ~/thesis/CS/Samples/x86_64> mpirun -np 4 --pernode --hostfile
> machinefile --prefix /usr/local/openmpi-1.2 ./master1
> MASTER : spawning 3 slaves ...
> MASTER : spawning 3 slaves ...
> MASTER : spawning 3

[OMPI users] Issues running a basic program with spawn

2007-06-03 Thread Prakash Velayutham
Hello,

Version - Open MPI 1.2.1.

I have a simple program as below:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

void
main(int argc, char **argv)
{

int tag = 0;
int my_rank;
int num_proc;
char        message_0[] = "hello slave, i'm your master";
char        message_1[50];
char        master_data[] = "slaves to work";
int num;
MPI_Status  status;
MPI_Comm    inter_comm;
MPI_Info    info;
int arr[1];
int rc1;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
rc1 = MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1,
MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter_comm, arr);
printf("MASTER : send a message to master of slaves ...\n");
MPI_Send(message_0, 50, MPI_CHAR, 0, tag, inter_comm);
MPI_Recv(message_1, 50, MPI_CHAR, 0, tag, inter_comm, &status);
printf("MASTER : message received : %s\n", message_1);
MPI_Send(master_data, 50, MPI_CHAR, 0, tag, inter_comm);
MPI_Finalize();
exit(0);
}

When this is run, all I get is 
>~/thesis/CS/Samples/x86_64> mpirun -np 4 --pernode --hostfile
machinefile --prefix /usr/local/openmpi-1.2 ./master1
MASTER : spawning 3 slaves ... 
MASTER : spawning 3 slaves ... 
MASTER : spawning 3 slaves ... 
MASTER : spawning 3 slaves ... 
src is (null) and orte type is 0
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
dss/dss_copy.c at line 43
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
gpr_replica_put_get_fn.c at line 410
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_registry_fns.c at line 612
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 93
[bmi-opt2-01:25441] [0,0,0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_receive.c at line 139
mpirun: killing job...

mpirun noticed that job rank 0 with PID 25447 on node bmi-opt2-01 exited
on signal 15 (Terminated). 
3 additional processes aborted (not shown)

Any idea what is wrong with this?

Thanks,
Prakash
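One hedged observation about the repeated output, not confirmed in the thread: with -np 4, every rank executes the printf before the collective MPI_Comm_spawn call (the full program reposted on 06/05 contains that printf), so four "spawning" lines are expected even though there is only one spawn statement. It does not explain the ORTE_ERROR_LOG messages, only the duplicated output. A sketch that guards the message by rank:

/* sketch: master1 with the announcement guarded by the rank */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int my_rank, errcode;
    MPI_Comm inter_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0)   /* only rank 0 announces */
        printf("MASTER : spawning 3 slaves ... \n");
    /* MPI_Comm_spawn is collective over MPI_COMM_WORLD: every rank calls it,
       but a single set of children results. Note that spawning a non-MPI
       program like /bin/hostname is itself problematic, per the
       MPI_Comm_spawn threads above. */
    MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &inter_comm, &errcode);
    MPI_Finalize();
    return 0;
}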


Re: [OMPI users] Open MPI error when using MPI_Comm_spawn

2007-04-02 Thread Prakash Velayutham
Thanks for the info, Ralph. It is as I thought, but I was hoping it
wouldn't be that way.
I am requesting more nodes from the resource manager from inside my
application code using the RM's API. When I know they are available
(allocated by the RM), I am trying to split the application data across
the newly allocated nodes from inside MPI.

Any ideas?

Prakash

>>> r...@lanl.gov 04/02/07 12:11 PM >>>
The runtime underneath Open MPI (called OpenRTE) will not allow you to
spawn processes on nodes outside of your allocation. This is for several
reasons, but primarily because (a) we only know about the nodes that were
allocated, so we have no idea how to spawn a process anywhere else, and
(b) most resource managers wouldn't let us do it anyway.

I gather you have some node that you know about and have hard-coded into
your application? How do you know the name of the node if it isn't in
your allocation??

Ralph


On 4/2/07 10:05 AM, "Prakash Velayutham" <prakash.velayut...@cchmc.org>
wrote:

> Hello,
> 
> I have built Open MPI (1.2) with run-time environment enabled for
Torque
> (2.1.6) resource manager. Initially I am requesting 4 nodes (1 CPU
each)
> from Torque. The from inside of my MPI code I am trying to spawn more
> processes to nodes outside of Torque-assigned nodes using
> MPI_Comm_spawn, but this is failing with an error below:
> 
> [wins04:13564] *** An error occurred in MPI_Comm_spawn
> [wins04:13564] *** on communicator MPI_COMM_WORLD
> [wins04:13564] *** MPI_ERR_ARG: invalid argument of some other kind
> [wins04:13564] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 1 with PID 15070 on node wins03 exited on
> signal 15 (Terminated).
> 2 additional processes aborted (not shown)
> 
> #
> 
> MPI_Info info;
> MPI_Comm comm, *intercomm;
> ...
> ...
> char *key, *value;
> key = "host";
> value = "wins08";
> rc1 = MPI_Info_create(&info);
> rc1 = MPI_Info_set(info, key, value);
> rc1 = MPI_Comm_spawn(slave, MPI_ARGV_NULL, 1, info, 0,
> MPI_COMM_WORLD, intercomm, arr);
> ...
> }
> 
> ###
> 
> Would this work as it is or is something wrong with my assumption? Is
> OpenRTE stopping me from spawning processes outside of the initially
> allocated nodes through Torque?
> 
> Thanks,
> Prakash
> 





[OMPI users] Open MPI error when using MPI_Comm_spawn

2007-04-02 Thread Prakash Velayutham
Hello,

I have built Open MPI (1.2) with run-time environment enabled for Torque
(2.1.6) resource manager. Initially I am requesting 4 nodes (1 CPU each)
from Torque. The from inside of my MPI code I am trying to spawn more
processes to nodes outside of Torque-assigned nodes using
MPI_Comm_spawn, but this is failing with an error below:

[wins04:13564] *** An error occurred in MPI_Comm_spawn
[wins04:13564] *** on communicator MPI_COMM_WORLD
[wins04:13564] *** MPI_ERR_ARG: invalid argument of some other kind
[wins04:13564] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 15070 on node wins03 exited on
signal 15 (Terminated). 
2 additional processes aborted (not shown)

#

MPI_Info info;
MPI_Comm comm, *intercomm;
...
...
char *key, *value;
key = "host";
value = "wins08";
rc1 = MPI_Info_create(&info);
rc1 = MPI_Info_set(info, key, value);
rc1 = MPI_Comm_spawn(slave, MPI_ARGV_NULL, 1, info, 0,
MPI_COMM_WORLD, intercomm, arr);
...
}

###

Would this work as it is or is something wrong with my assumption? Is
OpenRTE stopping me from spawning processes outside of the initially
allocated nodes through Torque?
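For reference, a cleaned-up, self-contained sketch of the fragment above (the "./slave" path is an assumption; per the reply thread above, the named host must still be part of the RM allocation for the spawn to succeed):

/* sketch: spawn one child on a named host via the "host" info key */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_Comm intercomm;
    int errcode;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "wins08");   /* must be inside the allocation */
    MPI_Comm_spawn("./slave", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &intercomm, &errcode);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}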

Thanks,
Prakash



[OMPI users] Spawning to processors outside of the process manager assigned nodes

2007-03-30 Thread Prakash Velayutham
Hello,

I have Torque as the batch manager and Open MPI (1.0.1) as the MPI
library. Initially I request 'n' processors through Torque. After
the Open MPI job starts, based on certain conditions, I want to acquire
more processors outside of the nodes initially assigned by Torque. Is
this a problem? Is this why my MPI_Comm_spawn is failing (where I set
the MPI_Info element's key to "host" and its value to the hostname of the
new node outside of Torque's initial assignment)?

Any ideas?

Thanks,
Prakash


Re: [OMPI users] Need help in Perl with MPI

2006-09-29 Thread Prakash Velayutham
Hello,

Yes. We do this all the time. But you should understand that the MySQL
database server becomes your bottleneck in this parallel environment.
In our case, we run the database servers also in parallel on the
scheduler-assigned nodes. But this is very much application-specific.

Thanks,
Prakash

Abhishek Pratap wrote:
> Hello All
>
> Can I execute a Perl program over MPI? My program has to access a MySQL
> database during runtime.
>
> Is it possible? In Perl I can use Parallel::MPI (or
> Parallel::MPI::Simple), but will they be able to access the MySQL
> database simultaneously from the server?
>
> Regards,
> Abhishek
>
> On 9/29/06, Prakash Velayutham <prakash.velayut...@cchmc.org> wrote:
>>
>> Use Perl's Parallel::MPI (or Parallel::MPI::Simple) module. Get it from
>> CPAN. Documentation should be good enough to start with.
>>
>> Prakash
>>
>> Abhishek Pratap wrote:
>> > Can I execute code written in Perl with MPI?
>> >
>> > My code also accesses a database present locally on the server.
>> >
>> > I am new to this field. Looking for some help.
>> >
>> > Regards,
>> > Abhishek


Re: [OMPI users] Need help in Perl with MPI

2006-09-29 Thread Prakash Velayutham
Use Perl's Parallel::MPI (or Parallel::MPI::Simple) module. Get it from
CPAN. Documentation should be good enough to start with.

Prakash

Abhishek Pratap wrote:
> Can I execute code written in Perl with MPI?
>
> My code also accesses a database present locally on the server.
>
> I am new to this field. Looking for some help.
>
> Regards,
> Abhishek


Re: [OMPI users] Perl and MPI

2006-09-15 Thread Prakash Velayutham
AFAIK, both those modules work with the MPI standard API and not others. The
MPI::Simple I mentioned is actually Parallel::MPI::Simple. Both
Parallel::MPI and Parallel::MPI::Simple are available from CPAN.

Prakash

imran shaik wrote:
> Hi Prakash,
>   Do I need the MPI runtime environment for sure to use those Perl modules?
>   Can't I use some other clustering software?
>   Where can I get MPI::Simple?
>
>   Imran
>   
>   >Hello,
>   
>   >My users use Parallel::MPI and MPI::Simple perl modules consistently
>   >without issues. But I am not sure of the support for MPI-2 standard with
>   >either of these modules. Is there someone here that can answer that
>   >question too? Also those modules seem to work only with MPICH now and
>   >not the other MPI distributions.
>
> Prakash Velayutham <prakash.velayut...@cchmc.org> wrote:  Renato Golin wrote:
>   
>> On 9/13/06, imran shaik  wrote:
>> 
>>>  I need to run parallel jobs on a cluster, typically 600 nodes,
>>> running SGE, but the programmers are good at Perl, not C or C++. So I
>>> thought of MPI, but I don't know whether it has Perl support?
>>>   
>> Hi Imran,
>>
>> SGE will dispatch processes among the nodes of your cluster, but it does
>> not support interprocess communication, which MPI does. If your
>> problem is easily splittable (like parsing a large Apache log, or reading
>> a large XML list of things) you might be able to split the data and
>> spawn as many processes as you can.
>>
>> I do it using LSF (another dispatcher) and a Makefile that controls
>> the dependencies and spawn the processes (using make's -j flag) and it
>> works quite well. But if your job needs communication (like
>> processing big matrices, collecting and distributing data among
>> processes etc) you'll need an interprocess communication and that's
>> what MPI is best at.
>>
>> In a nutshell, you'll need the runtime environment to run MPI programs
>> as well as you need SGE's runtime environments on every node to
>> dispatch jobs and collect information.
>>
>> About MPI bindings for Perl, there's this module:
>> http://search.cpan.org/~josh/Parallel-MPI-0.03/MPI.pm
>>
>> but it's far too young to be trustworthy, IMHO, and you'll probably
>> need the MPI runtime on all nodes as well...
>>
>> cheers,
>> --renato
>> 
> Hello,
>
> My users use Parallel::MPI and MPI::Simple perl modules consistently
> without issues. But I am not sure of the support for MPI-2 standard with
> either of these modules. Is there someone here that can answer that
> question too? Also those modules seem to work only with MPICH now and
> not the other MPI distributions.
>
> Prakash


Re: [OMPI users] Perl and MPI

2006-09-13 Thread Prakash Velayutham
Renato Golin wrote:
> On 9/13/06, imran shaik  wrote:
>   
>>  I need to run parallel jobs on a cluster, typically 600 nodes,
>> running SGE, but the programmers are good at Perl, not C or C++. So I
>> thought of MPI, but I don't know whether it has Perl support?
>> 
>
> Hi Imran,
>
> SGE will dispatch processes among the nodes of your cluster, but it does
> not support interprocess communication, which MPI does. If your
> problem is easily splittable (like parsing a large Apache log, or reading
> a large XML list of things) you might be able to split the data and
> spawn as many processes as you can.
>
> I do it using LSF (another dispatcher) and a Makefile that controls
> the dependencies and spawn the processes (using make's -j flag) and it
> works quite well. But if your job needs communication (like
> processing big matrices, collecting and distributing data among
> processes etc) you'll need an interprocess communication and that's
> what MPI is best at.
>
> In a nutshell, you'll need the runtime environment to run MPI programs
> as well as you need SGE's runtime environments on every node to
> dispatch jobs and collect information.
>
> About MPI bindings for Perl, there's this module:
> http://search.cpan.org/~josh/Parallel-MPI-0.03/MPI.pm
>
> but it's far too young to be trustworthy, IMHO, and you'll probably
> need the MPI runtime on all nodes as well...
>
> cheers,
> --renato
Hello,

My users use Parallel::MPI and MPI::Simple perl modules consistently
without issues. But I am not sure of the support for MPI-2 standard with
either of these modules. Is there someone here that can answer that
question too? Also those modules seem to work only with MPICH now and
not the other MPI distributions.

Prakash


Re: [OMPI users] Open MPI error

2006-04-14 Thread Prakash Velayutham

OK. Figured that it was wrong number of arguments to the code.

Thanks,
Prakash

Jeff Squyres (jsquyres) wrote:

I'm assuming that this is during the startup shortly after mpirun,
right?  (i.e., during MPI_INIT)

It looks like MPI processes were unable to connect back to the
rendezvous point (mpirun) during startup.  Do you have any firewalls or
port blocking running in your cluster?
 

  

-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Prakash Velayutham

Sent: Friday, April 14, 2006 11:00 AM
To: us...@open-mpi.org
Cc: Prakash Velayutham
Subject: [OMPI users] Open MPI error

Hi All,

What does this error mean?

**

socket 10: [wins02:19102] [0,0,3]-[0,0,0] mca_oob_tcp_msg_recv: readv
failed with errno=104
socket 12: [wins01:19281] [0,0,4]-[0,0,0] mca_oob_tcp_msg_recv: readv
failed with errno=104
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_msg_send_handler:
writev failed with errno=104
socket 6: [wins05:00939] [0,0,1] ORTE_ERROR_LOG: Communication failure
in file gpr_proxy_put_get.c at line 143
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
**

I am still debugging the code I am working on, but just wanted to get
some insight into where I should be looking.

I am running openmpi-1.0.1.

Thanks,
Prakash


Re: [OMPI users] Open MPI and Torque error

2006-04-08 Thread Prakash Velayutham
Hi Jeff,

>>> jsquy...@cisco.com 04/08/06 7:10 AM >>>
I am also curious as to why this would not work -- I was not under the
impression that tm_init() would fail from a non mother-superior node...?

What others say is that it will fail this way inside a Open MPI job as
Open MPI's RTE is taking the only TM connection available. But the
strange thing is that it works from Mother Superior without Garrick's
patch (actually, regardless of the patch, the behaviour is the same, but
I have not rigorously tested the patch in itself, so cannot comment
about that), which I think should have failed according to the above
contention.

FWIW: It has been our experience with both Torque and the various
flavors of PBS that you can repeatedly call tm_init() and tm_finalize()
within a single process, so I would be surprised if that was the issue.
Indeed, I'd have to double check, but I'm pretty sure that our MPI
processes do not call tm_init() (I believe that only mpirun does).
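That claim can be tested in isolation with a standalone (non-MPI) sketch like the following, assuming the usual PBS tm.h API; run it as a plain process inside a job script, without mpirun, so Open MPI's runtime is not holding the TM connection:

/* tmloop.c - sketch: call tm_init()/tm_finalize() twice in one process */
#include <stdio.h>
#include <tm.h>

int main(void)
{
    struct tm_roots roots;
    int i, ret;

    for (i = 0; i < 2; i++) {
        ret = tm_init(NULL, &roots);
        if (ret != TM_SUCCESS) {
            printf("pass %d: tm_init failed with %d\n", i, ret);
            return 1;
        }
        printf("pass %d: tm_init OK\n", i);
        tm_finalize();
    }
    return 0;
}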

But I am running my code using mpirun, so is this expected behaviour? I
am attaching my simple code below:

#include <stdio.h>
#include "mpi.h"
#include <tm.h>

extern char **environ;

void do_check(int val, char *msg) {
if (TM_SUCCESS != val) {
printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS,
msg);
exit(1);
}
}

main (int argc, char *argv[]) {
int size, rank, ret, err, numnodes, local_err;
MPI_Status status;
char *input[2];
input[0] = "/bin/echo";
input[1] = "Hello There";
struct tm_roots task_root;
tm_node_id *nodelist;
tm_event_t event;
tm_task_id task_id;

char hostname[64];
char buf[] = "11000";

gethostname(hostname, 64);
ret = MPI_Init(&argc, &argv);
if (ret) {
printf ("Error: %d\n", ret);
return (1);
}
ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
if (ret) {
printf("Error: %d\n", ret);
return (1);
}
ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (ret) {
printf("Error: %d\n", ret);
return (1);
}
printf ("First Hostname: %s node %d out of %d\n", hostname,
rank, size);
if (size%2 && rank==size-1)
printf("Sitting out\n");
else {
if (rank%2==0)
MPI_Send(buf, strlen(buf), MPI_BYTE, rank+1, 11,
MPI_COMM_WORLD);
else
MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank-1, 11,
MPI_COMM_WORLD, &status);
}
printf ("Second Hostname: %s node %d out of %d\n", hostname,
rank, size);

if (rank == 1) {
ret = tm_init(NULL, &task_root);
do_check(ret, "tm_init failed");
printf ("Special Hostname: %s node %d out of %d\n",
hostname, rank, size);
task_id = 0xdeadbeef;
event = 0xdeadbeef;
printf("%s\t%s", input[0], input[1]);

tm_finalize();
}

MPI_Finalize ();

return (0);
}

And the error I am getting is:

First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I am using Torque-2.0.0p7 and Open MPI-1.0.1.

Prakash: are you running an unmodified version of Torque 2.0.0p7?

I will test an unmodified version of 2.0.0p8 right now and let you know,
but I am positive that is not the issue.


TIA,
Prakash

> -Original Message-
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Prakash Velayutham
> Sent: Friday, April 07, 2006 10:13 AM
> To: Open MPI Users
> Cc: pak@sun.com
> Subject: Re: [OMPI users] Open MPI and Torque error
> 
> Pak Lui wrote:
> > Prakash,
> >
> > tm_poll: protocol number dis error 11
> > ret is 17002 instead of 0: tm_init failed
> > 3 processes killed (possibly by Open MPI)
> >
> > I encountered similar problem with OpenPBS before, which 
> also uses the 
> > TM interfaces. It returns a TM_ENOTCONNECTED (17002) when I 
> tried to 
> > call tm_init for the second time (which in turns call tm_poll and 
> > returned that errno).
> >
> > I think what you did to start tm_init from another node and 
> connect to 
> > another mom w

Re: [OMPI users] Open MPI and Torque error

2006-04-07 Thread Prakash Velayutham

Pak Lui wrote:

Prakash,

tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I encountered a similar problem with OpenPBS before, which also uses the
TM interfaces. It returns a TM_ENOTCONNECTED (17002) when I tried to
call tm_init for the second time (which in turn calls tm_poll and
returns that errno).

I think what you did, starting tm_init from another node and connecting to
another MOM, is not allowed. The TM module in Open MPI
already called tm_init once. I am curious to know the reason that
you need to call tm_init again?


If you are curious to know about the implementation for PBS, you can 
download the source from openpbs.org. OpenPBS source: 
v2.3.16/src/lib/Libifl/tm.c
I am interested in getting this to work, as I am implementing
support for dynamic scheduling in Torque. I want any node in an MPI-2
job (basically the Open MPI implementation) to be able to request
more nodes from the Torque/PBS server. I am doing a little study of that
right now. Instead of nodes talking directly to the server, I want them
to be able to talk to Mother Superior, and MS in turn will talk to the
server.


Could you please explain why this does not work now? And why it works
when I do the tm_init from MS, but not from any other MOM?


Thanks,
Prakash


[OMPI users] Open MPI and Torque error

2006-04-01 Thread Prakash Velayutham

Hi Jeff,

I have a minimal MPI program to test the TM interface, and strangely I seem to
get errors during the tm_init call. Could you explain what could be wrong? Have
you seen anything similar? Here is the MPI code:

#include <stdio.h>
#include "mpi.h"
#include <tm.h>

extern char **environ;

void do_check(int val, char *msg) {
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

main (int argc, char *argv[]) {
    int size, rank, ret, err, numnodes, local_err;
    MPI_Status status;
    char *input[2];
    input[0] = "/bin/echo";
    input[1] = "Hello There";
    struct tm_roots task_root;
    tm_node_id *nodelist;
    tm_event_t event;
    tm_task_id task_id;

    char hostname[64];
    char buf[] = "11000";

    gethostname(hostname, 64);
    ret = MPI_Init(&argc, &argv);
    if (ret) {
        printf("Error: %d\n", ret);
        return (1);
    }
    ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ret) {
        printf("Error: %d\n", ret);
        return (1);
    }
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (ret) {
        printf("Error: %d\n", ret);
        return (1);
    }
   printf ("First Hostname: %s node %d out of %d\n", hostname, rank, size);
   if (size%2 && rank==size-1)
   printf("Sitting out\n");
   else {
   if (rank%2==0)
   MPI_Send(buf, strlen(buf), MPI_BYTE, rank+1, 11, 
MPI_COMM_WORLD);
   else
   MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank-1, 11, 
MPI_COMM_WORLD, );
   }
   printf ("Second Hostname: %s node %d out of %d\n", hostname, rank, size);

   if (rank == 1) {
   ret = tm_init(NULL, _root);
   do_check(ret, "tm_init failed");
   printf ("Special Hostname: %s node %d out of %d\n", hostname, 
rank, size);
   task_id = 0xabcdef;
   event = 0xabcdef;
   printf("%s\t%s", input[0], input[1]);

   tm_finalize();
   }

   MPI_Finalize ();

   return (0);
}

The error I am getting is:

First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I am using Torque-2.0.0p7 and Open MPI-1.0.1.

Thanks,
Prakash