Re: [OMPI users] check point restart

2013-07-19 Thread Erik Nelson
Thanks Lloyd and Ralph. Regarding Ralph's comment,

> I don't understand the comment about printing and recompiling. Usually, people just have the app
> write its intermediate results to a file, and provide a cmd line option ..

right, I shouldn't have written "recompile". It probably wouldn't increase the
communications overhead that much to do this; I was just wondering if there
might be something simpler.
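
For the archive, the sort of thing Ralph describes would look roughly like the
sketch below (the -restart flag, file names, checkpoint interval, and state
array are all made up for illustration):

#include <cstdio>
#include <cstring>
#include <vector>
#include <mpi.h>

// Sketch of application-level checkpointing: each rank periodically dumps
// its state to its own file; on restart ("-restart" on the command line)
// it reads the file back and resumes from the saved step.
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> state(1024, 0.0);   // stand-in for the real program state
    long step = 0;

    char fname[64];
    std::sprintf(fname, "checkpoint.%d", rank);

    // resume from the last checkpoint if asked to
    if (argc > 1 && std::strcmp(argv[1], "-restart") == 0) {
        if (FILE *f = std::fopen(fname, "rb")) {
            std::fread(&step, sizeof(step), 1, f);
            std::fread(state.data(), sizeof(double), state.size(), f);
            std::fclose(f);
        }
    }

    for (; step < 1000000; ++step) {
        if (step % 10000 == 0) {             // checkpoint every 10000 steps
            if (FILE *f = std::fopen(fname, "wb")) {
                std::fwrite(&step, sizeof(step), 1, f);
                std::fwrite(state.data(), sizeof(double), state.size(), f);
                std::fclose(f);
            }
        }
        // ... one unit of work, MPI communication as usual ...
    }

    MPI_Finalize();
    return 0;
}

Each rank writes its own file; when the 24-hour limit hits, the job is simply
resubmitted with -restart and picks up from the last saved step.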

Erik


[OMPI users] check point restart

2013-07-19 Thread Erik Nelson
I run MPI on an NSF computer. One of the conditions of use is that jobs are
limited to a 24-hour duration to provide a democratic allotment to its users.

A long program can require many restarts, so it becomes necessary to store
the state of the program in memory, print it, recompile, and read the state
to start again.

I seem to remember a simpler approach (checkpoint/restart?) in which the
state of the .exe is saved and then simply restarted from its current
position.

Is there something like this for restarting an MPI program?

Thanks, Erik


-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


Re: [OMPI users] qsub error

2013-02-16 Thread Erik Nelson
yep, runs well now.

On Sat, Feb 16, 2013 at 6:50 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Glad you got it working!
>
> On Feb 15, 2013, at 6:53 PM, Erik Nelson <nelsoner...@gmail.com> wrote:
>
> > I may have deleted any responses to this message. In either case, we
> appear to have fixed the problem
> > by installing a more current version of openmpi.
> >
> >
> > On Thu, Feb 14, 2013 at 2:27 PM, Erik Nelson <nelsoner...@gmail.com>
> wrote:
> >
> > I'm encountering an error using qsub that none of us can figure out. MPI
> C++ programs seem to
> > run fine when executed from the command line, but for some reason when I
> submit them through
> > the queue I get a strange error message ..
> >
> >
> >
> [compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> > connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission
> denied (13)
> >
> >
> > the compute node 3-12 doesn't matter (the error can generate from any of
> the nodes, and I'm
> > guessing that 3-12 is the parent node here).
> >
> > To check if there was some problem with my own code, I created a simple
> 'hello world' program
> > (see attached files).
> >
> > Again, the program runs fine from the command line but fails in qsub
> with the same sort of error
> > message.
> >
> > I have included (i) the code (ii) the job script for qsub, and (iii) the
> ".o" file from qsub for the
> > "hello world" program.
> >
> > These don't look like MPI errors, but rather some conflict with, maybe,
> secure communication
> > across nodes.
> >
> > Is there something simple I can do to fix this?
> >
> > Thanks, Erik
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
> >
> >
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>



-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


Re: [OMPI users] qsub error

2013-02-15 Thread Erik Nelson
I may have deleted any responses to this message. In any case, we appear to
have fixed the problem by installing a more current version of Open MPI.


On Thu, Feb 14, 2013 at 2:27 PM, Erik Nelson <nelsoner...@gmail.com> wrote:

>
> I'm encountering an error using qsub that none of us can figure out. MPI
> C++ programs seem to
> run fine when executed from the command line, but for some reason when I
> submit them through
> the queue I get a strange error message ..
>
>
> [compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>
> connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied
> (13)
>
>
> the compute node 3-12 doesn't matter (the error can generate from any of
> the nodes, and I'm
> guessing that 3-12 is the parent node here).
>
> To check if there was some problem with my own code, I created a simple
> 'hello world' program
> (see attached files).
>
> Again, the program runs fine from the command line but fails in qsub with
> the same sort of error
> message.
>
> I have included (i) the code (ii) the job script for qsub, and (iii) the
> ".o" file from qsub for the
> "hello world" program.
>
> These don't look like MPI errors, but rather some conflict with, maybe,
> secure communication
> across nodes.
>
> Is there something simple I can do to fix this?
>
> Thanks, Erik
>
> --
> Erik Nelson
>
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
>
> p : 214 645 5981
> f : 214 645 5948




-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


[OMPI users] qsub error

2013-02-14 Thread Erik Nelson
I'm encountering an error using qsub that none of us can figure out. MPI
C++ programs seem to
run fine when executed from the command line, but for some reason when I
submit them through
the queue I get a strange error message ..


[compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]

connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied
(13)


The compute node 3-12 doesn't matter (the error can come from any of the
nodes; I'm guessing that 3-12 is the parent node here).

To check if there was some problem with my own code, I created a simple
'hello world' program
(see attached files).

Again, the program runs fine from the command line but fails in qsub with
the same sort of error
message.

I have included (i) the code, (ii) the job script for qsub, and (iii) the
".o" file from qsub for the "hello world" program.

These don't look like MPI errors, but rather some conflict with, perhaps,
secure communication across nodes.

Is there something simple I can do to fix this?
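
One thing we may try, in case it helps anyone searching the archive: the
address in the error looks like a 6to4 IPv6 address, so restricting the TCP
BTL to the cluster's private interface might sidestep it (the interface name
here is only a guess):

mpirun --mca btl_tcp_if_include eth0 -np 2 ./hello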

Thanks,

Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948
#include <stdio.h>
#include <string.h>
#include "/opt/openmpi/include/mpi.h"

#define bufdim 128

int main(int argc, char *argv[])
{
    char buffer[bufdim];
    char id_str[32];

    //  mpi :
    MPI::Init(argc, argv);
    MPI::Status status;

    int size;
    int rank;
    int tag;

    size = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();
    tag  = 0;

    if (rank == 0) {
        printf("%d: we have %d processors\n", rank, size);
        // rank 0 sends a greeting to every other rank ...
        int i;
        for (i = 1; i < size; ++i) {
            sprintf(buffer, "hello  %d! ", i);
            MPI::COMM_WORLD.Send(buffer, bufdim, MPI::CHAR, i, tag);
        }
        // ... then collects and prints the replies
        for (i = 1; i < size; ++i) {
            MPI::COMM_WORLD.Recv(buffer, bufdim, MPI::CHAR, i, tag, status);
            printf("%d: %s\n", rank, buffer);
        }
    }
    else {
        // every other rank receives the greeting, appends its id, and replies
        MPI::COMM_WORLD.Recv(buffer, bufdim, MPI::CHAR, 0, tag, status);

        sprintf(id_str, "processor %d ", rank);
        strncat(buffer, id_str, bufdim - 1 - strlen(buffer));
        strncat(buffer, "reporting for duty\n", bufdim - 1 - strlen(buffer));

        MPI::COMM_WORLD.Send(buffer, bufdim, MPI::CHAR, 0, tag);
    }
    MPI::Finalize();
    return 0;
}
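
For reference, this builds and runs with the Open MPI compiler wrapper, e.g.
(the source file name is assumed):

mpic++ hello.cc -o hello
mpirun -np 4 ./hello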




hello.job
Description: Binary data


hello.job.o5822590
Description: Binary data


Re: [OMPI users] restricting a job to a set of hosts

2012-07-28 Thread Erik Nelson
Reuti,

> -nolocal is IMO an option where you want to execute the `mpirun` on your local login machine and want the MPI
> processes to be allocated somewhere in the cluster, in case you don't have any queuing system around to manage
> the resources.

yes, this is exactly my understanding of the -nolocal option. Otherwise, by
specifying an 'image set' of processors,
everything gets 'mapped' to some subset of processors in the image set.
Again, thanks for your response.
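
As a concrete example of that use case (the hostfile name is illustrative):
running

mpirun -np 100 -nolocal -hostfile cluster_hosts ./executable

from the login machine puts all 100 processes on the hosts listed in
cluster_hosts and none on the login machine itself.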


On Fri, Jul 27, 2012 at 5:15 AM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 27.07.2012 um 03:21 schrieb Ralph Castain:
>
> > Application processes will *only* be placed on nodes included in the
> allocation. The -nolocal flag is intended to ensure that no application
> processes are started on the same node as mpirun in the case where that
> node is included in the allocation. This happens, for example, with Torque,
> where mpirun is executed on one of the allocated nodes.
>
> But the behavior is the same in Torque and SGE. The jobscript is executed
> on one of the elected exechosts (neither the submit host, nor the qmaster
> host [unless they are exechosts too]) and so is eligible to be used too. In
> no case should -nolocal be used.
>
> -nolocal is IMO an option where you want to execute the `mpirun` on your
> local login machine and want the MPI processes to be allocated somewhere in
> the cluster, in case you don't have any queuing system around to manage the
> resources.
>
> -- Reuti
>
> > I believe SGE doesn't do that - and so the allocation won't include the
> submit host, in which case you don't need -nolocal.
> >
> >
> > On Jul 26, 2012, at 5:58 PM, Erik Nelson wrote:
> >
> >> I was under the impression that the -nolocal option keeps processes off
> the submit
> >> host (since there may be hundreds or thousands of jobs submitted at any
> time,
> >> and we don't want this host to be overloaded).
> >>
> >> My understanding of what you said in your last email is that, by listing
> the hosts, I
> >> automatically send all processes (parent and child, or master and slave
> if you
> >> prefer) to the specified list of hosts.
> >>
> >> Reading your email below, it looks like this was the correct
> understanding.
> >>
> >>
> >> On Thu, Jul 26, 2012 at 5:20 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> >> Am 26.07.2012 um 23:58 schrieb Erik Nelson:
> >>
> >> > Reuti,
> >> >
> >> > Thank you. Our queue is backed up, so it will take a little while
> before I can try this.
> >> >
> >> > I assume that by specifying the nodes this way, I don't need (and it
> would confuse
> >> > the system) to add -nolocal. In other words, qsub will try to put the
> parent node
> >> > somewhere in this set.
> >> >
> >> > Is this the idea?
> >>
> >> Depends what you refer to by "parent node". I assume you mean the
> submit host. This is never included in any created selection of SGE unless
> it's an execution host too.
> >>
> >> The master host of the parallel job (i.e. the one where the jobscript
> with the `mpiexec` is running) will be used as a normal machine from MPI's
> point of view.
> >>
> >> -- Reuti
> >>
> >>
> >> > Erik
> >> >
> >> >
> >> > On Thu, Jul 26, 2012 at 4:48 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> >> > Am 26.07.2012 um 23:33 schrieb Erik Nelson:
> >> >
> >> > > I have a purely parallel job that runs ~100 processes. Each process
> has ~identical
> >> > > overhead so the speed of the program is dominated by the slowest
> processor.
> >> > >
> >> > > For this reason, I would like to restrict the job to a specific set
> of identical (fast)
> >> > > processors on our cluster.
> >> > >
> >> > > I read the FAQ on -hosts and -hostfile, but it is still unclear to
> me what effect these
> >> > > directives will have in a queuing environment.
> >> > >
> >> > > Currently, I submit the job using the "qsub" command in the "sge"
> environment as :
> >> > >
> >> > > qsub -pe mpich 101 jobfile.job
> >> > >
> >> > > where jobfile contains the command
> >> > >
> >> > > mpirun -np 101 -nolocal ./executable
> >> >
> >> > I would leave -nolocal out here.
> >> >
>

Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I see. Thank you both for the prompt replies.

On Thu, Jul 26, 2012 at 8:21 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Application processes will *only* be placed on nodes included in the
> allocation. The -nolocal flag is intended to ensure that no application
> processes are started on the same node as mpirun in the case where that
> node is included in the allocation. This happens, for example, with Torque,
> where mpirun is executed on one of the allocated nodes.
>
> I believe SGE doesn't do that - and so the allocation won't include the
> submit host, in which case you don't need -nolocal.
>
>
> On Jul 26, 2012, at 5:58 PM, Erik Nelson wrote:
>
> I was under the impression that the -nolocal option keeps processes off
> the submit
> host (since there may be hundreds or thousands of jobs submitted at any
> time,
> and we don't want this host to be overloaded).
>
> My understanding of what you said in your last email is that, by listing
> the hosts, I
> automatically send all processes (parent and child, or master and slave if
> you
> prefer) to the specified list of hosts.
>
> Reading your email below, it looks like this was the correct understanding.
>
>
> On Thu, Jul 26, 2012 at 5:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Am 26.07.2012 um 23:58 schrieb Erik Nelson:
>>
>> > Reuti,
>> >
>> > Thank you. Our queue is backed up, so it will take a little while
>> before I can try this.
>> >
>> > I assume that by specifying the nodes this way, I don't need (and it
>> would confuse
>> > the system) to add -nolocal. In other words, qsub will try to put the
>> parent node
>> > somewhere in this set.
>> >
>> > Is this the idea?
>>
>> Depends what you refer to by "parent node". I assume you mean the submit
>> host. This is never included in any created selection of SGE unless it's an
>> execution host too.
>>
>> The master host of the parallel job (i.e. the one where the jobscript
>> with the `mpiexec` is running) will be used as a normal machine from MPI's
>> point of view.
>>
>> -- Reuti
>>
>>
>> > Erik
>> >
>> >
>> > On Thu, Jul 26, 2012 at 4:48 PM, Reuti <re...@staff.uni-marburg.de>
>> wrote:
>> > Am 26.07.2012 um 23:33 schrieb Erik Nelson:
>> >
>> > > I have a purely parallel job that runs ~100 processes. Each process
>> has ~identical
>> > > overhead so the speed of the program is dominated by the slowest
>> processor.
>> > >
>> > > For this reason, I would like to restrict the job to a specific set
>> of identical (fast)
>> > > processors on our cluster.
>> > >
>> > > I read the FAQ on -hosts and -hostfile, but it is still unclear to me
>> what effect these
>> > > directives will have in a queuing environment.
>> > >
>> > > Currently, I submit the job using the "qsub" command in the "sge"
>> environment as :
>> > >
>> > >     qsub -pe mpich 101 jobfile.job
>> > >
>> > > where jobfile contains the command
>> > >
>> > > mpirun -np 101 -nolocal ./executable
>> >
>> > I would leave -nolocal out here.
>> >
>> > $ qsub -l
>> "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe
>> mpich 101 jobfile.job
>> >
>> > -- Reuti
>> >
>> >
>> > > I would like to restrict the job to nodes compute-5-1 to compute-5-32
>> on our machine,
>> > > each containing 8 cpu's (slots). How do I go about this?
>> > >
>> > > Thanks, Erik
>> > >
>> > > --
>> > > Erik Nelson
>> > >
>> > > Howard Hughes Medical Institute
>> > > 6001 Forest Park Blvd., Room ND10.124
>> > > Dallas, Texas 75235-9050
>> > >
>> > > p : 214 645 5981
>> > > f : 214 645 5948
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Erik Nelson
>> >
>> > Howard Hughes Medical Institute
>> &

Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I was under the impression that the -nolocal option keeps processes off the
submit
host (since there may be hundreds or thousands of jobs submitted at any
time,
and we don't want this host to be overloaded).

My understanding of what you said in your last email is that, by listing the
hosts, I
automatically send all processes (parent and child, or master and slave if
you
prefer) to the specified list of hosts.

Reading your email below, it looks like this was the correct understanding.


On Thu, Jul 26, 2012 at 5:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 26.07.2012 um 23:58 schrieb Erik Nelson:
>
> > Reuti,
> >
> > Thank you. Our queue is backed up, so it will take a little while before
> I can try this.
> >
> > I assume that by specifying the nodes this way, I don't need (and it
> would confuse
> > the system) to add -nolocal. In other words, qsub will try to put the
> parent node
> > somewhere in this set.
> >
> > Is this the idea?
>
> Depends what you refer to by "parent node". I assume you mean the submit
> host. This is never included in any created selection of SGE unless it's an
> execution host too.
>
> The master host of the parallel job (i.e. the one where the jobscript with
> the `mpiexec` is running) will be used as a normal machine from MPI's point
> of view.
>
> -- Reuti
>
>
> > Erik
> >
> >
> > On Thu, Jul 26, 2012 at 4:48 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> > Am 26.07.2012 um 23:33 schrieb Erik Nelson:
> >
> > > I have a purely parallel job that runs ~100 processes. Each process
> has ~identical
> > > overhead so the speed of the program is dominated by the slowest
> processor.
> > >
> > > For this reason, I would like to restrict the job to a specific set of
> identical (fast)
> > > processors on our cluster.
> > >
> > > I read the FAQ on -hosts and -hostfile, but it is still unclear to me
> what effect these
> > > directives will have in a queuing environment.
> > >
> > > Currently, I submit the job using the "qsub" command in the "sge"
> environment as :
> > >
> > > qsub -pe mpich 101 jobfile.job
> > >
> > > where jobfile contains the command
> > >
> > > mpirun -np 101 -nolocal ./executable
> >
> > I would leave -nolocal out here.
> >
> > $ qsub -l
> "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe
> mpich 101 jobfile.job
> >
> > -- Reuti
> >
> >
> > > I would like to restrict the job to nodes compute-5-1 to compute-5-32
> on our machine,
> > > each containing 8 cpu's (slots). How do I go about this?
> > >
> > > Thanks, Erik
> > >
> > > --
> > > Erik Nelson
> > >
> > > Howard Hughes Medical Institute
> > > 6001 Forest Park Blvd., Room ND10.124
> > > Dallas, Texas 75235-9050
> > >
> > > p : 214 645 5981
> > > f : 214 645 5948
> >
> >
> >
> >
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
>
>
>



-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


[OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I have a purely parallel job that runs ~100 processes. Each process has
~identical overhead, so the speed of the program is dominated by the slowest
processor.

For this reason, I would like to restrict the job to a specific set of
identical (fast)
processors on our cluster.

I read the FAQ on -hosts and -hostfile, but it is still unclear to me what
effect these
directives will have in a queuing environment.

Currently, I submit the job using the "qsub" command in the "sge"
environment as:

qsub -pe mpich 101 jobfile.job

where jobfile contains the command

mpirun -np 101 -nolocal ./executable

I would like to restrict the job to nodes compute-5-1 to compute-5-32 on
our machine, each containing 8 CPUs (slots). How do I go about this?
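
For comparison, outside of a queuing system I understand the restriction can
be expressed with a plain Open MPI hostfile (file name made up, slots set to
the 8 CPUs per node):

compute-5-1 slots=8
compute-5-2 slots=8
...
compute-5-32 slots=8

mpirun -np 101 -hostfile myhosts ./executable

but it is not clear to me whether that is the right thing to do under SGE.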

Thanks, Erik

-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948