Message: 1
Date: Wed, 7 Feb 2007 17:37:41 -0500
From: "Alex Tumanov" <atuma...@gmail.com>
Subject: Re: [OMPI users] first time user - can run mpi job SMP but
        not over        cluster
To: "Open MPI Users" <us...@open-mpi.org>
Message-ID:
        <2453e2900702071437k20a13e97g5014253aa97cc...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hello,

> mpirun -np 2 myprogram inputfile >outputfile
Any number of things could be wrong with the way you run your
executable and/or the way your environment is set up. First of
all, when you ssh into the node, does the environment automatically
get updated with the correct Open MPI paths? I.e. LD_LIBRARY_PATH should
be set to the OMPI lib directory, PATH should contain OMPI's
bin dir, etc. If this is not the case, you have two options:
a. create small /etc/profile.d scripts to set up those env. variables
b. use the --prefix option when you invoke mpirun on the head node
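
As a rough sketch of option (a), a profile.d script could look like the
following. The file name /etc/profile.d/openmpi.sh and the install prefix
/opt/openmpi are assumptions for illustration only; substitute wherever
Open MPI actually lives on your nodes.

```shell
# Sketch of /etc/profile.d/openmpi.sh (file name and prefix are assumptions).
# OMPI_PREFIX is a hypothetical install location; adjust to your system.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi}"

# Prepend Open MPI's bin directory to PATH unless it is already there.
case ":${PATH}:" in
  *":${OMPI_PREFIX}/bin:"*) ;;
  *) PATH="${OMPI_PREFIX}/bin${PATH:+:${PATH}}" ;;
esac

# Same for the library directory and LD_LIBRARY_PATH.
case ":${LD_LIBRARY_PATH:-}:" in
  *":${OMPI_PREFIX}/lib:"*) ;;
  *) LD_LIBRARY_PATH="${OMPI_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
esac

export PATH LD_LIBRARY_PATH
```

To see what a non-interactive ssh session actually gets, something like
ssh nodename 'echo $PATH' (nodename being a placeholder) is a quick check.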

Generally, it would be much more helpful if you provided the actual
output of running the commands you listed here.

> mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile
Another issue I can think of is path specification to 'myprogram'. Do
you just cd into the directory where it resides and specify its name
only? Try specifying either an absolute path to the executable or a path
relative to your homedir: ~/appdir/bin/appexec, assuming this location
is the same on all the nodes. If mpirun can't find your executable on
one of the nodes, it should report that as an error.
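
One way to verify the path on every node before involving mpirun is to
loop over the hostfile with ssh. This is only a sketch: hosts_in and
check_exec are hypothetical helper names, and it assumes passwordless ssh
to each host and a hostfile with one host per line (optionally followed
by slots=N).

```shell
# List host names from an Open MPI style hostfile: drop '#' comments and
# blank lines, keep only the first field (so "slots=2" is ignored).
hosts_in() {
  awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}

# Report whether the given path is an executable file on every host.
# The path is evaluated by the remote shell, so a quoted '~/...' expands
# to the home directory on that node.
check_exec() {
  hostfile=$1
  exe=$2
  for h in $(hosts_in "$hostfile"); do
    if ssh "$h" test -x "$exe" </dev/null; then
      echo "$h: ok"
    else
      echo "$h: cannot execute $exe"
    fi
  done
}
```

Usage would be along the lines of
check_exec myhostfile '~/appdir/bin/appexec'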

> which does not write to the output file.
Does it write anything to stderr? You could also try invoking mpirun
with '--mca pls_rsh_agent ssh'
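
Since the original command only redirects stdout, anything mpirun prints
to stderr goes to the terminal and is easy to miss. A minimal sketch for
capturing both streams in separate files (run_logged and the file name
errorfile are made up for this example):

```shell
# Run a command, sending stdout and stderr to separate files so that
# error messages from the launcher are not lost.
run_logged() {
  out=$1
  err=$2
  shift 2
  "$@" >"$out" 2>"$err"
}
```

For example:
run_logged outputfile errorfile mpirun --hostfile myhostfile -np 4 myprogram inputfile
and then look at errorfile.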

> mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`
Are those backquotes? If so, the shell runs myprogram itself first and
passes its (empty) output to mpirun, so mpirun never sees your program.
I would recommend first getting mpirun to invoke something basic on all
the participating nodes successfully. Try
mpirun --prefix /path/to/ompi/ --hostfile myhostfile -np 4 hostname
for instance. Nothing else will work until this does.
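
To make the result of that check easier to read, the hostname output can
be tallied per node. A small sketch; count_ranks is a hypothetical helper
name, not part of Open MPI:

```shell
# Read host names on stdin (one per launched process) and print
# "host count" for each distinct host.
count_ranks() {
  sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be used as
timeout 30 mpirun --hostfile myhostfile -np 4 hostname | count_ranks
where timeout (GNU coreutils) just keeps a hung launch from blocking the
terminal. If each node is listed with two slots (an assumption), a healthy
run should show both hosts with a count of 2.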

These are just a few pointers to get you started. Hope this helps.

Alex.

Thanks for the suggestions - the mpirun ... hostname is helping me
narrow down the problem.

Both systems have PATH and LD_LIBRARY_PATH set up properly, as far as I
can tell: mpirun launches successfully for an SMP job on each.

Running mpirun --hostfile myhostfile -np 4 hostname (with or without
--prefix and the Open MPI path) gives the following results:

MASTERNODE
MASTERNODE
(system hangs here and I have to Ctrl-C to kill mpirun)

I copied myhostfile to a shared directory and attempted the same
command from the slave node and got:

SLAVENODE
SLAVENODE
an echo message from masternode .bashrc
(system hangs here and I have to Ctrl-C to kill mpirun)
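
The echo line shows that the master node's .bashrc prints text even in a
non-interactive ssh session. Startup-file output is a known source of
trouble for tools that run over ssh (scp is the classic case); whether it
contributes to the hang here is not established. A hedged sketch of a
.bashrc fragment that only prints in interactive shells
(greet_if_interactive is a made-up name):

```shell
# Fragment for ~/.bashrc: print the greeting only in interactive shells.
# "$-" holds the current shell's option flags and contains "i" when the
# shell is interactive.
greet_if_interactive() {
  case $1 in
    *i*) echo "logged in to $(uname -n)" ;;
  esac
}
greet_if_interactive "$-"
```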

I'm thinking that either my ssh is misbehaving somehow or there is an
issue with having two network connections in each node (I haven't
unplugged the internet connection from my slave node yet, and my master
node will always have an internet connection in addition to the
gigabit cluster network).
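
If the second network turns out to matter, Open MPI's TCP transport can
be restricted to one interface with the btl_tcp_if_include MCA parameter.
The sketch below only prints the command so it can be inspected first;
mpirun_on_if is a hypothetical helper and eth1 is an assumed name for the
cluster-side interface.

```shell
# Print an mpirun command line that limits MPI TCP traffic to one
# network interface. Remove the leading "echo" to actually run it.
mpirun_on_if() {
  ifname=$1
  shift
  echo mpirun --mca btl_tcp_if_include "$ifname" "$@"
}

# eth1 is an assumption; use the interface on the cluster network.
mpirun_on_if eth1 --hostfile myhostfile -np 4 hostname
```

This parameter covers MPI traffic only. The out-of-band channel used
during launch has its own interface settings, whose names differ between
Open MPI versions, so ompi_info is the place to check what your install
supports.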

I hope this information helps in troubleshooting my system.

Thanks!

Mark Kosmowski
