Hi Lucho,

I am provisioning with perceus, and in order to get static node
addresses I have entries in /etc/hosts that define them, e.g.:

10.10.0.10      n0000
10.10.0.11      n0001
10.10.0.12      n0002

My /etc/nsswitch.conf is set to resolve hosts like this:

hosts:      files dns

One thing I have noticed is that the nodes do not have their own
hostname defined after provisioning.  Could this be the problem?
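
In case it is useful, here is a quick check I can run on a node (just a
sketch; the file name check_hostname.c is my own): it prints what
gethostname() returns and what that name resolves to through nsswitch,
which seems to be roughly the information mpich puts into its business
card.  If the hostname comes back empty or as "(none)", or the lookup
fails, that would fit the symptoms.

/* check_hostname.c - illustrative diagnostic, not part of mpich/mvapich.
 * Prints this node's hostname and the IPv4 address it resolves to. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    char name[256];
    char addr[INET_ADDRSTRLEN];
    struct addrinfo hints, *res;

    if (gethostname(name, sizeof(name)) != 0) {
        perror("gethostname");
        return 1;
    }
    printf("hostname: %s\n", name);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;      /* IPv4, to match the 10.10.0.x entries */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(name, NULL, &hints, &res) != 0) {
        printf("could not resolve \"%s\"\n", name);
        return 1;
    }
    inet_ntop(AF_INET, &((struct sockaddr_in *)res->ai_addr)->sin_addr,
              addr, sizeof(addr));
    printf("resolves to: %s\n", addr);
    freeaddrinfo(res);
    return 0;
}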

Thanks,
Daniel
On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
>
>  Hi,
>
>  It looks like the MPI processes on the nodes don't send a correct IP
> address for the other processes to connect to. In your case, they send:
>
>
> >        -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
> > value=port#38675$description#(none)$
> >
>
>
>  And when I run it, I see:
>
>         -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
> value=port#34283$description#m10$ifname#192.168.1.110$
>
>  I tried to figure out how mpich picks the IP address, and it looks like
> it uses the hostname on the node for that. Do you have the node names set
> up correctly?
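>
>  For reference, the business card is just a $-separated list of key#value
> fields; here is a rough sketch of pulling one apart (not mpich's actual
> code, just an illustration using the string from your trace), so you can
> see which fields the other ranks would read:
>
> /* parse_card.c - toy parser for a PMI business card string, e.g.
>  *   port#38675$description#(none)$
>  * Illustration only, not mpich source. */
> #include <stdio.h>
> #include <string.h>
>
> int main(void)
> {
>     char card[] = "port#38675$description#(none)$";
>     char *field = strtok(card, "$");
>
>     while (field != NULL) {
>         char *sep = strchr(field, '#');
>         if (sep != NULL) {
>             *sep = '\0';
>             printf("%-12s = %s\n", field, sep + 1);
>         }
>         field = strtok(NULL, "$");
>     }
>     return 0;
> }
>
>  In your trace the description field is "(none)", while on my nodes it
> carries the hostname and an ifname with the IP address.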
>
>  Thanks,
>         Lucho
>
>  On Nov 4, 2008, at 1:31 PM, Daniel Gruner wrote:
>
>
> >
> > Hi Lucho,
> >
> > Did you have a chance to look at this?  Needless to say it has been
> > quite frustrating, and perhaps it has to do with the particular Linux
> > distribution you run.  I am running on a RHEL5.2 system with kernel
> > 2.6.26, and the compilation of mpich2 or mvapich2 is totally vanilla.
> > My network is just GigE.  xmvapich works for a single process, but it
> > always hangs for more than one, regardless of whether they are on the
> > same node or separate nodes, and independently of the example program
> > (hellow, cpi, etc).  Other than some administration issues (like the
> > authentication stuff I have been discussing with Abhishek), this
> > is the only real obstacle to making my clusters suitable for
> > production...
> >
> > Thanks,
> > Daniel
> >
> > ---------- Forwarded message ----------
> > From: Daniel Gruner <[EMAIL PROTECTED]>
> > Date: Oct 8, 2008 2:49 PM
> > Subject: Re: [xcpu] Re: (s)xcpu and MPI
> > To: [email protected]
> >
> >
> > Hi Lucho,
> >
> > Here is the output (two nodes in the cluster):
> >
> > [EMAIL PROTECTED] examples]# xmvapich -D -a ./hellow
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=2
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 1: cmd=initack pmiid=1
> > <-pmi- 1: cmd=initack rc=0
> > <-pmi- 1: cmd=set rc=0 size=2
> > <-pmi- 1: cmd=set rc=0 rank=1
> > <-pmi- 1: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 1: cmd=response_to_init rc=0
> > -pmi-> 1: cmd=get_maxes
> > <-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 1: cmd=get_appnum
> > <-pmi- 1: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
> > value=port#38675$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=barrier_in
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
> > value=port#38697$description#(none)$
> > <-pmi- 1: cmd=put_result rc=0
> > -pmi-> 1: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > <-pmi- 1: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
> > <-pmi- 0: cmd=get_result rc=0
> > value=port#38697$description#(none)$
> > -pmi-> 1: cmd=get kvsname=kvs_0 key=P0-businesscard
> > <-pmi- 1: cmd=get_result rc=0
> > value=port#38675$description#(none)$
> >
> > Hello world from process 1 of 2
> > Hello world from process 0 of 2
> >
> >
> > It looks like it ran, but then it hung and never returned.
> >
> > If I try to run another example (cpi), here is the output from the run
> > with a single process, and then with two:
> >
> > [EMAIL PROTECTED] examples]# xmvapich n0001 ./cpi
> > Process 0 of 1 is on (none)
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000313
> > [EMAIL PROTECTED] examples]# xmvapich -D n0001 ./cpi
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=1
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
> > value=port#48513$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 0: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=finalize
> > <-pmi- 0: cmd=finalize_ack rc=0
> > Process 0 of 1 is on (none)
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000332
> > [EMAIL PROTECTED] examples]
> >
> > normal termination.
> >
> > [EMAIL PROTECTED] examples]# xmvapich -D n0000,n0001 ./cpi
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=2
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=initack pmiid=1
> > <-pmi- 1: cmd=initack rc=0
> > <-pmi- 1: cmd=set rc=0 size=2
> > <-pmi- 1: cmd=set rc=0 rank=1
> > <-pmi- 1: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 1: cmd=response_to_init rc=0
> > -pmi-> 1: cmd=get_maxes
> > <-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
> > value=port#45645$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 1: cmd=get_appnum
> > <-pmi- 1: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=barrier_in
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
> > value=port#53467$description#(none)$
> > <-pmi- 1: cmd=put_result rc=0
> > -pmi-> 1: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > <-pmi- 1: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
> > <-pmi- 0: cmd=get_result rc=0
> > value=port#53467$description#(none)$
> > Process 0 of 2 is on (none)
> > Process 1 of 2 is on (none)
> >
> > hung processes....
> >
> >
> > Daniel
> >
> >
> > On Wed, Oct 8, 2008 at 3:23 PM, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > I can't replicate it; it is working fine here :(
> > > Can you please try xmvapich again with the -D option and cut&paste
> > > the output?
> > >
> > > Thanks,
> > >      Lucho
> > >
> > > On Oct 6, 2008, at 2:51 PM, Daniel Gruner wrote:
> > >
> > >
> > > >
> > > > I just compiled mpich2-1.1.0a1, and tested it, with the same result as
> > > > with mvapich.  Again I had to run configure with
> > > > --with-device=ch3:sock, since otherwise the runtime complains that it
> > > > can't allocate shared memory or some such thing.  When I run a single
> > > > process using xmvapich it completes fine.  However when running two or
> > > > more it hangs.  This is not surprising as it should be the same as
> > > > mvapich when running over regular TCP/IP on GigE rather than a special
> > > > interconnect.
> > > >
> > > > [EMAIL PROTECTED] examples]# ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich -a ./hellow
> > > > Hello world from process 1 of 2
> > > > Hello world from process 0 of 2
> > > > ^C
> > > > [EMAIL PROTECTED] examples]# xmvapich n0000 ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich n0001 ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich n0000,n0001 ./hellow
> > > > Hello world from process 1 of 2
> > > > Hello world from process 0 of 2
> > > > ^C
> > > >
> > > > Daniel
> > > >
> > > >
> > > >
> > > > On 10/6/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
> > > >
> > > > >
> > > > > I just compiled mpich2-1.1.0a1 and tried running hellow, everything looks
> > > > > fine:
> > > > >
> > > > > $ xmvapich m1,m2
> > > > > ~/work/mpich2-1.1.0a1/build/examples/hellow
> > > > > Hello world from process 0 of 2
> > > > > Hello world from process 1 of 2
> > > > > $
> > > > >
> > > > > I didn't set any special parameters when compiling, just ./configure.
> > > > >
> > > > > Thanks,
> > > > >     Lucho
> > > > >
> > > > >
> > > > > On Oct 3, 2008, at 9:05 AM, Daniel Gruner wrote:
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > Well, I just did the same, but with NO success...  The processes are
> > > > > > apparently started and run at the beginning, but then they hang and do
> > > > > > not finalize.  For example, running the "hellow" example from the
> > > > > > mvapich2 distribution:
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# cat hellow.c
> > > > > > /* -*- Mode: C; c-basic-offset:4 ; -*- */
> > > > > > /*
> > > > > > *  (C) 2001 by Argonne National Laboratory.
> > > > > > *      See COPYRIGHT in top-level directory.
> > > > > > */
> > > > > >
> > > > > > #include <stdio.h>
> > > > > > #include "mpi.h"
> > > > > >
> > > > > > int main( int argc, char *argv[] )
> > > > > > {
> > > > > >     int rank;
> > > > > >     int size;
> > > > > >
> > > > > >     MPI_Init( 0, 0 );
> > > > > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > > > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > > > > >     printf( "Hello world from process %d of %d\n", rank, size );
> > > > > >     MPI_Finalize();
> > > > > >     return 0;
> > > > > > }
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# make hellow
> > > > > > ../bin/mpicc  -I../src/include -I../src/include   -c hellow.c
> > > > > > ../bin/mpicc   -o hellow hellow.o
> > > > > > [EMAIL PROTECTED] examples]# ./hellow
> > > > > > Hello world from process 0 of 1
> > > > > >
> > > > > > (this was fine, just running on the master).  Running on the two nodes
> > > > > > requires that the xmvapich process be killed (ctrl-C):
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# xmvapich -ap ./hellow
> > > > > > n0000: Hello world from process 0 of 2
> > > > > > n0001: Hello world from process 1 of 2
> > > > > > [EMAIL PROTECTED] examples]#
> > > > > >
> > > > > > I have tried other codes, both in C and Fortran, with the same
> > > > > > behaviour.  I don't know if the issue is with xmvapich or with
> > > > > > mvapich2.  Communication is just GigE.
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > >
> > > > > > On 9/30/08, Abhishek Kulkarni <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Just gave this a quick try, and xmvapich seems to run MPI apps compiled
> > > > > > > with mpich2 without any issues.
> > > > > > >
> > > > > > > $ xmvapich -a ./mpihello
> > > > > > > blender: Hello World from process 0 of 1
> > > > > > > eregion: Hello World from process 0 of 1
> > > > > > >
> > > > > > > Hope that helps,
> > > > > > >
> > > > > > >
> > > > > > > -- Abhishek
> > > > > > >
> > > > > > >
> > > > > > > On Tue, 2008-09-30 at 17:02 +0200, Stefan Boresch wrote:
> > > > > > >
> > > > > > >
> > > > > > > > Thanks for the quick reply!
> > > > > > > >
> > > > > > > > On Tue, Sep 30, 2008 at 07:34:37AM -0700, ron minnich wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Sep 30, 2008 at 1:57 AM, stefan <[EMAIL PROTECTED]> wrote:
> > > > > > > > >
> > > > > > > > > > the state of xcpu support with MPI libraries -- either of the
> > > > > > > > > > common free ones is fine (e.g., openmpi, mpich2)
> > > > > > > > >
> > > > > > > > > there is now support for mpich2. openmpi is not supported as
> > > > > > > > > openmpi is (once again) in flux. it has been supported numerous
> > > > > > > > > times and has changed out from under us numerous times. I no
> > > > > > > > > longer use openmpi if I have a working mvapich or mpich available.
> > > > > > > >
> > > > > > > > I am slightly confused. I guess I had inferred the openmpi issues
> > > > > > > > from the various mailing lists. But I just looked at the latest
> > > > > > > > mpich2 prerelease and found no mentioning of (s)xcpu(2). I thought
> > > > > > > > that some patches/support on the side of the mpi library are
> > > > > > > > necessary (as, e.g., openmpi provides for bproc ...)  Or am I
> > > > > > > > completely misunderstanding something here, and this is somehow
> > > > > > > > handled by xcpu itself ...
> > > > > > > > I guess there is some difference between
> > > > > > > >
> > > > > > > > xrx 192.168.19.2 /bin/date
> > > > > > > >
> > > > > > > > and
> > > > > > > >
> > > > > > > > xrx 192.168.19.2 <pathto>/mpiexec ...
> > > > > > > >
> > > > > > > > and the latter seems too magic to me to run out of the box (it
> > > > > > > > sure would be nice though ...)
> > > > > > > >
> > > > > > > > Sorry for making myself a nuisance -- thanks,
> > > > > > > >
> > > > > > > > Stefan Boresch
