Hi Lucho,

I am provisioning with Perceus, and in order to get static node addresses
I have entries in /etc/hosts that define them, e.g.:

10.10.0.10   n0000
10.10.0.11   n0001
10.10.0.12   n0002

My /etc/nsswitch.conf is set to resolve hosts like:

hosts: files dns

One thing I have noticed is that the nodes do not have their own hostname
defined after provisioning.  Could this be the problem?
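To convince myself, I put together a small check along the lines of what you
describe below -- this is only a sketch of what I assume mpich does
(gethostname() followed by a name lookup to pick the address it publishes),
not the actual mpich code:

/* check_hostname.c - sketch of how I assume mpich derives the business
 * card address: gethostname() and then a name lookup.  Not mpich code. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    char host[256], addr[INET_ADDRSTRLEN];
    struct addrinfo hints, *res;

    if (gethostname(host, sizeof(host)) != 0) {
        perror("gethostname");
        return 1;
    }
    printf("hostname: %s\n", host);    /* prints "(none)" on my nodes */

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;         /* IPv4, like the 10.10.0.x addresses */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, NULL, &hints, &res) != 0) {
        printf("cannot resolve '%s' to an IP address\n", host);
        return 1;
    }
    inet_ntop(AF_INET, &((struct sockaddr_in *)res->ai_addr)->sin_addr,
              addr, sizeof(addr));
    printf("address: %s\n", addr);     /* what should end up in ifname */
    freeaddrinfo(res);
    return 0;
}

On a node whose hostname is "(none)" the lookup fails, which would match the
empty description and missing ifname in the business card below.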
Thanks,
Daniel

On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> It looks like the MPI processes on the nodes don't send a correct IP
> address to connect to.  In your case, they send:
>
> -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#38675$description#(none)$
>
> And when I run it, I see:
>
> -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#34283$description#m10$ifname#192.168.1.110$
>
> I tried to figure out how does mpich pick the IP address, and it looks like
> it uses the hostname on the node for that. Do you have the node names setup
> correctly?
>
> Thanks,
> Lucho
>
> On Nov 4, 2008, at 1:31 PM, Daniel Gruner wrote:
>
> > Hi Lucho,
> >
> > Did you have a chance to look at this?  Needless to say it has been
> > quite frustrating, and perhaps it has to do with the particular Linux
> > distribution you run.  I am running on a RHEL5.2 system with kernel
> > 2.6.26, and the compilation of mpich2 or mvapich2 is totally vanilla.
> > My network is just GigE.  xmvapich works for a single process, but it
> > always hangs for more than one, regardless of whether they are on the
> > same node or separate nodes, and independently of the example program
> > (hellow, cpi, etc).  Other than some administration issues (like the
> > authentication stuff I have been exchanging with Abhishek about), this
> > is the only real obstacle to making my clusters suitable for
> > production...
> >
> > Thanks,
> > Daniel
> >
> > ---------- Forwarded message ----------
> > From: Daniel Gruner <[EMAIL PROTECTED]>
> > Date: Oct 8, 2008 2:49 PM
> > Subject: Re: [xcpu] Re: (s)xcpu and MPI
> > To: [email protected]
> >
> > Hi Lucho,
> >
> > Here is the output (two nodes in the cluster):
> >
> > [EMAIL PROTECTED] examples]# xmvapich -D -a ./hellow
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=2
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 1: cmd=initack pmiid=1
> > <-pmi- 1: cmd=initack rc=0
> > <-pmi- 1: cmd=set rc=0 size=2
> > <-pmi- 1: cmd=set rc=0 rank=1
> > <-pmi- 1: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 1: cmd=response_to_init rc=0
> > -pmi-> 1: cmd=get_maxes
> > <-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 1: cmd=get_appnum
> > <-pmi- 1: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#38675$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=barrier_in
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#38697$description#(none)$
> > <-pmi- 1: cmd=put_result rc=0
> > -pmi-> 1: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > <-pmi- 1: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
> > <-pmi- 0: cmd=get_result rc=0 value=port#38697$description#(none)$
> > -pmi-> 1: cmd=get kvsname=kvs_0 key=P0-businesscard
> > <-pmi- 1: cmd=get_result rc=0 value=port#38675$description#(none)$
> >
> > Hello world from process 1 of 2
> > Hello world from process 0 of 2
> >
> > It looks like it ran, but then it hangs and never returns.
> >
> > If I try to run another example (cpi), here is the output from the run
> > with a single process, and then with two:
> >
> > [EMAIL PROTECTED] examples]# xmvapich n0001 ./cpi
> > Process 0 of 1 is on (none)
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000313
> > [EMAIL PROTECTED] examples]# xmvapich -D n0001 ./cpi
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=1
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#48513$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 0: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=finalize
> > <-pmi- 0: cmd=finalize_ack rc=0
> > Process 0 of 1 is on (none)
> > pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > wall clock time = 0.000332
> > [EMAIL PROTECTED] examples]
> >
> > normal termination.
> >
> > [EMAIL PROTECTED] examples]# xmvapich -D n0000,n0001 ./cpi
> > -pmi-> 0: cmd=initack pmiid=1
> > <-pmi- 0: cmd=initack rc=0
> > <-pmi- 0: cmd=set rc=0 size=2
> > <-pmi- 0: cmd=set rc=0 rank=0
> > <-pmi- 0: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 0: cmd=response_to_init rc=0
> > -pmi-> 0: cmd=get_maxes
> > <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=get_appnum
> > <-pmi- 0: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=initack pmiid=1
> > <-pmi- 1: cmd=initack rc=0
> > <-pmi- 1: cmd=set rc=0 size=2
> > <-pmi- 1: cmd=set rc=0 rank=1
> > <-pmi- 1: cmd=set rc=0 debug=0
> > -pmi-> 0: cmd=get_my_kvsname
> > <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
> > <-pmi- 1: cmd=response_to_init rc=0
> > -pmi-> 1: cmd=get_maxes
> > <-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
> > -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#45645$description#(none)$
> > <-pmi- 0: cmd=put_result rc=0
> > -pmi-> 1: cmd=get_appnum
> > <-pmi- 1: cmd=appnum rc=0 appnum=0
> > -pmi-> 0: cmd=barrier_in
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=get_my_kvsname
> > <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
> > -pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#53467$description#(none)$
> > <-pmi- 1: cmd=put_result rc=0
> > -pmi-> 1: cmd=barrier_in
> > <-pmi- 0: cmd=barrier_out rc=0
> > <-pmi- 1: cmd=barrier_out rc=0
> > -pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
> > <-pmi- 0: cmd=get_result rc=0 value=port#53467$description#(none)$
> > Process 0 of 2 is on (none)
> > Process 1 of 2 is on (none)
> >
> > hung processes....
> >
> > Daniel
> >
> > On Wed, Oct 8, 2008 at 3:23 PM, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
> > >
> > > I can't replicate it, it is working fine here :(
> > > Can you please try xmvapich again with -D option and cut&paste the output?
> > >
> > > Thanks,
> > > Lucho
> > >
> > > On Oct 6, 2008, at 2:51 PM, Daniel Gruner wrote:
> > > >
> > > > I just compiled mpich2-1.1.0a1, and tested it, with the same result as
> > > > with mvapich.  Again I had to do the configure with
> > > > --with-device=ch3:sock, since otherwise the runtime complains that it
> > > > can't allocate shared memory or some such thing.  When I run a single
> > > > process using xmvapich it completes fine.  However when running two or
> > > > more it hangs.  This is not surprising as it should be the same as
> > > > mvapich when running over regular TCP/IP on GigE rather than a special
> > > > interconnect.
> > > >
> > > > [EMAIL PROTECTED] examples]# ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich -a ./hellow
> > > > Hello world from process 1 of 2
> > > > Hello world from process 0 of 2
> > > > ^C
> > > > [EMAIL PROTECTED] examples]# xmvapich n0000 ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich n0001 ./hellow
> > > > Hello world from process 0 of 1
> > > > [EMAIL PROTECTED] examples]# xmvapich n0000,n0001 ./hellow
> > > > Hello world from process 1 of 2
> > > > Hello world from process 0 of 2
> > > > ^C
> > > >
> > > > Daniel
> > > >
> > > > On 10/6/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > I just compiled mpich2-1.1.0a1 and tried running hellow, everything
> > > > > looks fine:
> > > > >
> > > > > $ xmvapich m1,m2 ~/work/mpich2-1.1.0a1/build/examples/hellow
> > > > > Hello world from process 0 of 2
> > > > > Hello world from process 1 of 2
> > > > > $
> > > > >
> > > > > I didn't set any special parameters when compiling, just ./configure.
> > > > >
> > > > > Thanks,
> > > > > Lucho
> > > > >
> > > > > On Oct 3, 2008, at 9:05 AM, Daniel Gruner wrote:
> > > > > >
> > > > > > Well, I just did the same, but with NO success...  The processes are
> > > > > > apparently started, run at the beginning, but then they hang and do
> > > > > > not finalize.  For example, running the "hellow" example from the
> > > > > > mvapich2 distribution:
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# cat hellow.c
> > > > > > /* -*- Mode: C; c-basic-offset:4 ; -*- */
> > > > > > /*
> > > > > >  * (C) 2001 by Argonne National Laboratory.
> > > > > >  * See COPYRIGHT in top-level directory.
> > > > > >  */
> > > > > >
> > > > > > #include <stdio.h>
> > > > > > #include "mpi.h"
> > > > > >
> > > > > > int main( int argc, char *argv[] )
> > > > > > {
> > > > > >     int rank;
> > > > > >     int size;
> > > > > >
> > > > > >     MPI_Init( 0, 0 );
> > > > > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > > > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > > > > >     printf( "Hello world from process %d of %d\n", rank, size );
> > > > > >     MPI_Finalize();
> > > > > >     return 0;
> > > > > > }
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# make hellow
> > > > > > ../bin/mpicc -I../src/include -I../src/include -c hellow.c
> > > > > > ../bin/mpicc -o hellow hellow.o
> > > > > > [EMAIL PROTECTED] examples]# ./hellow
> > > > > > Hello world from process 0 of 1
> > > > > >
> > > > > > (this was fine, just running on the master).  Running on the two nodes
> > > > > > requires that the xmvapich process be killed (ctrl-C):
> > > > > >
> > > > > > [EMAIL PROTECTED] examples]# xmvapich -ap ./hellow
> > > > > > n0000: Hello world from process 0 of 2
> > > > > > n0001: Hello world from process 1 of 2
> > > > > > [EMAIL PROTECTED] examples]#
> > > > > >
> > > > > > I have tried other codes, both in C and Fortran, with the same
> > > > > > behaviour.  I don't know if the issue is with xmvapich or with
> > > > > > mvapich2.  Communication is just GigE.
> > > > > >
> > > > > > Daniel
> > > > > >
> > > > > > On 9/30/08, Abhishek Kulkarni <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > Just gave this a quick try, and xmvapich seems to run MPI apps compiled
> > > > > > > with mpich2 without any issues.
> > > > > > >
> > > > > > > $ xmvapich -a ./mpihello
> > > > > > > blender: Hello World from process 0 of 1
> > > > > > > eregion: Hello World from process 0 of 1
> > > > > > >
> > > > > > > Hope that helps,
> > > > > > >
> > > > > > > -- Abhishek
> > > > > > >
> > > > > > > On Tue, 2008-09-30 at 17:02 +0200, Stefan Boresch wrote:
> > > > > > > >
> > > > > > > > Thanks for the quick reply!
> > > > > > > >
> > > > > > > > On Tue, Sep 30, 2008 at 07:34:37AM -0700, ron minnich wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 30, 2008 at 1:57 AM, stefan <[EMAIL PROTECTED]> wrote:
> > > > > > > > >
> > > > > > > > > > the state of xcpu support with MPI libraries -- either of the
> > > > > > > > > > common free ones is fine (e.g., openmpi, mpich2)
> > > > > > > > >
> > > > > > > > > there is now support for mpich2. openmpi is not supported as openmpi
> > > > > > > > > is (once again) in flux. it has been supported numerous times and has
> > > > > > > > > changed out from under us numerous times. I no longer use openmpi if I
> > > > > > > > > have a working mvapich or mpich available.
> > > > > > > >
> > > > > > > > I am slightly confused.  I guess I had inferred the openmpi issues from
> > > > > > > > the various mailing lists.  But I just looked at the latest mpich2
> > > > > > > > prerelease and found no mentioning of (s)xcpu(2).  I thought that some
> > > > > > > > patches/support on the side of the mpi library are necessary (as, e.g.,
> > > > > > > > openmpi provides for bproc ...)  Or am I completely misunderstanding
> > > > > > > > something here, and this is somehow handled by xcpu itself ...
> > > > > > > >
> > > > > > > > I guess there is some difference between
> > > > > > > >
> > > > > > > > xrx 192.168.19.2 /bin/date
> > > > > > > >
> > > > > > > > and
> > > > > > > >
> > > > > > > > xrx 192.168.19.2 <pathto>/mpiexec ...
> > > > > > > >
> > > > > > > > and the latter seems too magic to me to run out of the box (it sure
> > > > > > > > would be nice though ...)
> > > > > > > >
> > > > > > > > Sorry for making myself a nuisance -- thanks,
> > > > > > > >
> > > > > > > > Stefan Boresch
