I don't think there is anything different when running on the same node, at least nothing in xmvapich. I am running a manually built diskless cluster, no Perceus or anything like that. The kernel is 2.6.23.9 and mpich2 is 1.1.0a1. I am not sure I can make much use of it, but can you attach to each of the processes with gdb and send me the backtraces? You can try running it on the head node; I guess it would be harder to use gdb on the compute node.
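
For example, something along these lines on the node where the ranks are stuck (a sketch; the pgrep pattern is an assumption, adjust it to your binary and repeat for each rank):

   # dump all thread backtraces from one hung rank
   pid=$(pgrep -f ./cpi | head -1)
   gdb -p $pid -batch -ex "thread apply all bt"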

Thanks,
        Lucho

On Nov 5, 2008, at 2:51 PM, Daniel Gruner wrote:


What is different when the processes run on the same node from when
they run on separate nodes?  Also, what is your OS version?  Any other
suggestions on how I could help debug this?

Daniel

On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:

The get_result lines are OK; for some reason the processes don't send
"finalize". I can't reproduce it :(


On Nov 5, 2008, at 12:42 PM, Daniel Gruner wrote:



In my case it looks the same, except for the cmd=finalize stuff:

-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#50475$description#n0000$ifname#10.10.0.10$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
value=port#44398$description#n0000$ifname#10.10.0.10$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0
value=port#44398$description#n0000$ifname#10.10.0.10$
-pmi-> 1: cmd=get kvsname=kvs_0 key=P0-businesscard
<-pmi- 1: cmd=get_result rc=0
value=port#50475$description#n0000$ifname#10.10.0.10$
Hello world from process 1 of 2
Hello world from process 0 of 2
[EMAIL PROTECTED] examples]#

and here it hangs...  Actually, there are two extra cmd=get_result
rc=0 lines... ???



On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:


Strange. This is what I see from cmd=barrier_out to the end of
execution:

<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 1: cmd=get kvsname=kvs_0 key=P0-businesscard
<-pmi- 1: cmd=get_result rc=0
value=port#58977$description#m10$ifname#192.168.1.110$
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0
value=port#53028$description#m10$ifname#192.168.1.110$
-pmi-> 1: cmd=finalize
<-pmi- 1: cmd=finalize_ack rc=0
-pmi-> 0: cmd=finalize
<-pmi- 0: cmd=finalize_ack rc=0
Hello world from process 1 of 2
Hello world from process 0 of 2


On Nov 5, 2008, at 12:17 PM, Daniel Gruner wrote:




Ok, some progress.  I am now able to run things like:

[EMAIL PROTECTED] examples]# xmvapich n0000,n0001 ./cpi
Process 0 of 2 is on n0000
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000757
Process 1 of 2 is on n0001

as long as the two nodes specified are different. If, however, I want
to run two processes on the same node, e.g.:

[EMAIL PROTECTED] examples]# xmvapich n0000,n0000 ./cpi
Process 1 of 2 is on n0000
Process 0 of 2 is on n0000

It hangs as before.  Here is the debugging trace:

[EMAIL PROTECTED] examples]# xmvapich -D n0000,n0000 ./cpi
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=2
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 1: cmd=initack pmiid=1
<-pmi- 1: cmd=initack rc=0
<-pmi- 1: cmd=set rc=0 size=2
<-pmi- 1: cmd=set rc=0 rank=1
<-pmi- 1: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 1: cmd=response_to_init rc=0
-pmi-> 1: cmd=get_maxes
<-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#45956$description#n0000$ifname#10.10.0.10$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
value=port#38363$description#n0000$ifname#10.10.0.10$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0
value=port#38363$description#n0000$ifname#10.10.0.10$
Process 1 of 2 is on n0000
Process 0 of 2 is on n0000
[EMAIL PROTECTED] examples]#



On 11/5/08, Daniel Gruner <[EMAIL PROTECTED]> wrote:


That is what I was going for...  It returns (none).

I am about to run the test after explicitly setting up the hostnames
of the nodes.
Does xmvapich probe the nodes for their names? How does it resolve
their addresses?
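
For what it is worth, this is what I am checking on my side (a sketch; getent should follow the same files/dns order as my nsswitch.conf below):

   getent hosts n0000    # what the head node resolves the name to
   xrx n0000 hostname    # what the node itself reports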


Daniel

On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:



I guess that is the problem. What do you see if you do:

   xrx n0000 hostname

Thanks,
   Lucho


On Nov 5, 2008, at 12:02 PM, Daniel Gruner wrote:





Hi Lucho,

I am provisioning with Perceus, and in order to get static node
addresses I have entries in /etc/hosts that define them, e.g.:

10.10.0.10      n0000
10.10.0.11      n0001
10.10.0.12      n0002

My /etc/nsswitch.conf is set to resolve hosts like:

hosts:      files dns

One thing I have noticed is that the nodes do not have their own
hostname defined after provisioning. Could this be the problem?
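
If it is, setting them by hand should be enough to test (a sketch; I am assuming xrx passes arguments through to the command on the node):

   xrx n0000 hostname n0000
   xrx n0001 hostname n0001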

Thanks,
Daniel
On 11/5/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:




Hi,

It looks like the MPI processes on the nodes don't send a correct IP
address to connect to. In your case, they send:

-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#38675$description#(none)$

And when I run it, I see:

-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#34283$description#m10$ifname#192.168.1.110$

I tried to figure out how mpich picks the IP address, and it looks
like it uses the hostname on the node for that. Do you have the node
names set up correctly?
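
A quick way to see what will end up in the business card is (a sketch; I am assuming getent follows the same lookup order mpich uses):

   hostname                    # becomes the description#...$ field
   getent hosts $(hostname)    # the address advertised as ifname#...$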

Thanks,
 Lucho

On Nov 4, 2008, at 1:31 PM, Daniel Gruner wrote:






Hi Lucho,

Did you have a chance to look at this?  Needless to say it has been
quite frustrating, and perhaps it has to do with the particular Linux
distribution you run.  I am running on a RHEL5.2 system with kernel
2.6.26, and the compilation of mpich2 or mvapich2 is totally vanilla.
My network is just GigE.  xmvapich works for a single process, but it
always hangs for more than one, regardless of whether they are on the
same node or separate nodes, and independently of the example program
(hellow, cpi, etc).  Other than some administration issues (like the
authentication stuff I have been exchanging with Abhishek about), this
is the only real obstacle to making my clusters suitable for
production...

Thanks,
Daniel

---------- Forwarded message ----------
From: Daniel Gruner <[EMAIL PROTECTED]>
Date: Oct 8, 2008 2:49 PM
Subject: Re: [xcpu] Re: (s)xcpu and MPI
To: [email protected]


Hi Lucho,

Here is the output (two nodes in the cluster):

[EMAIL PROTECTED] examples]# xmvapich -D -a ./hellow
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=2
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 1: cmd=initack pmiid=1
<-pmi- 1: cmd=initack rc=0
<-pmi- 1: cmd=set rc=0 size=2
<-pmi- 1: cmd=set rc=0 rank=1
<-pmi- 1: cmd=set rc=0 debug=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 1: cmd=response_to_init rc=0
-pmi-> 1: cmd=get_maxes
<-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#38675$description#(none)$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
value=port#38697$description#(none)$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0
value=port#38697$description#(none)$
-pmi-> 1: cmd=get kvsname=kvs_0 key=P0-businesscard
<-pmi- 1: cmd=get_result rc=0
value=port#38675$description#(none)$
Hello world from process 1 of 2
Hello world from process 0 of 2


It looks like it ran, but then it hangs and never returns.

If I try to run another example (cpi), here is the output from the
run with a single process, and then with two:

[EMAIL PROTECTED] examples]# xmvapich n0001 ./cpi
Process 0 of 1 is on (none)
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000313
[EMAIL PROTECTED] examples]# xmvapich -D n0001 ./cpi
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=1
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#48513$description#(none)$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 0: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
-pmi-> 0: cmd=finalize
<-pmi- 0: cmd=finalize_ack rc=0
Process 0 of 1 is on (none)
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000332
[EMAIL PROTECTED] examples]#

normal termination.

[EMAIL PROTECTED] examples]# xmvapich -D n0000,n0001 ./cpi
-pmi-> 0: cmd=initack pmiid=1
<-pmi- 0: cmd=initack rc=0
<-pmi- 0: cmd=set rc=0 size=2
<-pmi- 0: cmd=set rc=0 rank=0
<-pmi- 0: cmd=set rc=0 debug=0
-pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 0: cmd=response_to_init rc=0
-pmi-> 0: cmd=get_maxes
<-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=get_appnum
<-pmi- 0: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=initack pmiid=1
<-pmi- 1: cmd=initack rc=0
<-pmi- 1: cmd=set rc=0 size=2
<-pmi- 1: cmd=set rc=0 rank=1
<-pmi- 1: cmd=set rc=0 debug=0
-pmi-> 0: cmd=get_my_kvsname
<-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
<-pmi- 1: cmd=response_to_init rc=0
-pmi-> 1: cmd=get_maxes
<-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard
value=port#45645$description#(none)$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=get_appnum
<-pmi- 1: cmd=appnum rc=0 appnum=0
-pmi-> 0: cmd=barrier_in
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=get_my_kvsname
<-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard
value=port#53467$description#(none)$
<-pmi- 1: cmd=put_result rc=0
-pmi-> 1: cmd=barrier_in
<-pmi- 0: cmd=barrier_out rc=0
<-pmi- 1: cmd=barrier_out rc=0
-pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
<-pmi- 0: cmd=get_result rc=0
value=port#53467$description#(none)$
Process 0 of 2 is on (none)
Process 1 of 2 is on (none)

hung processes....


Daniel


On Wed, Oct 8, 2008 at 3:23 PM, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:


I can't replicate it, it is working fine here :(
Can you please try xmvapich again with -D option and cut&paste the
output?

Thanks,
Lucho

On Oct 6, 2008, at 2:51 PM, Daniel Gruner wrote:






I just compiled mpich2-1.1.0a1, and tested it, with the same result as
with mvapich.  Again I had to do the configure with
--with-device=ch3:sock, since otherwise the runtime complains that it
can't allocate shared memory or some such thing.  When I run a single
process using xmvapich it completes fine.  However when running two or
more it hangs.  This is not surprising as it should be the same as
mvapich when running over regular TCP/IP on GigE rather than a special
interconnect.
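
For reference, the configure step was essentially the following (a sketch; --prefix and compiler choices are whatever your build uses):

   ./configure --with-device=ch3:sock
   make && make install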

[EMAIL PROTECTED] examples]# ./hellow
Hello world from process 0 of 1
[EMAIL PROTECTED] examples]# xmvapich -a ./hellow
Hello world from process 1 of 2
Hello world from process 0 of 2
^C
[EMAIL PROTECTED] examples]# xmvapich n0000 ./hellow
Hello world from process 0 of 1
[EMAIL PROTECTED] examples]# xmvapich n0001 ./hellow
Hello world from process 0 of 1
[EMAIL PROTECTED] examples]# xmvapich n0000,n0001 ./hellow
Hello world from process 1 of 2
Hello world from process 0 of 2
^C

Daniel



On 10/6/08, Latchesar Ionkov <[EMAIL PROTECTED]> wrote:





I just compiled mpich2-1.1.0a1 and tried running hellow, everything
looks fine:

$ xmvapich m1,m2 ~/work/mpich2-1.1.0a1/build/examples/hellow
Hello world from process 0 of 2
Hello world from process 1 of 2
$

I didn't set any special parameters when compiling, just ./configure.

Thanks,
Lucho


On Oct 3, 2008, at 9:05 AM, Daniel Gruner wrote:







Well, I just did the same, but with NO success...  The processes are
apparently started, run at the beginning, but then they hang and do
not finalize.  For example, running the "hellow" example from the
mvapich2 distribution:

[EMAIL PROTECTED] examples]# cat hellow.c
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
*  (C) 2001 by Argonne National Laboratory.
*      See COPYRIGHT in top-level directory.
*/

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank;
    int size;

    MPI_Init( 0, 0 );
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf( "Hello world from process %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

[EMAIL PROTECTED] examples]# make hellow
../bin/mpicc  -I../src/include -I../src/include   -c hellow.c
../bin/mpicc   -o hellow hellow.o
[EMAIL PROTECTED] examples]# ./hellow
Hello world from process 0 of 1

(this was fine, just running on the master).  Running on the two
nodes requires that the xmvapich process be killed (ctrl-C):

[EMAIL PROTECTED] examples]# xmvapich -ap ./hellow
n0000: Hello world from process 0 of 2
n0001: Hello world from process 1 of 2
[EMAIL PROTECTED] examples]#

I have tried other codes, both in C and Fortran, with the same
behaviour.  I don't know if the issue is with xmvapich or with
mvapich2.  Communication is just GigE.

Daniel


On 9/30/08, Abhishek Kulkarni <[EMAIL PROTECTED]> wrote:

Just gave this a quick try, and xmvapich seems to run MPI apps
compiled with mpich2 without any issues.

$ xmvapich -a ./mpihello
blender: Hello World from process 0 of 1
eregion: Hello World from process 0 of 1

Hope that helps,


-- Abhishek


On Tue, 2008-09-30 at 17:02 +0200, Stefan Boresch wrote:

Thanks for the quick reply!

On Tue, Sep 30, 2008 at 07:34:37AM -0700, ron minnich wrote:

On Tue, Sep 30, 2008 at 1:57 AM, stefan <[EMAIL PROTECTED]> wrote:

the state of xcpu support with MPI libraries -- either of the common
free ones is fine (e.g., openmpi, mpich2)

there is now support for mpich2. openmpi is not supported as openmpi
is (once again) in flux. it has been supported numerous times and has
changed out from under us numerous times. I no longer use openmpi if I
have a working mvapich or mpich available.

I am slightly confused. I guess I had inferred the openmpi issues
from the various mailing lists. But I just looked at the latest mpich2
prerelease and found no mention of (s)xcpu(2). I thought that some
patches/support on the side of the mpi library are necessary (as,
e.g., openmpi provides for bproc ...)  Or am I completely
misunderstanding something here, and this is somehow handled by xcpu
itself ...

[Message clipped]
