Hi list,

there was a similar thread last year, w/o a solution posted, w/ subject: "(s)xcpu and MPI".

Daniel Gruner reported problems w/ running mpi-jobs (cpi) on multiple nodes.
Part of the problem was name resolution / hostnames, but the thread ended
with at least 2 open problems:

a) running a job twice on the same node did not work, i.e. it hung:
   "xmvapich node1,node1 ./cpi"
b) running a job on the headnode did not work either

My setup is as follows:

- 1 headnode (the headnode is part of the xcpu-cluster as "n00")
- 2 compute-nodes (n01 and n02)
- interconnect: ethernet
- c-nodes boot an initramfs via PXE
- mpich2-1.0.3

- hostnames are ok:

  xcpu-head01 examples # xrx -pa hostname
  n00: xcpu-head01.local
  n02: n02
  n01: n01

- basic mpi-jobs too:

  xcpu-head01 examples # xmvapich -a ./hellow
  Hello world from process 0 of 3
  Hello world from process 2 of 3
  Hello world from process 1 of 3

- more complex ones not:

  xcpu-head01 examples # xmvapich -a ./cpi
  Process 0 of 3 is on xcpu-head01.local
  Process 2 of 3 is on n02
  Process 1 of 3 is on n01
  (hang)
  ^C

- BUT running on only one node is ok:

  xcpu-head01 examples # xmvapich n01 ./cpi
  Process 0 of 1 is on n01
  pi is approximately 3.1415926544231341, Error is 0.0000000008333410
  wall clock time = 0.000258

- running 2 procs on the head works:

  xcpu-head01 examples # xmvapich n00,n00 ./cpi
  Process 0 of 2 is on xcpu-head01.local
  pi is approximately 3.1415926544231318, Error is 0.0000000008333387
  wall clock time = 0.001787
  Process 1 of 2 is on xcpu-head01.local

- while running 2 procs on the compute nodes (n01 and n02) hangs:

  xcpu-head01 examples # xmvapich -D n01,n02 ./cpi
  -pmi-> 0: cmd=initack pmiid=1
  <-pmi- 0: cmd=initack rc=0
  <-pmi- 0: cmd=set rc=0 size=2
  <-pmi- 0: cmd=set rc=0 rank=0
  <-pmi- 0: cmd=set rc=0 debug=0
  -pmi-> 0: cmd=init pmi_version=1 pmi_subversion=1
  <-pmi- 0: cmd=response_to_init rc=0
  -pmi-> 0: cmd=get_maxes
  <-pmi- 0: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
  -pmi-> 1: cmd=initack pmiid=1
  <-pmi- 1: cmd=initack rc=0
  <-pmi- 1: cmd=set rc=0 size=2
  <-pmi- 1: cmd=set rc=0 rank=1
  <-pmi- 1: cmd=set rc=0 debug=0
  -pmi-> 0: cmd=get_appnum
  <-pmi- 0: cmd=appnum rc=0 appnum=0
  -pmi-> 0: cmd=get_my_kvsname
  <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
  -pmi-> 1: cmd=init pmi_version=1 pmi_subversion=1
  <-pmi- 1: cmd=response_to_init rc=0
  -pmi-> 1: cmd=get_maxes
  <-pmi- 1: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=64
  -pmi-> 0: cmd=get_my_kvsname
  <-pmi- 0: cmd=my_kvsname rc=0 kvsname=kvs_0
  -pmi-> 1: cmd=get_appnum
  <-pmi- 1: cmd=appnum rc=0 appnum=0
  -pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#39217$description#n01$
  <-pmi- 0: cmd=put_result rc=0
  -pmi-> 0: cmd=barrier_in
  -pmi-> 1: cmd=get_my_kvsname
  <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
  -pmi-> 1: cmd=get_my_kvsname
  <-pmi- 1: cmd=my_kvsname rc=0 kvsname=kvs_0
  -pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#55159$description#n02$
  <-pmi- 1: cmd=put_result rc=0
  -pmi-> 1: cmd=barrier_in
  <-pmi- 0: cmd=barrier_out rc=0
  <-pmi- 1: cmd=barrier_out rc=0
  -pmi-> 0: cmd=get kvsname=kvs_0 key=P1-businesscard
  <-pmi- 0: cmd=get_result rc=0 value=port#55159$description#n02$
  Process 0 of 2 is on n01
  Process 1 of 2 is on n02
  ^C

I assume that Daniel has found a solution, but unfortunately it did not
make it to the list.

Any ideas?

Thanks,
Thomas
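
In the -D output, rank 0 fetches P1-businesscard (host "n02", port 55159) right before the hang, and name resolution was part of the problem in the earlier thread. As a rough aid for checking that angle, here is a minimal Python sketch that pulls the business cards out of such a log and tests whether each advertised hostname resolves on the machine where the script runs. It assumes the card format is exactly what the log shows (port#<port>$description#<host>$). The helper names (parse_business_cards, resolves, report) are made up for illustration and are not part of xcpu or mpich2. Whether an unresolvable name is actually behind the hang is not confirmed.

```python
import re
import socket

# Card format as it appears in the -D output (an assumption beyond that log):
#   key=P<rank>-businesscard value=port#<port>$description#<host>$
CARD_RE = re.compile(
    r"key=P(\d+)-businesscard value=port#(\d+)\$description#([^$]+)\$"
)

SAMPLE_LOG = """\
-pmi-> 0: cmd=put kvsname=kvs_0 key=P0-businesscard value=port#39217$description#n01$
<-pmi- 0: cmd=put_result rc=0
-pmi-> 1: cmd=put kvsname=kvs_0 key=P1-businesscard value=port#55159$description#n02$
<-pmi- 1: cmd=put_result rc=0
"""


def parse_business_cards(log_text):
    """Return {rank: (host, port)} for every business card put in the log."""
    cards = {}
    for match in CARD_RE.finditer(log_text):
        rank, port, host = match.groups()
        cards[int(rank)] = (host, int(port))
    return cards


def resolves(host):
    """True if the local resolver can map host to an IPv4 address."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def report(log_text):
    """Print one line per rank: advertised host, port, and resolver result."""
    for rank, (host, port) in sorted(parse_business_cards(log_text).items()):
        status = "resolves" if resolves(host) else "DOES NOT resolve"
        print(f"rank {rank}: {host}:{port} {status}")
```

Calling report() on the saved -D output only reflects the resolver of the machine it runs on, so the result on the headnode says nothing about what n01 sees when it looks up "n02". The check would have to run on each compute node, which may not be practical on an initramfs without Python.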
