Hi all,
I cannot get MPICH2 tight integration working with the setup I've
just tested.

I'm using
* SGE 6.2u5
* Intel Cluster toolkit 4.0.0.020

Following this howto:
http://arc.liv.ac.uk/SGE/howto/mpich2-integration/mpich2-integration.html
I've set up the "impi" parallel environment:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
pe_name            impi
slots              24
user_lists         NONE
xuser_lists        NONE
start_proc_args    /sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
                   /sw/intel/impi/4.0.1.007/intel64
stop_proc_args     /sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /sw/intel/impi/4.0.1.007/intel64
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and I'm using this simple test job script:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
#!/bin/sh

#$ -N impi-test
#$ -cwd
#$ -pe impi 12
#$ -q short


source /sw/intel/ictce/4.0.0.020/ictvars.sh intel64
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
export MPD_TMPDIR=$TMPDIR

mpdtrace -l
mpdcheck -v -l
#mpdallexit
#mpdboot -r ssh -f $TMPDIR/machines
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS hostname
#mpdallexit

exit 0
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

Submitting this with "qsub test_script", I get a ".o" file like this:
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
n0004_55051 (10.120.1.4)
obtaining hostname via gethostname and getfqdn
gethostname gives  n0004
getfqdn gives  n0004
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure
other than 127.0.0.1
gethostbyname_ex:  ('n0004', [], ['10.120.1.4'])
gethostbyname_ex:  ('n0004', [], ['10.120.1.4'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
obtain IP addrs via localhost name;  make sure that it equal to 127.0.0.1
gethostbyname_ex:  ('localhost.localdomain', ['localhost'], ['127.0.0.1'])
mpiexec_n0004 (mpiexec 1034): no msg recvd from mpd when expecting ack
of request. Please examine the /tmp/mpd2.logfile_stef log file on each
node of the ring.
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and a ".po" like this:
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
-catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile
/sw/intel/impi/4.0.1.007/intel64
n0004:12
startmpich2.sh: check for local mpd daemon (1 of 10)
/sge/bin/lx24-amd64/qrsh -inherit -V n0004
/sw/intel/impi/4.0.1.007/intel64/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
startmpich2.sh: got all 1 of 1 nodes
n0004_55051: mpd_uncaught_except_tb handling:
  exceptions.KeyError: 'job_abort_signal'
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3237  run_one_cli
        man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2969  do_mpdrun
        rv = self.run_one_cli(rank,msg)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2196  handle_console_input
        self.do_mpdrun(msg)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py  940  handle_active_streams
        handler(stream,*args)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1798  runmainloop
        rv = self.streamHandler.handle_active_streams(timeout=8.0)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1762  run
        self.runmainloop()
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3446  ?
        mpd.run()
-catch_rsh /sw/intel/impi/4.0.1.007/intel64
mpdallexit: cannot connect to local mpd
(/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined);
possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

I've also tried running it interactively via qrsh, and what happens is:
the mpdtrace/mpdcheck calls inside the script seem to work correctly,
but the mpiexec command fails and also crashes the "mpd" started by
the "startmpich2.sh" parallel environment start script.
If I uncomment the lines in the script that stop the mpd ring,
recreate it inside the job script, and stop it again after the
mpiexec, then mpiexec works fine, but obviously the tight integration
is gone...
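For completeness, this is the workaround variant of the job script I mean (the same script as above, with the mpdallexit/mpdboot lines uncommented); the paths and the "-r ssh" flag match my setup and may of course differ on other clusters:

```shell
#!/bin/sh

#$ -N impi-test
#$ -cwd
#$ -pe impi 12
#$ -q short

source /sw/intel/ictce/4.0.0.020/ictvars.sh intel64
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
export MPD_TMPDIR=$TMPDIR

# tear down the ring started by startmpich2.sh ...
mpdallexit
# ... and boot a fresh one over ssh from the SGE-generated machines file
mpdboot -r ssh -f $TMPDIR/machines
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS hostname
# stop the manually booted ring again
mpdallexit

exit 0
```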
This is my first "tight integration" setup, so I've probably missed
some configuration.
Can somebody help or point me to something to read?

Thanks
Stefano


Some other information if needed:

I'm using "ssh" as a remote shell and I've modified the "qconf -mconf"
config in this way:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
<cut>
execd_params                 H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
<cut>
qlogin_command               /sw/lib/scripts/qlogin_wrapper
qlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
<cut>
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and the qlogin_wrapper is simply:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
#!/bin/sh

HOST=$1
PORT=$2

/usr/bin/ssh -X -p $PORT $HOST
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
