Hi all,

I cannot get MPICH2 tight integration working with the setup I've just tested.
I'm using:

* SGE 6.2u5
* Intel Cluster Toolkit 4.0.0.020

Following this howto:
http://arc.liv.ac.uk/SGE/howto/mpich2-integration/mpich2-integration.html

I've set up the "impi" parallel environment:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
pe_name            impi
slots              24
user_lists         NONE
xuser_lists        NONE
start_proc_args    /sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
                   /sw/intel/impi/4.0.1.007/intel64
stop_proc_args     /sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
                   /sw/intel/impi/4.0.1.007/intel64
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and I'm using this simple test job script:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
#!/bin/sh
#$ -N impi-test
#$ -cwd
#$ -pe impi 12
#$ -q short

source /sw/intel/ictce/4.0.0.020/ictvars.sh intel64

export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
export MPD_TMPDIR=$TMPDIR

mpdtrace -l
mpdcheck -v -l

#mpdallexit
#mpdboot -r ssh -f $TMPDIR/machines

mpiexec -machinefile $TMPDIR/machines -n $NSLOTS hostname

#mpdallexit
exit 0
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

Running this with "qsub test_script" I get a ".o" file like this:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
n0004_55051 (10.120.1.4)
obtaining hostname via gethostname and getfqdn
gethostname gives n0004
getfqdn gives n0004
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
gethostbyname_ex: ('n0004', [], ['10.120.1.4'])
gethostbyname_ex: ('n0004', [], ['10.120.1.4'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
obtain IP addrs via localhost name; make sure that it equal to 127.0.0.1
gethostbyname_ex: ('localhost.localdomain', ['localhost'], ['127.0.0.1'])
mpiexec_n0004 (mpiexec 1034): no msg recvd from mpd when expecting ack of request.
Please examine the /tmp/mpd2.logfile_stef log file on each node of the ring.
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and a ".po" like this:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
-catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile /sw/intel/impi/4.0.1.007/intel64
n0004:12
startmpich2.sh: check for local mpd daemon (1 of 10)
/sge/bin/lx24-amd64/qrsh -inherit -V n0004 /sw/intel/impi/4.0.1.007/intel64/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
startmpich2.sh: got all 1 of 1 nodes
n0004_55051: mpd_uncaught_except_tb handling:
  exceptions.KeyError: 'job_abort_signal'
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3237  run_one_cli
        man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2969  do_mpdrun
        rv = self.run_one_cli(rank,msg)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2196  handle_console_input
        self.do_mpdrun(msg)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py  940  handle_active_streams
        handler(stream,*args)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1798  runmainloop
        rv = self.streamHandler.handle_active_streams(timeout=8.0)
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1762  run
        self.runmainloop()
    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3446  ?
        mpd.run()
-catch_rsh /sw/intel/impi/4.0.1.007/intel64
mpdallexit: cannot connect to local mpd (/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

I've also tried running it interactively via qrsh, and what happens is: the mpdtrace/mpdcheck calls inside the script seem to work correctly, but the mpiexec command fails and also produces the crash in the mpd started by the startmpich2.sh parallel-environment start script.

If I uncomment the lines in the script that stop the mpd ring, recreate it inside the job script, and stop it again after mpiexec, then the mpiexec command works fine, but obviously the tight integration is gone...

This is my first tight-integration setup, so I've probably missed some configuration. Can somebody help or point me to something to read?

Thanks,
Stefano

Some other information, if needed: I'm using ssh as the remote shell, and I've modified the "qconf -mconf" configuration like this:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
<cut>
execd_params                 H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
<cut>
qlogin_command               /sw/lib/scripts/qlogin_wrapper
qlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
<cut>
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

and the qlogin_wrapper is simply:

---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
#!/bin/sh
HOST=$1
PORT=$2
/usr/bin/ssh -X -p $PORT $HOST
---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
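P.S. One detail from the mpdallexit error above: assuming mpd names its console socket `mpd2.console_<host>_<user>_<MPD_CON_EXT>` (an assumption I'm making from the error text itself, not from any documentation), the path it complains about can be reconstructed from the variables in the job script; note that $SGE_TASK_ID expands to the literal string "undefined" for non-array jobs.

```shell
#!/bin/sh
# Sketch: rebuild the console socket path that mpdallexit reported missing.
# The mpd2.console_<host>_<user>_<MPD_CON_EXT> naming is an assumption
# inferred from the error message, and the values below are from my job.
JOB_ID=20715
SGE_TASK_ID=undefined        # non-array jobs get the literal "undefined"
MPD_CON_EXT="sge_${JOB_ID}.${SGE_TASK_ID}"
MPD_TMPDIR=/tmp/20715.1.short
host=n0004
user=stef
echo "${MPD_TMPDIR}/mpd2.console_${host}_${user}_${MPD_CON_EXT}"
# -> /tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined
```

The rebuilt path matches the one printed in the ".po" file, so the name at least looks consistent with the MPD_CON_EXT setting in the job script; the question is why the mpd behind that socket goes away.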

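P.S.2. One more thing I'm unsure about: the job script sources ictvars.sh from ictce 4.0.0.020, while the PE start/stop scripts point at impi/4.0.1.007, so mpiexec and mpd might come from two different Intel MPI builds. Could that explain the KeyError? This is a throwaway sketch to see which binaries actually resolve on $PATH (check_tool is just an illustrative helper, not part of any toolkit):

```shell
#!/bin/sh
# Sketch: print where each MPI tool resolves on $PATH, to make a
# 4.0.0.020 mpiexec talking to a 4.0.1.007 mpd easy to spot.
check_tool() {
  tool=$1
  path=$(command -v "$tool" 2>/dev/null) || path="(not found)"
  echo "$tool -> $path"
}
check_tool mpd
check_tool mpiexec
check_tool mpdallexit
```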