On 18.04.2011, at 17:35, Stefano Bridi wrote:

> On Mon, Apr 18, 2011 at 5:10 PM, Reuti <[email protected]> wrote:
>> Hi,
>>
>> yes, Intel MPI is still based on the mpd. They will surely switch to
>> MPICH2 Hydra's startup at some point in the future.
>>
>>> <snip>
>>> -catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile /sw/intel/impi/4.0.1.007/intel64
>>> n0004:12
>>> startmpich2.sh: check for local mpd daemon (1 of 10)
>>> /sge/bin/lx24-amd64/qrsh -inherit -V n0004 /sw/intel/impi/4.0.1.007/intel64/bin/mpd
>>> startmpich2.sh: check for local mpd daemon (2 of 10)
>>> startmpich2.sh: check for mpd daemons (1 of 10)
>>> startmpich2.sh: got all 1 of 1 nodes
>>
>> Okay, it built the ring. Did you also set $MPD_TMPDIR inside
>> start/stop_proc_args?
>
> Sorry, I forgot to mention: I modified the startmpich2.sh and
> stopmpich2.sh scripts, adding
>
>     export MPD_TMPDIR=$TMPDIR
>
> after the
>
>     export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
>
> statement.
>
>> It was only added to MPICH2 just before Hydra was introduced; I'm not sure
>> whether it's in Intel's MPI. Can you put a sleep in the jobscript and check
>> what's on the nodes in /tmp resp. $TMPDIR of the job?
>
> In /tmp there is nothing directly connected with the user running the
> job, and in $TMPDIR I have:
>
>     -rw-r--r-- 1 stef localusers 17 Apr 18 17:19 machines
>     srwxrwx--- 1 stef localusers  0 Apr 18 17:19 mpd2.console_n0004_stef_sge_20737.undefined
>     -rwxr----- 1 stef localusers  5 Apr 18 17:19 pid.1.n0004
>     lrwxrwxrwx 1 stef localusers 19 Apr 18 17:19 rsh -> /sge/mpich2_mpd/rsh
>
>> `mpdtrace` is also used in start_proc_args, so this should (and must) work
>> in the jobscript too. If it is failing, maybe it's picking up a wrong
>> version of it.
>
> mpdtrace is working (if I understand its output correctly) in the
> jobscript as well.
>
>> As you use `ssh` instead of the -builtin- method (you need X11 forwarding?),
>> did you set up passphrase-less or hostbased authentication?
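[Editor's note: the modification Stefano describes can be sketched as below. This is a hypothetical excerpt, not the actual startmpich2.sh; inside a real job, JOB_ID, SGE_TASK_ID and TMPDIR are set by sge_execd, and the fallback values here exist only so the snippet runs outside of a job.]

```shell
#!/bin/sh
# Sketch of the relevant section of startmpich2.sh / stopmpich2.sh.
# The fallbacks below are dummy values for illustration only; in a
# job they come from sge_execd.
JOB_ID=${JOB_ID:-20715}
SGE_TASK_ID=${SGE_TASK_ID:-undefined}
TMPDIR=${TMPDIR:-/tmp}

# Existing line: give the mpd console socket a per-job suffix.
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"

# Added line: keep mpd's console socket and state files in the
# job-private scratch directory instead of a shared /tmp.
export MPD_TMPDIR=$TMPDIR

echo "MPD_CON_EXT=$MPD_CON_EXT"
echo "MPD_TMPDIR=$MPD_TMPDIR"
```

With this in place, mpd creates its console socket as $MPD_TMPDIR/mpd2.console_<host>_<user>_$MPD_CON_EXT, which matches the $TMPDIR listing above.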
>
> Yes, password-less and passphrase-less ssh is working fine.
>
>> Does:
>>
>>     mpiexec hostname
>>
>> in the jobscript work - and was your application built with the same
>> version of Intel MPI?
>>
>> -- Reuti
>
> This is the point: everything seems to be OK, but a "simple" mpiexec
> hostname makes the mpd crash with the message reported below.
> If in the jobscript I start the ring (mpdboot), then run mpiexec and
> then stop the ring (mpdallexit), everything works as expected, so I
> suppose the versions and so on are correct.
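[Editor's note: the workaround Stefano describes — booting and tearing down the ring inside the jobscript itself — might look roughly like the sketch below. The PE name and slot count are placeholders, not taken from the thread; the machines file path comes from the $TMPDIR listing shown earlier.]

```shell
#!/bin/sh
# Hypothetical SGE jobscript with a manually managed mpd ring.
# PE name and slot count are placeholders for this sketch.
#$ -pe mpich2_mpd 12
#$ -cwd

export MPD_TMPDIR=$TMPDIR

# Boot a private ring over the hosts SGE granted us, using ssh.
mpdboot -n $(wc -l < "$TMPDIR/machines") -f "$TMPDIR/machines" -r ssh

mpdtrace                       # sanity check: list the ring members

mpiexec -np $NSLOTS hostname   # the actual (test) MPI run

mpdallexit                     # tear the ring down again
```

Note the trade-off: the daemons started by mpdboot over plain ssh run outside of Grid Engine's process control, which is exactly why this works yet loses the tight integration.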
Just out of curiosity: could you download an older version of MPICH2 (e.g. 1.2.1p1) and try with its mpd startup method? As of 1.3 they already use Hydra.

-- Reuti

> Thanks
> Stefano
>
>>> n0004_55051: mpd_uncaught_except_tb handling:
>>>     exceptions.KeyError: 'job_abort_signal'
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3237  run_one_cli
>>>         man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2969  do_mpdrun
>>>         rv = self.run_one_cli(rank,msg)
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2196  handle_console_input
>>>         self.do_mpdrun(msg)
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py  940  handle_active_streams
>>>         handler(stream,*args)
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1798  runmainloop
>>>         rv = self.streamHandler.handle_active_streams(timeout=8.0)
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1762  run
>>>         self.runmainloop()
>>>     /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3446  ?
>>>         mpd.run()
>>> -catch_rsh /sw/intel/impi/4.0.1.007/intel64
>>> mpdallexit: cannot connect to local mpd
>>> (/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined);
>>> possible causes:
>>>   1. no mpd is running on this host
>>>   2. an mpd is running but was started without a "console" (-n option)
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>>
>>> I've also tried interactively via qrsh, and what happens is:
>>> the mpdtrace/mpdcheck calls inside the script seem to work correctly,
>>> but the mpiexec command fails and also crashes the mpd that was
>>> started via the startmpich2.sh parallel environment start script.
>>> If I uncomment in the script the lines that stop the mpd ring,
>>> recreate it inside the job script, and stop it after the mpiexec,
>>> the mpiexec command works fine, but obviously the tight integration
>>> is gone...
>>> This is my first tight integration setup, so I've probably missed
>>> some configuration.
>>> Can somebody help or point me to something to read?
>>>
>>> Thanks
>>> Stefano
>>>
>>> Some other information, if needed:
>>>
>>> I'm using ssh as the remote shell and I've modified the "qconf -mconf"
>>> configuration in this way:
>>>
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> <cut>
>>> execd_params                 H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
>>> <cut>
>>> qlogin_command               /sw/lib/scripts/qlogin_wrapper
>>> qlogin_daemon                /usr/sbin/sshd -i
>>> rlogin_command               /usr/bin/ssh
>>> rlogin_daemon                /usr/sbin/sshd -i
>>> rsh_command                  /usr/bin/ssh
>>> rsh_daemon                   /usr/sbin/sshd -i
>>> <cut>
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>>
>>> and the qlogin_wrapper is simply:
>>>
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> #!/bin/sh
>>>
>>> HOST=$1
>>> PORT=$2
>>>
>>> /usr/bin/ssh -X -p "$PORT" "$HOST"
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
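[Editor's note: as an aside on Reuti's suggestion of comparing against an older, mpd-based MPICH2 — a build of 1.2.1p1 could look roughly like the sketch below. The install prefix is an arbitrary example, and since the 1.2.x series still defaults to the mpd process manager, `--with-pm=mpd` merely makes that choice explicit.]

```shell
# After unpacking the mpich2-1.2.1p1 release tarball:
cd mpich2-1.2.1p1

# /sw/mpich2-1.2.1p1 is an example prefix, not a path from the thread.
./configure --prefix=/sw/mpich2-1.2.1p1 --with-pm=mpd
make
make install

# Then rebuild the test case against this MPICH2 and point the PE's
# start/stop scripts at /sw/mpich2-1.2.1p1/bin for the comparison run.
```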
