Hi, yes, Intel MPI is still based on mpd; they will surely switch to MPICH2's Hydra startup at some point in the future.
> <snip>
> -catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile
> /sw/intel/impi/4.0.1.007/intel64
> n0004:12
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /sge/bin/lx24-amd64/qrsh -inherit -V n0004
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd
> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for mpd daemons (1 of 10)
> startmpich2.sh: got all 1 of 1 nodes

Okay, it built the ring. Did you also set $MPD_TMPDIR inside start/stop_proc_args? It was only added to MPICH2 just before Hydra was introduced, so I'm not sure whether it's in Intel's MPI. Can you put a sleep in the job script and check what's on the nodes in /tmp resp. $TMPDIR of the job?

`mpdtrace` is also used in start_proc_args, so this should (and must) work in the job script too. If this is failing, maybe it's getting a wrong version of it.

As you use `ssh` instead of the builtin method (do you need X11 forwarding?): did you set up passphrase-less or hostbased authentication?

Does:

mpiexec hostname

in the job script work? And was your application built with the same version of Intel MPI?

-- Reuti

> n0004_55051: mpd_uncaught_except_tb handling:
> exceptions.KeyError: 'job_abort_signal'
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 3237 run_one_cli
>     man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 2969 do_mpdrun
>     rv = self.run_one_cli(rank,msg)
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 2196 handle_console_input
>     self.do_mpdrun(msg)
> /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py 940 handle_active_streams
>     handler(stream,*args)
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 1798 runmainloop
>     rv = self.streamHandler.handle_active_streams(timeout=8.0)
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 1762 run
>     self.runmainloop()
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd 3446 ?
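The checks suggested above (a sleep so the nodes can be inspected, verifying which mpd tools the job actually picks up, and a plain `mpiexec hostname`) could be combined into one minimal test job script. A sketch only -- the PE name and slot count are placeholders, and it assumes the Intel MPI bin directory is already on the PATH via start_proc_args:

```shell
#!/bin/sh
#$ -pe mpich2 12
#$ -cwd -j y

# Which mpd tools does the job actually see? A wrong version on the
# PATH would explain mpdtrace working in start_proc_args but
# mpiexec failing in the job script.
command -v mpdtrace
command -v mpiexec
mpdtrace

# What lives in /tmp resp. the job's $TMPDIR? The mpd console file
# should be here if $MPD_TMPDIR was honored.
echo "TMPDIR=$TMPDIR"
ls -l "$TMPDIR" /tmp/mpd2.* 2>/dev/null

# Minimal parallel test before trying the real application:
mpiexec hostname

# Keep the job alive for a while so the nodes can be inspected:
sleep 300
```

This only runs under SGE with the parallel environment set up, so treat it as a template rather than a standalone script.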
> mpd.run()
> -catch_rsh /sw/intel/impi/4.0.1.007/intel64
> mpdallexit: cannot connect to local mpd
> (/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined);
> possible causes:
>   1. no mpd is running on this host
>   2. an mpd is running but was started without a "console" (-n option)
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>
> I've also tried interactively via qrsh, and what happens is: the
> mpdtrace/mpdcheck calls inside the script seem to work correctly, but
> the mpiexec command fails and also crashes the mpd started via the
> startmpich2.sh parallel-environment start script.
> If I uncomment the lines in the script which stop the mpd ring,
> recreate it inside the job script, and stop it again after mpiexec,
> the mpiexec command works fine -- but obviously the tight integration
> is gone...
> This is my first tight-integration setup, so I've probably missed
> some configuration.
> Can somebody help or point me to something to read?
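The mpdallexit error above shows the console file being looked up under the job's $TMPDIR (/tmp/20715.1.short/...); if the mpd started by start_proc_args put its console file somewhere else (e.g. plain /tmp), the job script can never find it. A hedged sketch of one fix, assuming Intel MPI's mpd honors $MPD_TMPDIR the way MPICH2's does (which is exactly the open question in this thread): export it identically in start_proc_args, stop_proc_args, and the job script.

```shell
# In startmpich2.sh / stopmpich2.sh and in the job script alike:
# point mpd's console/state files at the job's private tmp directory,
# so every component of the job looks in the same place.
# (Assumes the mpd in use honors MPD_TMPDIR.)
MPD_TMPDIR=$TMPDIR
export MPD_TMPDIR
```

If the variable is ignored by this mpd version, the sleep-and-inspect check above will show the console file appearing directly in /tmp instead of in $TMPDIR.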
>
> Thanks
> Stefano
>
> Some other information if needed:
>
> I'm using "ssh" as the remote shell and I've modified the "qconf -mconf"
> configuration in this way:
>
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> <cut>
> execd_params    H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
> <cut>
> qlogin_command  /sw/lib/scripts/qlogin_wrapper
> qlogin_daemon   /usr/sbin/sshd -i
> rlogin_command  /usr/bin/ssh
> rlogin_daemon   /usr/sbin/sshd -i
> rsh_command     /usr/bin/ssh
> rsh_daemon      /usr/sbin/sshd -i
> <cut>
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>
> and the qlogin_wrapper is simply:
>
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> #!/bin/sh
>
> HOST=$1
> PORT=$2
>
> /usr/bin/ssh -X -p $PORT $HOST
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
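With ssh in place of the builtin startup method, the `qrsh -inherit` calls used to spawn the mpd daemons must succeed fully non-interactively between the execution hosts. A quick sanity check, as a sketch -- n0004 is the host from the job output above, substitute a live node of your own:

```shell
#!/bin/sh
# Verify that ssh between execution hosts works without any prompt;
# BatchMode=yes makes ssh fail instead of asking for a passphrase,
# so a hang here would become an immediate error instead.
if ssh -o BatchMode=yes -o ConnectTimeout=5 n0004 true 2>/dev/null; then
    echo "ssh to n0004 works non-interactively"
else
    echo "ssh to n0004 needs interaction - check keys or hostbased auth"
fi
```

Run it as the job owner on one node of the job toward another; both passphrase-less keys and hostbased authentication should make the first branch fire.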
