Hi,

yes, Intel MPI is still based on mpd. They will certainly switch to MPICH2's Hydra 
startup at some point in the future.

> <snip>
> -catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile
> /sw/intel/impi/4.0.1.007/intel64
> n0004:12
> startmpich2.sh: check for local mpd daemon (1 of 10)
> /sge/bin/lx24-amd64/qrsh -inherit -V n0004
> /sw/intel/impi/4.0.1.007/intel64/bin/mpd
> startmpich2.sh: check for local mpd daemon (2 of 10)
> startmpich2.sh: check for mpd daemons (1 of 10)
> startmpich2.sh: got all 1 of 1 nodes

Okay, it built the ring. Did you also set $MPD_TMPDIR inside start/stop_proc_args?
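
If not, setting it could look something like this (a sketch only -- the file name 
startmpich2.sh is from your output, and whether Intel's mpd honors the variable at 
all is exactly the open question below):

```shell
# Hypothetical fragment for start_proc_args (startmpich2.sh):
# point mpd's console/state files at the job-private scratch directory
# that SGE creates, so concurrent jobs on one node don't collide.
MPD_TMPDIR=$TMPDIR
export MPD_TMPDIR
```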

$MPD_TMPDIR was only added to MPICH2 just before Hydra was introduced, so I'm not 
sure whether Intel's MPI supports it. Can you put a sleep in the job script and 
check what's on the nodes in /tmp resp. the $TMPDIR of the job?
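
Something along these lines would make the job pause long enough to look around on 
the nodes (a sketch; the PE name and sleep length are arbitrary assumptions):

```shell
#!/bin/sh
#$ -pe mpich2 2
# Hypothetical debugging job script: show where mpd put its files,
# then pause so one can log in to the nodes and inspect /tmp meanwhile.
echo "TMPDIR of the job: $TMPDIR"
ls -l "$TMPDIR"
ls -l /tmp/mpd2.* 2>/dev/null
sleep 600
```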

`mpdtrace` is also used in start_proc_args, so it should (and must) work in the 
job script too. If it's failing there, maybe the job script is picking up the 
wrong version of it.

As you use `ssh` instead of the -builtin- method (do you need X11 forwarding?), 
did you set up passphrase-less or host-based authentication? Does:

mpiexec hostname

work in the job script? And was your application built with the same version of 
Intel MPI?
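
A minimal check could look like this (a sketch; the PE name "mpich2" and the slot 
count are assumptions, adjust them to your setup):

```shell
#!/bin/sh
#$ -pe mpich2 4
# Hypothetical sanity-check job script: confirm which Intel MPI binaries
# the job actually finds, then verify that the ring built by
# start_proc_args answers before trying the real application.
which mpiexec mpdtrace
mpdtrace -l
mpiexec -n $NSLOTS hostname
```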

-- Reuti


> n0004_55051: mpd_uncaught_except_tb handling:
>  exceptions.KeyError: 'job_abort_signal'
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3237  run_one_cli
>        man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2969  do_mpdrun
>        rv = self.run_one_cli(rank,msg)
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2196  handle_console_input
>        self.do_mpdrun(msg)
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py  940  handle_active_streams
>        handler(stream,*args)
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1798  runmainloop
>        rv = self.streamHandler.handle_active_streams(timeout=8.0)
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1762  run
>        self.runmainloop()
>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3446  ?
>        mpd.run()
> -catch_rsh /sw/intel/impi/4.0.1.007/intel64
> mpdallexit: cannot connect to local mpd
> (/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined);
> possible causes:
>  1. no mpd is running on this host
>  2. an mpd is running but was started without a "console" (-n option)
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> 
> I've also tried interactively via qrsh, and what happens is:
> the mpdtrace/mpdcheck inside the script seem to work correctly, but
> the mpiexec command fails and also crashes the mpd
> started via the "startmpich2.sh" parallel environment start script.
> If I uncomment the lines in the script which stop the mpd ring and
> recreate it inside the job script, stopping it again after the mpiexec,
> the mpiexec command works fine, but obviously the tight integration is
> gone...
> This is my first "tight integration" setup, so I've probably missed
> some configuration..
> Can somebody help or point me to something to read?
> 
> Thanks
> Stefano
> 
> 
> Some other information if needed:
> 
> I'm using "ssh" as a remote shell and I've modified the "qconf -mconf"
> config in this way:
> 
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> <cut>
> execd_params                 H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
> <cut>
> qlogin_command               /sw/lib/scripts/qlogin_wrapper
> qlogin_daemon                /usr/sbin/sshd -i
> rlogin_command               /usr/bin/ssh
> rlogin_daemon                /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> rsh_daemon                   /usr/sbin/sshd -i
> <cut>
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> 
> and the qlogin_wrapper is simply:
> 
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> #!/bin/sh
> 
> HOST=$1
> PORT=$2
> 
> /usr/bin/ssh -X -p $PORT $HOST
> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

