On 18.04.2011 at 17:35, Stefano Bridi wrote:

> On Mon, Apr 18, 2011 at 5:10 PM, Reuti <[email protected]> wrote:
>> Hi,
>> 
>> yes, Intel MPI is still based on the mpd. They will use MPICH2 Hydra’s 
>> startup at some point in the future for sure.
>> 
>>> <snip>
>>> -catch_rsh /sge/default/spool/n0004/active_jobs/20715.1/pe_hostfile
>>> /sw/intel/impi/4.0.1.007/intel64
>>> n0004:12
>>> startmpich2.sh: check for local mpd daemon (1 of 10)
>>> /sge/bin/lx24-amd64/qrsh -inherit -V n0004
>>> /sw/intel/impi/4.0.1.007/intel64/bin/mpd
>>> startmpich2.sh: check for local mpd daemon (2 of 10)
>>> startmpich2.sh: check for mpd daemons (1 of 10)
>>> startmpich2.sh: got all 1 of 1 nodes
>> 
>> Okay, it built the ring. You also set $MPD_TMPDIR inside 
>> start/stop_proc_args?
> 
> Sorry, I forgot to mention it here: I modified the startmpich2.sh and
> stopmpich2.sh scripts, adding
> export MPD_TMPDIR=$TMPDIR
> after the
> export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
> statement.
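For reference, the relevant excerpt of such a modified startmpich2.sh might look like this sketch (the default values are only there so the snippet can be run standalone; in a real job JOB_ID, SGE_TASK_ID and TMPDIR are set by SGE before the PE start script runs):

```shell
# Sketch of the two exports added to startmpich2.sh / stopmpich2.sh:
# a per-job console name and a job-private temp directory for the mpd
# ring, so concurrent jobs on one node don't collide.
# JOB_ID, SGE_TASK_ID and TMPDIR normally come from SGE; the defaults
# below only make the snippet runnable outside a job.
JOB_ID=${JOB_ID:-20737}
SGE_TASK_ID=${SGE_TASK_ID:-undefined}
TMPDIR=${TMPDIR:-/tmp/$JOB_ID.1.short}

export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
export MPD_TMPDIR=$TMPDIR
```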
> 
>> $MPD_TMPDIR was only added to MPICH2 shortly before Hydra was introduced, 
>> so I'm not sure whether it's honored in Intel's MPI. Can you put a sleep in 
>> the jobscript and check what's on the nodes in /tmp and in the $TMPDIR of 
>> the job?
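A minimal diagnostic jobscript for such a check could look like this sketch (the PE name and slot count are examples, not taken from the thread):

```shell
#!/bin/sh
# Hypothetical diagnostic jobscript: print and hold, so the node-local
# directories can be inspected by hand while the job is running.
#$ -pe mpich2 12
#$ -cwd

echo "Master node: $(hostname), TMPDIR=$TMPDIR"
ls -l /tmp "$TMPDIR"

# Keep the job alive long enough to qrsh/ssh to the slave nodes and
# look at their /tmp and $TMPDIR as well.
sleep 300
```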
> 
> in "/tmp" there is nothing directly connected with the user which is
> running and in the $TMPDIR I have
> 
> -rw-r--r-- 1 stef localusers   17 Apr 18 17:19 machines
> srwxrwx--- 1 stef localusers    0 Apr 18 17:19
> mpd2.console_n0004_stef_sge_20737.undefined
> -rwxr----- 1 stef localusers    5 Apr 18 17:19 pid.1.n0004
> lrwxrwxrwx 1 stef localusers   19 Apr 18 17:19 rsh -> /sge/mpich2_mpd/rsh
> 
> 
>> `mpdtrace` is also used in start_proc_args, so this should (and must) work 
>> in the jobscript too. If this is failing, maybe it's getting a wrong version 
>> of it.
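A quick sanity check for this in the jobscript might be (a sketch; with tight integration these should resolve to the Intel MPI bin directory, not to some other MPICH2 install earlier in $PATH):

```shell
# Which mpdtrace does the jobscript actually pick up?
command -v mpdtrace
# List the ring members together with their ports.
mpdtrace -l
```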
> 
> mpdtrace is also working in the jobscript (if I understand the output
> correctly).
> 
>> As you use `ssh` instead of the -builtin- method (do you need X11 
>> forwarding?), did you set up passphrase-less or hostbased authentication? Does:
> 
> Yes, password-less and passphrase-less ssh is working fine.
> 
>> mpiexec hostname
>> 
>> in the jobscript work? Was your application built with the same version of 
>> Intel MPI?
>> 
>> -- Reuti
> 
> This is the point: everything seems to be okay, but even a "simple" mpiexec
> hostname causes the mpd to crash with the message reported below.
> If in the jobscript I start the ring (mpdboot), then run mpiexec, and
> then stop the ring (mpdallexit), everything works as expected, so I
> suppose the versions and so on are correct.
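The workaround jobscript described above might look like this sketch (the machinefile generation from $PE_HOSTFILE is illustrative; NHOSTS and NSLOTS are set by SGE):

```shell
#!/bin/sh
# Workaround sketch: boot a fresh mpd ring inside the job instead of
# reusing the one from the PE start script. This runs, but the mpds
# are then no longer children of sge_execd, so the tight integration
# (accounting, clean termination of stray processes) is lost.

# $PE_HOSTFILE lines start with "hostname slots ..."; build an
# mpd.hosts-style "host:ncpus" machinefile from them.
awk '{print $1":"$2}' "$PE_HOSTFILE" > "$TMPDIR/machines"

mpdboot -n "$NHOSTS" -f "$TMPDIR/machines" -r ssh
mpiexec -n "$NSLOTS" hostname
mpdallexit
```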

Just out of curiosity: could you download an older version of MPICH2 (e.g. 
1.2.1p1) and try with its mpd startup method?

In 1.3 they already use Hydra.

-- Reuti


> 
> 
> Thanks
> Stefano
> 
>> 
>> 
>>> n0004_55051: mpd_uncaught_except_tb handling:
>>>  exceptions.KeyError: 'job_abort_signal'
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3237  run_one_cli
>>>        man_env['MPDMAN_JOB_ABORT_SIGNAL'] = msg['job_abort_signal']
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2969  do_mpdrun
>>>        rv = self.run_one_cli(rank,msg)
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  2196  handle_console_input
>>>        self.do_mpdrun(msg)
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpdlib.py  940  
>>> handle_active_streams
>>>        handler(stream,*args)
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1798  runmainloop
>>>        rv = self.streamHandler.handle_active_streams(timeout=8.0)
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  1762  run
>>>        self.runmainloop()
>>>    /sw/intel/impi/4.0.1.007/intel64/bin/mpd  3446  ?
>>>        mpd.run()
>>> -catch_rsh /sw/intel/impi/4.0.1.007/intel64
>>> mpdallexit: cannot connect to local mpd
>>> (/tmp/20715.1.short/mpd2.console_n0004_stef_sge_20715.undefined);
>>> possible causes:
>>>  1. no mpd is running on this host
>>>  2. an mpd is running but was started without a "console" (-n option)
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> 
>>> I've also tried interactively via qrsh, and what happens is:
>>> the mpdtrace/mpdcheck calls inside the script seem to work correctly, but
>>> the mpiexec command fails and also crashes the "mpd"
>>> started via the "startmpich2.sh" parallel environment start script.
>>> If I uncomment in the script the lines which stop the mpd ring, and then
>>> recreate the ring inside the job script and stop it after the mpiexec, the
>>> mpiexec command works fine, but obviously the tight integration is
>>> gone...
>>> This is my first "tight integration" setup, so I've probably missed
>>> some configuration.
>>> Can somebody help or point me to something to read?
>>> 
>>> Thanks
>>> Stefano
>>> 
>>> 
>>> Some other information if needed:
>>> 
>>> I'm using "ssh" as a remote shell and I've modified the "qconf -mconf"
>>> config in this way:
>>> 
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> <cut>
>>> execd_params                 H_MEMORYLOCKED=infinity ENABLE_ADDGRP_KILL=TRUE
>>> <cut>
>>> qlogin_command               /sw/lib/scripts/qlogin_wrapper
>>> qlogin_daemon                /usr/sbin/sshd -i
>>> rlogin_command               /usr/bin/ssh
>>> rlogin_daemon                /usr/sbin/sshd -i
>>> rsh_command                  /usr/bin/ssh
>>> rsh_daemon                   /usr/sbin/sshd -i
>>> <cut>
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> 
>>> and the qlogin_wrapper is simply:
>>> 
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> #!/bin/sh
>>> 
>>> # SGE passes the target host and port as arguments.
>>> HOST=$1
>>> PORT=$2
>>> 
>>> # -X enables X11 forwarding; exec replaces the wrapper with ssh itself.
>>> exec /usr/bin/ssh -X -p "$PORT" "$HOST"
>>> ---8<--------8<--------8<-------8<--------8<--------8<-------8<--------8<--------8<----
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 

