On 17.06.2014 at 15:23, Esztermann, Ansgar wrote:

> On Jun 5, 2014, at 12:25 , Reuti <[email protected]> wrote:
> 
>> On 05.06.2014 at 11:51, Esztermann, Ansgar wrote:
>> 
>>> Hi everyone,
>>> 
>>> we have a strange problem here where jobs die through SIGKILL (so far, I 
>>> have failed to find out what triggered the signal) but then some processes 
>>> remain on the node. We are using one of the killkids variants, but (at 
>>> least) for multi-node jobs, there are actually *two* gids in use on the job 
>>> master: one for the jobscript, mpiexec.hydra and qsh, and another one for 
>>> qrsh_starter and the actual executables.
>> 
>> This sounds like a problem in the MPI setup. There shouldn't be any local 
>> `qrsh` for recent MPI implementations (if there is, you are right: it gets a 
>> new addgrpid). With current MPI libraries the local processes should be 
>> forked by the `mpiexec`. Is the name resolution working? I.e. are all nodes 
>> using only the hostname *or* the FQDN?
> 
> After some more detailed probing, this does not seem to be the case: 
> mpiexec.hydra exec()s qrsh once for each host. `hostname`

It was only a test of whether a default call of the function outputs the 
short name or the FQDN. I still wonder why this happens.

Can you output the name of the node the process thinks it's running on? Maybe 
this will work, although I'm not sure whether it will output the FQDN at all:

https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Environment_Variables_Job_Startup_Command.htm

$ mpiexec.hydra -env I_MPI_DEBUG 2,pid,host ...
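
As a cross-check that does not depend on the MPI library's debug output, here is a minimal sketch, assuming Python is available on the nodes. The function name `describe_host` and the script name `whoami_host.py` are made up for illustration. It prints what each process sees as its own hostname and as the FQDN:

```python
import os
import socket


def describe_host():
    """Return the names this process sees for the node it is running on."""
    # Name as reported by the system's gethostname() call; depending on the
    # node's configuration this may be the short name or already the FQDN.
    hostname = socket.gethostname()
    # Fully qualified name as the resolver sees it; falls back to the
    # plain hostname if no qualified name can be found.
    fqdn = socket.getfqdn()
    return {"pid": os.getpid(), "hostname": hostname, "fqdn": fqdn}


if __name__ == "__main__":
    info = describe_host()
    print("pid={pid} hostname={hostname} fqdn={fqdn}".format(**info))
```

Started through `mpiexec.hydra` in place of the real executable (e.g. as `python whoami_host.py`), each process should print one line, which can then be compared with the entries in the machinefile.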

-- Reuti


> is not called, so I've rewritten the machinefile to contain short hostnames 
> only, but to no avail: qrsh is used nonetheless.
> This is IntelMPI 4.2.3.048 (the latest being .049, but no relevant changes in 
> the release notes).
> 
> A.
> -- 
> Ansgar Esztermann
> DV-Systemadministration
> Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

