On Jun 5, 2014, at 12:25, Reuti <[email protected]> wrote:

> On 05.06.2014 at 11:51, Esztermann, Ansgar wrote:
>
>> Hi everyone,
>>
>> we have a strange problem here where jobs die through SIGKILL (so far, I
>> have failed to find out what triggered the signal) but then some processes
>> remain on the node. We are using one of the killkids variants, but (at
>> least) for multi-node jobs, there are actually *two* gids in use on the job
>> master: one for the jobscript, mpiexec.hydra and qsh, and another one for
>> qrsh_starter and the actual executables.
>
> This sounds like a problem in the MPI setup. There shouldn't be any local
> `qrsh` for recent MPI implementations (if so, you are right: it gets a new
> addgrpid). Using current MPI libraries the local processes should be forked by
> the `mpiexec`. Is the name resolution working? I.e. all are using only the
> hostname *or* the FQDN?

After some more detailed probing, this does not seem to be the case:
mpiexec.hydra exec()s qrsh once for each host. `hostname` is not called, so
I've rewritten the machinefile to contain short hostnames only, but to no
avail: qrsh is used nonetheless.

This is IntelMPI 4.2.3.048 (the latest being .049, but no relevant changes in
the release notes).

A.

-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
