On Tue, Nov 13, 2018 at 05:06:51PM -0700, ad...@genome.arizona.edu wrote:
> We have a cluster with gridengine 6.5u2 and noticing a strange behavior when
> running MPI jobs. Our application will finish, yet the processes continue
> to run and use up the CPU. We did configure a parallel environment for MPI
> as follows:
>
> pe_name            mpi
> slots              500
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> Then we have run our application "Maker" like this,
> qsub -cwd -N <NAME> -b y -V -pe mpi <CPUs> /opt/mpich-install/bin/mpiexec
> maker <maker options>
>
> It seems to run fine and qstat will show it running. Once it has completed,
> qstat is empty again and we have the desired output. However, the "maker"
> process have continued to run on the compute nodes until I login to each
> node and "kill -9" the processes. We did not have this problem when running
> mpiexec directly with Maker, or running Maker in stand-alone mode (without
> MPI), so I guess it is a problem with our qsub command or parallel
> environment? Any Ideas?

Do you have ENABLE_ADDGRP_KILL set? Can be helpful in killing processes
left behind when a job exits.

William
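
For anyone following along, here is a rough sketch of one way to check whether ENABLE_ADDGRP_KILL is already set. It assumes the parameter lives in execd_params of the cluster configuration, that `qconf -sconf` prints that configuration, and that the value is spelled TRUE (matched case-insensitively). The helper name has_addgrp_kill is made up for this example and is not part of Grid Engine.

```shell
#!/bin/sh
# has_addgrp_kill (hypothetical helper): reads configuration text on stdin
# and succeeds if it contains ENABLE_ADDGRP_KILL=TRUE, in any letter case.
# Assumption: the setting appears in the execd_params entry.
has_addgrp_kill() {
    grep -qiE 'ENABLE_ADDGRP_KILL=true'
}

# Intended use on a host with Grid Engine installed (not executed here):
#   qconf -sconf | has_addgrp_kill && echo "set" || echo "not set"
```

If it turns out not to be set, the usual route would be `qconf -mconf`, adding ENABLE_ADDGRP_KILL=TRUE to the execd_params line. Worth checking sge_conf(5) for your version before relying on it.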