[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
On 24/05/17 13:45, Lachlan Musicman wrote:
> Not yet - that's part of the next update cycle :/

Ah well, that might help, along with pam_slurm_adopt so that users SSH'ing into nodes they have jobs on are put into a cgroup of theirs on that node. Helps catch any legacy SSH-based MPI launchers

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Lachlan Musicman
On 24 May 2017 at 13:18, Christopher Samuel wrote:
> Hiya,
>
> On 24/05/17 13:10, Lachlan Musicman wrote:
> > Occasionally I'll see a bunch of processes "running" (sleeping) on a
> > node well after the job they are associated with has finished.
> >
> > How does this

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
Hiya,

On 24/05/17 13:10, Lachlan Musicman wrote:
> Occasionally I'll see a bunch of processes "running" (sleeping) on a
> node well after the job they are associated with has finished.
>
> How does this happen - does slurm not make sure all processes spawned by
> a job have finished at

[slurm-dev] Job ends successfully but spawned processes still run?

2017-05-23 Thread Lachlan Musicman
Hola,

Occasionally I'll see a bunch of processes "running" (sleeping) on a node well after the job they are associated with has finished.

How does this happen - does slurm not make sure all processes spawned by a job have finished at completion?

cheers
L.
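The situation described in this post can be spotted with a simple check: any user-owned process on a node whose owner has no job currently running there is a candidate leftover. The sketch below is only a heuristic under stated assumptions, and it does not describe how Slurm itself tracks or cleans up processes. The names `Proc` and `find_stray_processes` are hypothetical, the cutoff of 1000 for regular user UIDs is an assumption that varies between sites, and gathering the inputs (for example from a process listing and the scheduler's queue) is left out.

```python
from collections import namedtuple

# Hypothetical record for one process on a compute node.
Proc = namedtuple("Proc", ["pid", "uid", "command"])


def find_stray_processes(processes, uids_with_jobs, min_user_uid=1000):
    """Return processes owned by regular users who have no job on this node.

    processes      -- iterable of Proc records seen on the node
    uids_with_jobs -- set of UIDs that currently have a job on the node
    min_user_uid   -- assumed lowest UID for regular users (site-specific)
    """
    return [
        p
        for p in processes
        if p.uid >= min_user_uid and p.uid not in uids_with_jobs
    ]


if __name__ == "__main__":
    seen = [
        Proc(1, 0, "systemd"),        # system process, ignored
        Proc(4021, 1501, "mpi_app"),  # owner still has a job here
        Proc(4388, 1502, "python"),   # owner has no job here: candidate
    ]
    for proc in find_stray_processes(seen, uids_with_jobs={1501}):
        print(proc.pid, proc.uid, proc.command)
```

A check like this misses leftovers from a user who still has a different job on the same node, so it is at best a first filter.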

[slurm-dev] Re: Compute nodes going to drained/draining state

2017-05-23 Thread John DeSantis
David,

Are you running any epilog functions that may be placing the nodes into a drained/draining state?

John DeSantis

Baker D.J. wrote:
> Hello,
>
> I've recently started using slurm v17.02.2, however something seems very odd.
> For some

[slurm-dev] Re: thoughts on task preemption

2017-05-23 Thread Bjørn-Helge Mevik
Manuel Rodríguez Pascual writes:
> After working with the developers of DMTCP checkpoint library, we have a
> nice working version of Slurm+DMTCP. We are able to checkpoint any batch
> job (well, most of them) and restarting it anywhere else in the cluster. We

[slurm-dev] Re: PartitionTimeLimit : what does that mean?

2017-05-23 Thread Barbara Krašovec
While on the subject, I have to add my 2 cents. I ran into the same problem yesterday. In the sbatch man page I read:

"-t, --time=<time>
Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state
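The comparison described in the quoted man page text (requested time limit versus the partition's time limit) can be sketched as follows. This is a minimal illustration, not Slurm's own code: the function names `parse_slurm_time` and `exceeds_partition_limit` are hypothetical, and the parser assumes only the time formats that sbatch documents for `--time` ("minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"). Special values such as an unlimited partition limit are deliberately not handled.

```python
def parse_slurm_time(spec):
    """Convert a Slurm-style time string to seconds.

    Assumed formats: "minutes", "minutes:seconds",
    "hours:minutes:seconds", "days-hours", "days-hours:minutes",
    "days-hours:minutes:seconds".
    """
    text = spec.strip()
    days = 0
    if "-" in text:
        # With a day prefix the remaining fields start at hours.
        day_part, text = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in text.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields in %r" % spec)
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, (minutes, seconds) = 0, fields
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields in %r" % spec)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def exceeds_partition_limit(requested, partition_limit):
    """True if the requested limit is longer than the partition's limit."""
    return parse_slurm_time(requested) > parse_slurm_time(partition_limit)


if __name__ == "__main__":
    # A two-day request against a partition capped at 24 hours.
    print(exceeds_partition_limit("2-00:00:00", "24:00:00"))
```

A request for which this check is true is the case the man page text is talking about: the job is accepted but stays pending.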

[slurm-dev] Tools for using strigger to monitor nodes?

2017-05-23 Thread Ole Holm Nielsen
I'd like to configure E-mail notifications of failing nodes. I already use the LBL NHC (Node Health Check) on the compute nodes to send alerts, but one may also use the Slurm strigger mechanism on the slurmctld host. The examples in http://slurm.schedmd.com/strigger.html are quite

[slurm-dev] Specify GRES Type with MaxTRESPerNode

2017-05-23 Thread nico.faerber
Hi,

Is it possible to specify the GRES type with MaxTRESPerNode? E.g.:

slurm.conf
(…)
GresType=gpu
(…)
NodeName=gpu[01,02] NodeAddr=10.2.1.[1,2] Gres=gpu:tesla:2,gpu:kepler:2 ...
(…)
PartitionName=gpu_one Nodes=gpu[01,02] QOS=part_gpu_one …
PartitionName=gpu_two Nodes=gpu[01,02] QOS=part_gpu_two