Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Steven Senator (slurm-dev-list)
Also consider the --no-kill ("-k") option to sbatch (and srun). The following is from the sbatch man page: -k, --no-kill [=off] Do not automatically terminate a job if one of the nodes it has been allocated fails. The user will assume the responsibilities for
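A minimal sketch of that option as described, with a placeholder script name:

    # keep the allocation alive even if an allocated node fails;
    # the batch script then has to cope with the lost node itself
    sbatch --no-kill batch_job.sh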

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Rodrigo Santibáñez
What about, instead of an (automatic) requeue of the job, using --no-requeue on the first sbatch, and when something goes wrong with the job (why not something wrong with the node?), submitting the job again with --no-requeue and the failing nodes excluded? Something like: sbatch --no-requeue file.sh, and then sbatch
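A sketch of that resubmit-by-hand pattern, assuming the bad node's name is known (file.sh and node01 are placeholders):

    # first submission: never requeue automatically
    sbatch --no-requeue file.sh

    # after a failure traced to node01: resubmit with that node excluded
    sbatch --no-requeue --exclude=node01 file.sh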

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
Not quite. The user’s job script in question checks the error status of the program it runs while the job is running. If the program fails, the running job wants to exclude the machine it is currently running on and requeue itself, in case it died due to a local machine issue that the scheduler has
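A sketch of the in-script pattern being described, for a single-node job; whether the controller accepts the ExcNodeList update on a running job is exactly the open question of this thread, and ./my_program is a placeholder:

    # inside the batch script
    ./my_program
    if [ $? -ne 0 ]; then
        # blame the current node, try to exclude it, and requeue this job
        scontrol update JobId=$SLURM_JOB_ID ExcNodeList=$SLURMD_NODENAME
        scontrol requeue $SLURM_JOB_ID
    fi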

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Riebs, Andy
Geoffrey, A lot depends on what you mean by “failure on the current machine”. If it’s a failure that Slurm recognizes as a failure, Slurm can be configured to remove the node from the partition, and you can follow Rodrigo’s suggestions for the requeue options. If the user job simply decides
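For the case where Slurm itself does recognize the failure, the node can also be taken out of service by hand; node name and reason below are placeholders:

    # drain the suspect node so no further jobs land on it
    scontrol update NodeName=node01 State=DRAIN Reason="suspected local failure"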

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Rodrigo Santibáñez
Hello, Jobs can be requeued if something goes wrong, and the failing node excluded by the controller. *--requeue* Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher
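A minimal sketch of submitting with requeue enabled (file.sh is a placeholder):

    # allow the controller to requeue this job, e.g. after a node failure
    sbatch --requeue file.sh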

[slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
Hello, We are moving from Univa (SGE) to Slurm, and one of our users has jobs that, if they detect a failure on the current machine, add that machine to their exclude list and requeue themselves. The user wants to emulate that behavior in Slurm. It seems like "scontrol update job
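For reference, the field being discussed is visible on an existing job; 12345 is a placeholder job id:

    scontrol show job 12345 | grep -i ExcNodeList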

[slurm-users] Job failure issue in Slurm

2020-06-04 Thread navin srivastava
Hi Team, I am seeing a weird issue in my environment. One of the Gaussian jobs fails under Slurm within a minute of starting execution, without writing anything, and I am unable to figure out the reason. The same job works fine without Slurm on the same node. slurmctld.log
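A first step that sometimes narrows this kind of failure down, assuming the job id is known (12345 is a placeholder, and the slurmd log path depends on the SlurmdLogFile setting):

    # exit code and state as recorded in accounting
    sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,NodeList

    # node-side messages on the execution host
    grep 12345 /var/log/slurm/slurmd.log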

Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-04 Thread Bjørn-Helge Mevik
Corey Keasling writes: > The documentation only refers to GrpGRESRunMins, but I can't figure > out what I might substitute for GRES that means Memory in the same way > that substituting CPU means, well, CPUs. Google turns up precisely > nothing for GrpMemRunMins... Am I missing something?
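A hedged sketch of the TRES-based limit that superseded the per-resource RunMins limits; the account name and values are placeholders, and memory is counted in megabyte-minutes here:

    sacctmgr modify account myaccount set GrpTRESRunMins=cpu=100000,mem=2000000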

[slurm-users] How to view GPU indices of the completed jobs?

2020-06-04 Thread Kota Tsuyuzaki
Hello Guys, We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series) and some of the GPUs seem to be causing trouble for the jobs attached to them. To investigate whether the trouble happened on the same GPUs, I'd like to get the GPU indices of completed jobs. In my understanding, `scontrol show job`
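Since the accounting database normally stores GPU counts rather than indices, one hedged workaround is to record the assignment from inside the job so it can be looked up later; SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES are assumed to be set by the gres/gpu plugin in this setup, and the log path is a placeholder:

    # at the top of the batch script: note which GPU indices this job received
    echo "$SLURM_JOB_ID $SLURMD_NODENAME gpus=${SLURM_JOB_GPUS:-$CUDA_VISIBLE_DEVICES}" >> /shared/gpu-assignments.log

The per-job GPU counts that are stored can still be pulled back with, for example, sacct -j 12345 --format=JobID,AllocTRES%40.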