On 24/05/17 13:45, Lachlan Musicman wrote:
> Not yet - that's part of the next update cycle :/
Ah well, that might help, along with pam_slurm_adopt, so that users
SSH'ing into nodes where they have running jobs are placed into one of
their job cgroups on that node.
Helps catch any legacy SSH-based MPI launchers.
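For anyone following along, a minimal sketch of what that setup might look
like (the exact PAM file and stack placement vary by distro, so check the
pam_slurm_adopt docs before copying this):

```shell
# /etc/pam.d/sshd (fragment) - hypothetical placement, adjust for your distro.
# pam_slurm_adopt denies SSH logins to users with no running job on the node,
# and adopts permitted sessions into the cgroup of one of their jobs.
account    required     pam_slurm_adopt.so

# slurm.conf must also track processes via cgroups for adoption to work:
#   ProctrackType=proctrack/cgroup
```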
On 24 May 2017 at 13:18, Christopher Samuel wrote:
>
> Hiya,
>
> On 24/05/17 13:10, Lachlan Musicman wrote:
>
> > Occasionally I'll see a bunch of processes "running" (sleeping) on a
> > node well after the job they are associated with has finished.
> >
> > How does this happen - does slurm not make sure all processes spawned by
> > a job have finished at completion?
Hiya,
On 24/05/17 13:10, Lachlan Musicman wrote:
> Occasionally I'll see a bunch of processes "running" (sleeping) on a
> node well after the job they are associated with has finished.
>
> How does this happen - does slurm not make sure all processes spawned by
> a job have finished at completion?
Hola,
Occasionally I'll see a bunch of processes "running" (sleeping) on a node
well after the job they are associated with has finished.
How does this happen - does slurm not make sure all processes spawned by a
job have finished at completion?
cheers
L.
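One knob worth checking when this happens (a sketch, not a diagnosis of
this particular cluster): with process-group-based tracking, a process that
daemonizes or changes its process group can escape cleanup, whereas
cgroup-based tracking lets slurmd kill everything in the job's cgroup at
completion:

```shell
# slurm.conf fragment (hypothetical values - adapt to your site):
ProctrackType=proctrack/cgroup   # track every PID via the job's cgroup
TaskPlugin=task/cgroup           # confine tasks to the cgroup at launch
```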
David,
Are you running any epilog functions that may be placing the nodes into a
drained/draining state?
John DeSantis
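For reference, one common way an epilog drains nodes (a sketch of the
mechanism, not necessarily what is happening on David's cluster): Slurm
drains a node whenever its Epilog script exits non-zero, so a health check
inside the epilog can quietly put nodes into a drained state:

```shell
#!/bin/bash
# Hypothetical epilog fragment; some_site_health_check is a placeholder.
# A non-zero exit from the epilog causes slurmctld to drain the node.
# Inspect afterwards with: scontrol show node <name> | grep -i reason
if ! some_site_health_check; then
    exit 1      # node is drained, typically with an "Epilog error" reason
fi
exit 0
```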
Baker D.J. wrote:
> Hello,
>
> I've recently started using Slurm v17.02.2; however, something seems very odd.
> For some
Manuel Rodríguez Pascual writes:
> After working with the developers of DMTCP checkpoint library, we have a
> nice working version of Slurm+DMTCP. We are able to checkpoint any batch
> job (well, most of them) and restart it anywhere else in the cluster. We
While on the subject, I have to add my 2 cents. I ran into the same
problem yesterday.
In the sbatch man page I read:

    -t, --time=<time>
        Set a limit on the total run time of the job allocation. If the
        requested time limit exceeds the partition's time limit, the job
        will be left in a PENDING state
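To illustrate the behaviour the man page describes (commands are a sketch;
the partition name and its limit are made up), a job requesting more time
than the partition allows just sits pending:

```shell
# Suppose partition "short" has MaxTime=01:00:00 (hypothetical):
sbatch --partition=short --time=02:00:00 --wrap="sleep 60"

# squeue then shows the job PENDING with reason PartitionTimeLimit:
squeue -u $USER --format="%i %T %r"
```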
I'd like to configure E-mail notifications of failing nodes. I already
use the LBL NHC (Node Health Check) on the compute nodes to send alerts,
but one may also use the Slurm strigger mechanism on the slurmctld host.
The examples in http://slurm.schedmd.com/strigger.html are quite
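For what it's worth, a minimal sketch along the lines of those examples
(the script path is hypothetical, and triggers must be set by a user with
the appropriate privileges, e.g. the SlurmUser):

```shell
# Run /usr/local/sbin/notify_down.sh (hypothetical script) when any node
# goes DOWN; strigger passes the affected node name(s) to the program.
strigger --set --node --down --program=/usr/local/sbin/notify_down.sh

# List the currently registered triggers:
strigger --get
```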
Hi,
Is it possible to specify the GRES type with MaxTRESPerNode? E.g.:
slurm.conf
(…)
GresType=gpu
(…)
NodeName=gpu[01,02] NodeAddr=10.2.1.[1,2] Gres=gpu:tesla:2,gpu:kepler:2 ...
(…)
PartitionName=gpu_one Nodes=gpu[01,02] QOS=part_gpu_one …
PartitionName=gpu_two Nodes=gpu[01,02] QOS=part_gpu_two
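In case it helps frame the question, the kind of limit being asked about
would look something like this (purely a sketch; whether a typed GRES like
gpu:tesla is accepted in a TRES limit is exactly the open question, and the
QOS name follows the slurm.conf above):

```shell
# Hypothetical: cap each node at 2 tesla GPUs per job under this QOS.
sacctmgr modify qos part_gpu_one set MaxTRESPerNode=gres/gpu:tesla=2
```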