date:20140313

[slurm-dev] Checkpointing with SLURM + MVAPICH2 + BLCR

2014-03-13 Thread Arjun J Rao

It is not exactly clear from the documentation here (https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html) how I am supposed to checkpoint my jobs launched via SLURM. Say I have launched an MPI job with the following command: srun_cr -N2 -n24 --checkpoint 1 --checkpoint-dir

[slurm-dev]

2014-03-13 Thread Marcin Stolarek

Hi guys, On our cluster we have run into a situation where we want to change the SlurmdSpoolDir location. Do you know any way to do this without draining the whole cluster? cheers, marcin Marcin Stolarek Interdisciplinary Center for Mathematical and Computational Modeling,

[slurm-dev] Re: slurmd crashed on some nodes after scontrol reconfigure

2014-03-13 Thread Christopher Samuel

On 11/03/14 06:20, Andy Riebs wrote: "Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there." We've seen this same problem too, lost a heap of jobs to it. :-( -- Christopher Samuel

[slurm-dev] RE: error: We have more allocated time than is possible...

2014-03-13 Thread Lipari, Don

I wanted to add another reason (just discovered today) for the "We have more allocated time than is possible" error emitted to the slurmdbd.log. Disclaimer: I found this in an old version (v2.3.3) of Slurm, and can't confirm that the problem can still happen. The slurmctld submits job records

[slurm-dev] missing SLURM environment variables with export=NONE

2014-03-13 Thread Paul.Ryan

Hi All, When we use --export=NONE with sbatch, so that we get a clean environment to work with, some SLURM environment variables don't get set. At least the following: SLURM_JOB_NAME, SLURM_NTASKS_PER_NODE, SLURM_PRIO_PROCESS, SLURM_CPUS_PER_TASK, SLURM_SUBMIT_DIR, SLURM_SUBMIT_HOST. Script is:
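
The poster's script is cut off in this listing and is not reproduced here. As a separate, minimal sketch, the following shows one way to check from inside a job which of the variables named above are absent from the environment. The function name `missing_slurm_vars` and the constant `REPORTED_VARS` are hypothetical helpers for illustration, not part of Slurm.

```python
import os

# Variables the post reports as unset under `sbatch --export=NONE`.
REPORTED_VARS = [
    "SLURM_JOB_NAME",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_PRIO_PROCESS",
    "SLURM_CPUS_PER_TASK",
    "SLURM_SUBMIT_DIR",
    "SLURM_SUBMIT_HOST",
]


def missing_slurm_vars(env=None):
    """Return the reported variables that are absent from `env`.

    `env` is any mapping of variable names to values; it defaults to
    the environment of the current process.
    """
    if env is None:
        env = os.environ
    return [name for name in REPORTED_VARS if name not in env]


if __name__ == "__main__":
    # Print one line per variable that is not set in this job.
    for name in missing_slurm_vars():
        print(f"not set: {name}")
```

Running this once in a job submitted with --export=NONE and once without it would show which variables differ between the two cases.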

5 matches
