[slurm-dev] Re: How to strictly limit the memory per CPU

2017-11-03 Thread Bjørn-Helge Mevik
job for the lua job submit plugin (job_submit.lua). It can check what users have specified, write out custom errors or change the settings of jobs. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Bjørn-Helge Mevik
of small, distributed jobs running, and a long queue of pending jobs), I personally wouldn't want SchedMD to sacrifice that for making updates of node lists easier. Especially since I haven't seen the problem JinSung Kang reports. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for R

[slurm-dev] Re: Preemtion and signals

2017-10-10 Thread Bjørn-Helge Mevik
rong with how my partitions are defined? That sounds unlikely, IMO. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemtion and signals

2017-10-09 Thread Bjørn-Helge Mevik
soon as the signal arrives. I got bitten by this behaviour trying to do exactly the same as you did. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Rolling maintenance jobs

2017-08-03 Thread Bjørn-Helge Mevik
will be looking at this feature again. Thanks for the tip! :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Rolling maintenance jobs

2017-08-02 Thread Bjørn-Helge Mevik
feature from the node, and then request themselves to be requeued. Prior to submitting the jobs, we add the "fixme" feature to all nodes needing maintenance. (In reality, our setup is a little more complex, since it includes reinstalling the OS on the nodes, but the principle is the same

[slurm-dev] Re: Prolog and sbatch

2017-07-02 Thread Bjørn-Helge Mevik
ch), but looks like this assumption is wrong? That is right, that is wrong. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: #SBATCH --time= not always overriding default?

2017-06-30 Thread Bjørn-Helge Mevik
dictable, both for the programmer and for the user. It is by design, because people often need to give arguments or options to their jobscript, e.g., sbatch --time=1-0:0:0 myjob.sh inputfile -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-06 Thread Bjørn-Helge Mevik
ric usage. Then you can set FairshareWeight to 0 and use the Grp*Mins parameters to set hard limits. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: thoughts on task preemption

2017-05-23 Thread Bjørn-Helge Mevik
tup, the first option is preferable; just putting it on the queue and letting it wait until its turn. But of course, there are other setups where the second option would be best. Could you perhaps make it configurable, so a site can choose? -- Regards, Bjørn-Helge Mevik, dr. scient, Department

[slurm-dev] Re: How to get pids of a job

2017-05-16 Thread Bjørn-Helge Mevik
e pids. Not that I know of, but it should be possible to script. > And how to parse the nodelist like "cn[11033,11069],gn[1103-1120]" ? scontrol show hostnames cn[11033,11069],gn[1103-1120] -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-02 Thread Bjørn-Helge Mevik
u.se/~kent/python-hostlist/ by Kent Engström at NSC. It's > simple to install this as an RPM package, see > https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists For the simple case you show, you could just use $ scontrol show hostnames a[095,097-098] a095 a097 a098 -- Regards,
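For scripts that cannot call scontrol, the expansion shown above can be approximated in a few lines of Python. This is only a minimal sketch: the function name expand_hostlist is made up for illustration, and it assumes each name has at most one bracket group with nothing after the closing bracket. `scontrol show hostnames` or the python-hostlist package remain the tools to trust for anything more complex.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host list such as 'a[095,097-098]' into host names."""
    hosts = []
    # Split on commas that are not inside a [...] group.
    for group in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", group)
        if not match:
            hosts.append(group)  # plain host name, nothing to expand
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single number is a range of one
            width = len(low)    # preserve zero padding, e.g. 095
            for n in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


if __name__ == "__main__":
    print("\n".join(expand_hostlist("a[095,097-098]")))
```

Run on the example from the message, this prints a095, a097 and a098, one per line.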

[slurm-dev] Slurmdbd Perl api?

2017-04-03 Thread Bjørn-Helge Mevik
ke use Slurmdb qw(:all SLURMDB_ADD_USER); $what = SLURMDB_ADD_USER(); just gives the error "SLURMDB_ADD_USER is not a valid Slurmdb macro" -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Job-Specific Working Directory on Local Scratch

2017-03-14 Thread Bjørn-Helge Mevik
file names to a dot file in $SCRATCH) - The Epilog copies any registered files back to the job submit dir (it uses "su - $USER" when doing this). - The epilog deletes the directory -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurm mail domain?

2017-03-02 Thread Bjørn-Helge Mevik
omain config parameter was added in Slurm 17.02. A different option would be to configure your sendmail to accept domain-less mails (and perhaps add a default domain itself). -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-16 Thread Bjørn-Helge Mevik
size" [1] (JobAcctGatherParams=UsePss), which is what cgroup uses (I believe), and sounds like the best estimate to me. [1] https://en.wikipedia.org/wiki/Proportional_set_size -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Submit job with maximum ntasks per node

2016-12-14 Thread Bjørn-Helge Mevik
Check out the thread on this list about a week ago, titled "Unrestricted use of a node". (In short, --exclusive with --mem=0 or --mem-per-cpu=0 might be more or less what you want.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
cpus it is allowed to use.) This is on 15.08.12. YMMV. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
tell this Slurm. - OK, I can ask for 24 > cores and 64 GB in a node, but then I do not get the chance to run on 12 > cores/32 GB. For the memory part, you could specify --mem=0. That will allocate all of the memory on whichever node the job lands on. For the number of cores, I don't know.

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
Jordan Willis writes: >Thank you, >Can you confirm that this will take an update from SLURM 14.11.15 to >current? I never ran 14.11, but in 14.03, you can use GrpCPUs=1000 instead of GrpTRES=cpu=1000. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
access to more than one account, I can use 1000 cpus in each account. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] No longer possible to use scancel in PrologSlurmctld?

2016-11-21 Thread Bjørn-Helge Mevik
'd guess it should have been possible to use scancel in PrologSlurmctld also in 15.08.12. Does anyone know if this is an intentional change (and SchedMD just forgot to update the docs) or a bug? (I haven't found anything relevant in the NEWS file or on bugs.schedmd.com.) -- Regards, B

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-02 Thread Bjørn-Helge Mevik
apart from that, this is my understanding too. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-02 Thread Bjørn-Helge Mevik
There is a plugin under development, that will/might provide those features. It was presented at SLUG 16: http://slurm.schedmd.com/SLUG16/MCS.pdf -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: How to use the EpilogSlurmctld to print job statistics

2016-10-13 Thread Bjørn-Helge Mevik
to generate > my report. Does this approach make sense or are there better > alternatives. sacct can also give you the submit time, start time, end time and elapsed time. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Prolog script (maybe) question?

2016-09-15 Thread Bjørn-Helge Mevik
To me, this sounds like a job for a job submit plugin, for instance job_submit.lua. That way you could reject the job before it gets submitted into the queue. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: The canonical way to write to user's output (stderr) log file on end of job

2016-08-30 Thread Bjørn-Helge Mevik
", which prints out resource usage. As long as users remember to source the setup file, they get the usage statistics at the bottom of their stdout file. Not very elegant, but it works. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: QoS TRES limits

2016-08-02 Thread Bjørn-Helge Mevik
low" qos, and to get that to work, we've found that we must put the account's normal limits on a qos, not on the account itself. Usually this means that we have a qos for each account, and then a common "low" qos. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: using gdb to debug slurm-15.08?

2016-04-27 Thread Bjørn-Helge Mevik
md appear to be Just a note: I tried this (for a different reason), but found out it didn't have any effect (gather the output to a log file and look at the gcc lines). However, if I did -D '%with_cflags CFLAGS="-O0 -g3"' (i.e., removed the initial "_"), it had the

[slurm-dev] Slurm no longer optimized by default

2016-04-25 Thread Bjørn-Helge Mevik
I just noticed that as of 14.11.6, optimization is turned off (-O0) by default when building slurm. Is there any reason not to use --disable-debug when building slurm for a production cluster? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-29 Thread Bjørn-Helge Mevik
+1 -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] What cluster provisioning system do you use?

2016-03-15 Thread Bjørn-Helge Mevik
provisioning tool? - A locally developed solution? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Inconsistent reporting of errors in #SBATCH lines

2016-03-11 Thread Bjørn-Helge Mevik
tch empty-jobname.sm sbatch: option requires an argument -- 'J' Submitted batch job 14221261 $ A more consistent behaviour would have been nice. My suggestion is: report error and fail to submit the job. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Patch for health check during slurmd start

2016-03-03 Thread Bjørn-Helge Mevik
writes: > We are looking for comments and feedback on this proposed behavior [...] > +#define HEALTH_RETRY_DELAY 10 Have you thought about using the health_check_interval instead? Or make it a separate configurable option? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Re

[slurm-dev] Re: Kill Signals Sent By SLURM

2016-02-26 Thread Bjørn-Helge Mevik
obs just before they time out. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Bug and suggested fix in testsuite test 14.10

2016-02-25 Thread Bjørn-Helge Mevik
Test 14.10 in the test suite (of slurm 15.08.8, at least) uses $sinfo -tidle -h -o%n to find idle nodes. This only works if NodeHostname == NodeName on the nodes. The following should work regardless of this: $scontrol show hostnames \$($sinfo -tidle -h -o%N) -- Regards, Bjørn-Helge

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
n a process needs more memory instead of killing the process. If I'm correct, oom will _not_ kill a job due to cached data. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preempting without account limits

2015-12-18 Thread Bjørn-Helge Mevik
"Wiegand, Paul" writes: > This worked. Thank you Bjørn-Helge. You're welcome! :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
Felip Moll writes: > I will try JobAcctGatherParams to NoShared. > > This is an example of job step being killed. It's being killed by oom, but > it's invoked by cgroups: Since the job was killed by the oom, NoShared will not help. It does not affect cgroups. -- Rega

[slurm-dev] Re: cgroups and memory accounting

2015-12-15 Thread Bjørn-Helge Mevik
e data between several processes, the shared space will be counted once for each process(!). Cgroups seems to count the shared data only once. So if a process is killed by oom instead of by slurm, it is probably not due to shared data. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Re
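The double counting described here is easy to see with a little arithmetic. The sketch below is only an illustration with made-up numbers and hypothetical function names: it compares summing per-process resident sizes (shared pages counted once per process) with a proportional accounting in which the shared segment is split between the processes that map it, so that it is counted once in total.

```python
def rss_total(private_mib, shared_mib, nprocs):
    """Sum of per-process resident sizes: shared pages counted once per process."""
    return nprocs * (private_mib + shared_mib)


def proportional_total(private_mib, shared_mib, nprocs):
    """Proportional accounting: each process is charged its share of the shared pages."""
    per_process = private_mib + shared_mib / nprocs
    return nprocs * per_process


if __name__ == "__main__":
    # Four processes, each with 100 MiB private data, sharing one 1000 MiB segment.
    print(rss_total(100, 1000, 4))           # shared segment counted four times
    print(proportional_total(100, 1000, 4))  # shared segment counted once
```

With these numbers the first sum is 4400 MiB while the memory actually in use is 1400 MiB, which is the kind of gap that can make a per-process sum exceed a limit that a cgroup-style count would not.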

[slurm-dev] Re: Preempting without account limits

2015-12-14 Thread Bjørn-Helge Mevik
s enough. We have the partition because our lowpri jobs are allowed to run on special nodes (like hugemem or accelerator nodes) that normal jobs are not allowed to use.) I hope this made sense to you. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: [slurm-devel] update SLURM 2.6.7 to SLURM 15.0.8.4

2015-11-16 Thread Bjørn-Helge Mevik
with static libraries, which slurm does _not_ install. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-14 Thread Bjørn-Helge Mevik
Thanks. Nice to know! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Bjørn-Helge Mevik
we activated checkpointing. When slurmctld started, the checkpointing plugin expected some extra data in the job states, which obviously wasn't there, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Regards, Bjørn-Helge

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-10-05 Thread Bjørn-Helge Mevik
thorough, unfortunately, but according to https://lists.fedoraproject.org/pipermail/mingw/2012-January/004421.html the .la files are only needed in order to link against static libraries, and since Slurm doesn't provide any static libraries, I guess it would be safe for the slurm-devel rpm not to

[slurm-dev] Re: Issues with --switches option

2015-09-03 Thread Bjørn-Helge Mevik
h IB1 or all with IB2. Search for "Matching OR" in the sbatch man page for details. (We used this on our previous cluster, which had two different IB networks.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Understand GrpCPUMins

2015-06-30 Thread Bjørn-Helge Mevik
LIMIT *** That usually means the job tried to run longer than its --time specification. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
atabase makes Gold quite slow, so we have had to add quite a lot of error checking and handling in the prolog and epilog scripts. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
Christopher Samuel writes: > http://karaage.readthedocs.org/en/latest/introduction.html Karaage looks interesting for managing projects and users. Can it manage usage limits? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
.03.7. Perhaps this has changed in later versions? Also, "nothing is ever easy": we want to account not CPU hours, but PE (processor equivalents) hours. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Off-topic: What accounting system do you use?

2015-06-24 Thread Bjørn-Helge Mevik
accounting. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: (Custom) warnings from job_submit.lua?

2015-05-12 Thread Bjørn-Helge Mevik
Ok, thanks. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] (Custom) warnings from job_submit.lua?

2015-05-11 Thread Bjørn-Helge Mevik
.03.7, btw.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-17 Thread Bjørn-Helge Mevik
Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
pi itself? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
rds, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: "Nested cgroup" messages

2015-03-18 Thread Bjørn-Helge Mevik
Thanks! I'll try to apply that patch. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] "Nested cgroup" messages

2015-03-17 Thread Bjørn-Helge Mevik
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Expanding TotalCPU to include child processes

2015-03-04 Thread Bjørn-Helge Mevik
clude CPU time of child processes.” In my experience, that description might not be accurate. It seems also child processes are included, as long as the job doesn't time out. Here is an email I wrote about it last year: From: Bjørn-Helge Mevik Subject: [slurm-dev] UserCPU etc. for subprocess

[slurm-dev] Re: Preemption, requeue and checkpointing?

2015-01-09 Thread Bjørn-Helge Mevik
I second this wish. :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: _slurm_cgroup_destroy message?

2014-11-19 Thread Bjørn-Helge Mevik
ng the test suite for versions 14.03.8--14.03.10, we didn't upgrade to .10.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] _slurm_cgroup_destroy message?

2014-11-18 Thread Bjørn-Helge Mevik
the .../uid_NNN directories are removed. Does anyone know what these messages mean? Should we just ignore them? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] slurmctld crashes on testsuite in 14.03.[8--10]

2014-11-06 Thread Bjørn-Helge Mevik
_bitstr_bits(tmp_qos_bitstr) was still 26. Any help in figuring out what goes wrong (or how to fix it :) is appreciated! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Override memory limits with --exclusive?

2014-09-19 Thread Bjørn-Helge Mevik
Thanks for the tip! We actually already have a setup where "srun --ntasks=$SLURM_JOB_NUM_NODES /bin/true" is run at the start of every job, so we're definitely going to look into this. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Override memory limits with --exclusive?

2014-09-18 Thread Bjørn-Helge Mevik
like this? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] UserCPU etc. for subprocesses not registered when a job times out.

2014-09-12 Thread Bjørn-Helge Mevik
COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187 43 COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403 43.batch COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403 i.e., time spent in subprocesses is reported. -- Regards, Bjørn-Helge Mevik, dr. scient, Department fo

[slurm-dev] Bug in sgather

2014-08-26 Thread Bjørn-Helge Mevik
ient. It would also be nice if the node-global destinations could be configurable, instead of being hard-coded in the script (or at least be set at the top of the script). For instance, on our system, the node-global file systems are /work and /cluster, not /scratch and /home. -- Regards, B

[slurm-dev] Suggested fixes for slurm test suite

2014-08-26 Thread Bjørn-Helge Mevik
nfortunately, I don't know enough Expect (tcl?) to suggest how to implement that. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: make check fails with 14.03.6

2014-08-18 Thread Bjørn-Helge Mevik
writes: > As far as I can tell, it has been this way (broken) forever. It will be fixed > in 14.03.7. Thx! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] make check fails with 14.03.6

2014-08-15 Thread Bjørn-Helge Mevik
agnu dejagnu-1.4.4-17.el6.noarch # rpm -q check check-0.9.8-1.1.el6.x86_64 Is there something else we are missing? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Customized error messages from job_submit.lua?

2014-08-07 Thread Bjørn-Helge Mevik
Thanks! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Customized error messages from job_submit.lua?

2014-08-07 Thread Bjørn-Helge Mevik
that are printed on the user's stderr? If so, how? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Bjørn-Helge Mevik
Just a short note about terminology. I believe "processor equivalents" (PE) is a widely used term for this. It is at least what Maui and Moab use, if I recall correctly. The "resource*time" would then be PE seconds (or hours, or whatever). -- Regards, Bjørn-Helge Mevik, dr
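To make the term concrete, here is a minimal sketch of how a PE count could be computed for a job that requests CPUs and memory. The formula (the largest requested fraction of any node resource, scaled by the node's CPU count) is my understanding of the usual Maui-style definition and is an assumption here, as are the function and parameter names; a site may well define it differently.

```python
def processor_equivalents(cpus, mem_mib, node_cpus, node_mem_mib):
    """PE = largest requested fraction of a node resource, scaled to CPUs.

    Assumed Maui-style definition, for illustration only.
    """
    cpu_fraction = cpus / node_cpus
    mem_fraction = mem_mib / node_mem_mib
    return max(cpu_fraction, mem_fraction) * node_cpus


def pe_hours(pe, elapsed_hours):
    """The 'resource*time' quantity: PE multiplied by elapsed wall time."""
    return pe * elapsed_hours


if __name__ == "__main__":
    # One CPU but 8 GiB on a 16-core, 64 GiB node occupies two cores' worth of memory.
    pe = processor_equivalents(1, 8192, 16, 65536)
    print(pe, pe_hours(pe, 10))
```

Under this definition a job is charged for whichever resource it uses the larger share of, so a memory-heavy single-core job costs more than one CPU.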

[slurm-dev] Re: How to spread jobs among nodes?

2014-05-08 Thread Bjørn-Helge Mevik
tributed across many nodes. Also see the SelectParameters configuration parameter CR_LLN to use the least loaded nodes in every partition. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Waiting for sbatch-Job completion?

2014-02-13 Thread Bjørn-Helge Mevik
Nicolai Stange writes: > Hi Bjorn-Helge, > > thank you very much for your reply! > > Bjørn-Helge Mevik writes: >> (We did have some problems when using srun inside the script, but I >> believe mpirun should work.) > Indeed it does for OpenMPI and MVAPICH2. >

[slurm-dev] Re: Waiting for sbatch-Job completion?

2014-02-12 Thread Bjørn-Helge Mevik
will _not_ be taken into account; all specifications must be in $slurmargs). The salloc command will wait until the command ($script) has finished before exiting. (We did have some problems when using srun inside the script, but I believe mpirun should work.) -- Regards, Bjørn-Helge Mevik, dr

[slurm-dev] Re: reservation/priority problems

2014-01-23 Thread Bjørn-Helge Mevik
artition both jobs start as they should. Sorry for not checking well enough what I did! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: reservation/priority problems

2014-01-17 Thread Bjørn-Helge Mevik
the priority of the job with # scontrol update jobid=20 nice=-1000 does not help. It still does not start. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Memory Usage Fairshare

2014-01-16 Thread Bjørn-Helge Mevik
That would be a very interesting feature. Similarly to what Christopher Samuel wrote, we have «hacked around» the issue for project limits (not fairshare) by converting memory usage to processor-equivalents and using Gold for the accounting. -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: Secure access to SlurmDB

2013-11-15 Thread Bjørn-Helge Mevik
If you switch to use slurmdbd, the password goes into slurmdbd.conf, which only exists on the head node, and only needs to be readable by root. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Proper way of finding how many jobs are currently running/pending ?

2013-10-18 Thread Bjørn-Helge Mevik
Damien François writes: > Hello, > > what is the most efficient way of finding how many jobs are currently > running, pending, etc in the system ? I tend to use squeue -h -o %T | sort | uniq -c -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
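The same count can be done from a script. The following is a small sketch, not a polished tool: count_states only tallies lines of text, and current_state_counts (a name chosen here for illustration) assumes squeue is on the PATH and uses the same -h -o %T options as the one-liner above.

```python
import subprocess
from collections import Counter


def count_states(squeue_output):
    """Tally job states from `squeue -h -o %T` output (one state per line)."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip()
    )


def current_state_counts():
    """Run squeue and tally the states; requires Slurm's squeue in PATH."""
    result = subprocess.run(
        ["squeue", "-h", "-o", "%T"],
        check=True, capture_output=True, text=True,
    )
    return count_states(result.stdout)


if __name__ == "__main__":
    for state, count in sorted(current_state_counts().items()):
        print(f"{count:7d} {state}")
```

Keeping the parsing separate from the squeue call makes it possible to test the tally on canned output.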

[slurm-dev] Re: Completed jobs stuck in RUNNING state in slurmdbd

2013-10-07 Thread Bjørn-Helge Mevik
Bjørn-Helge Mevik writes: We have investigated a bit further: > 1) What happened here? I didn't look far enough in the logs. It turns out that slurmctld segfaulted later the same day (last message in the log was at 2013-09-30T14:34:56+02:00). When it was started, it said: [2013-09-

[slurm-dev] Re: Completed jobs stuck in RUNNING state in slurmdbd

2013-10-02 Thread Bjørn-Helge Mevik
"Loris Bennett" writes: > Hi Bjørn-Helge, > > Bjørn-Helge Mevik > writes: > >> We are running slurm 2.5.6. >> >> This night, about 10 jobs were scheduled and completed, but they are >> still listed as RUNNING in sacct. For instance: >>

[slurm-dev] Completed jobs stuck in RUNNING state in slurmdbd

2013-10-01 Thread Bjørn-Helge Mevik
pilog [2013-09-30T04:11:10+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf [2013-09-30T04:11:10+02:00] Running spank/epilog for jobid [3371606] uid [4010] [2013-09-30T04:11:10+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf [2013-09-30T04:11:10+02:00] debug: [job 3371606] attemptin

[slurm-dev] Re: Slurmctld dies after restart: "Address already in use"

2013-09-24 Thread Bjørn-Helge Mevik
m/SchedMD/slurm/commit/29094e33fcbb4f29e9512059bbdd18ba3504134c > > That fixes several of the problems. I'm not sure why the job_state.new file > is reported by lsof, but will probably investigate further at a later time. Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmctld dies after restart: "Address already in use"

2013-09-23 Thread Bjørn-Helge Mevik
ent ill effects, but don't want to do that on our production cluster until we know it's safe. If they are not needed, perhaps it would be a good idea for slurmctld to close them when starting the prologs/epilogs? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Slurmctld dies after restart: "Address already in use"

2013-09-20 Thread Bjørn-Helge Mevik
[2013-09-20T04:50:16+02:00] debug: power_save module disabled, SuspendTime < 0 [2013-09-20T04:50:16+02:00] error: Error binding slurm stream socket: Address already in use [2013-09-20T04:50:16+02:00] fatal: slurm_init_msg_engine_addrname_port error Address already in use -- Regards, Bjørn-He

[slurm-dev] Re: How does sacct honor the "-S" and "-E" option?

2013-08-23 Thread Bjørn-Helge Mevik
n 2013-05-12T23:03:59 and 2013-05-13T00:00:00). Running is also considered "eligible". > I totally agree your comment on that sacct lacks on the way to filter jobs > that are actually within the time interval. As Danny said: add --state=RUNNING. :) -- Regards, Bjørn-Helge Mevik,
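As a sketch of that suggestion, the helper below only builds the sacct command line that combines the -S and -E window with --state=RUNNING; the function name is hypothetical, and running the command of course requires sacct and access to the accounting database.

```python
import subprocess


def sacct_running_between(start, end):
    """Build a sacct command for jobs that were running in [start, end]."""
    return ["sacct", "-S", start, "-E", end, "--state=RUNNING"]


if __name__ == "__main__":
    cmd = sacct_running_between("2013-05-12T23:03:59", "2013-05-13T00:00:00")
    print(" ".join(cmd))
    # Uncomment on a system with Slurm accounting available:
    # subprocess.run(cmd, check=True)
```

Any further options (output fields, all users, and so on) would be appended to the returned list.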

[slurm-dev] Re: How does sacct honor the "-S" and "-E" option?

2013-08-23 Thread Bjørn-Helge Mevik
Danny Auble writes: >>It would have been nice to have the possibility to select jobs that >>were >>_running_ (or _started_) in an interval, but I don't think it's there. > > Just ask for the state to be 'running'. :) -- Regards, Bjørn-Helg

[slurm-dev] Re: How does sacct honor the "-S" and "-E" option?

2013-08-22 Thread Bjørn-Helge Mevik
ice to have the possibility to select jobs that were _running_ (or _started_) in an interval, but I don't think it's there. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Difference between sacct's AllocCPUS and NCPUS?

2013-08-19 Thread Bjørn-Helge Mevik
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurm User Group Meeting and New releases: v2.6.1, v13.12.0-pre1

2013-08-19 Thread Bjørn-Helge Mevik
Moe Jette writes: > Quoting Bjørn-Helge Mevik : > >> >> Moe Jette writes: >> >>> * Changes in Slurm 13.12.0pre1 >>> == >> >> Just curious: Why the sudden jump in version numbering? year.month? > > Correct

[slurm-dev] Re: Slurm User Group Meeting and New releases: v2.6.1, v13.12.0-pre1

2013-08-19 Thread Bjørn-Helge Mevik
to use whole nodes.) > -- Add mechanism for job_submit plugin to generate error message for srun, > salloc or sbatch to log. But still not to stderr, right? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: RLIMIT_DATA effectively a no-op on Linux

2013-07-20 Thread Bjørn-Helge Mevik
allocated 16.1 GiB virtual memory, but is only using 104 MiB resident.) I would suggest looking at cgroups for limiting memory usage. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-29 Thread Bjørn-Helge Mevik
2051 -- Signal ESLURM_INVALID_TIME_LIMIT end -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-26 Thread Bjørn-Helge Mevik
a lot of threads) will "use" a lot of VMEM. There was a change in glibc 2.something (I think it was) in how VMEM is allocated for threads. For instance, our slurmctld right now "uses" 16 GiB VMEM, but only 117 MiB RSS. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for R
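The two numbers quoted here can be read on Linux from /proc/<pid>/status, which reports VmSize (virtual size) and VmRSS (resident size) in kB. A minimal sketch, with a hypothetical function name and a made-up sample in place of a real slurmctld process:

```python
def parse_vm_fields(status_text):
    """Return VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(value.split()[0])  # value looks like "  119808 kB"
    return fields


if __name__ == "__main__":
    # Sample mirroring the figures above: 16 GiB virtual, 117 MiB resident.
    sample = "Name:\tslurmctld\nVmSize:\t16777216 kB\nVmRSS:\t  119808 kB\n"
    vm = parse_vm_fields(sample)
    print(f"virtual {vm['VmSize'] / 1024**2:.1f} GiB, "
          f"resident {vm['VmRSS'] / 1024:.0f} MiB")
```

For a live process, the text would come from reading /proc/<pid>/status for the pid in question.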

[slurm-dev] Re: Question about --switches and max_switch_wait

2013-06-17 Thread Bjørn-Helge Mevik
Moe Jette writes: > Yes, that is correct. Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Question about --switches and max_switch_wait

2013-06-17 Thread Bjørn-Helge Mevik
lue a job can specify for "delay". So in order to allow users to specify a delay of, say, 12 hours, one must set max_switch_wait in slurm.conf to something as large as 12 hours. Is this the right interpretation? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Problems when using sched/backfill

2013-05-21 Thread Bjørn-Helge Mevik
scheduler used to time out after MessageTimeout/2 seconds, but looking at the code for 2.5.6 this seems to have changed. Keep us posted about what you find. I'm planning to switch to 2.5.6 tomorrow, and have from time to time had problems getting the backfilling to be fast enough. -- Regar

[slurm-dev] Re: slurm - shutdown process

2013-05-09 Thread Bjørn-Helge Mevik
Kevin Abbey writes: > This sounds great. If you can share after testing it would be very > much appreciated. Will do. (There will be some parts of it tailored to our site, but that shouldn't be hard to remove/change.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department f
