[slurm-dev] Re: chroot usage for jobs in slurm

2012-04-13 Thread Bjørn-Helge Mevik
for sensitive data, in which there should be no (or as few as possible) possibilities for information to leak between jobs. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] List pending reboots?

2012-04-25 Thread Bjørn-Helge Mevik
a given node has a reboot pending? -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] ETA for slurm 2.4 or 2.4 RC?

2012-05-04 Thread Bjørn-Helge Mevik
We are in the process of setting up a new cluster. It is supposed to go in production by the end of June, and we would prefer to have slurm 2.4 on it. Do you have any plans/ideas for when 2.4 (or at least release candidates for 2.4) will be out? -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: ETA for slurm 2.4 or 2.4 RC?

2012-05-07 Thread Bjørn-Helge Mevik
Moe Jette je...@schedmd.com writes: We hope to have a v2.4.0-rc1 in a couple of weeks and release 2.4 a few weeks later. Very nice! -- Cheers, Bjørn-Helge Mevik

[slurm-dev] Gres: documentation discrepancy

2012-05-08 Thread Bjørn-Helge Mevik
of the generic resources that have allocated to the job. Which one is correct? (I'm voting on man srun. :) -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Is default for NodeAddr NodeName or NodeHostname?

2012-05-25 Thread Bjørn-Helge Mevik
Weight=1027 BootTime=2012-05-08T15:07:08 SlurmdStartTime=2012-05-25T10:30:10 (This is with 2.4.0-0.pre4.) (We are planning to use cx-y instead of compute-x-y (the rocks default) on our next cluster, to save some typing.) -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services

[slurm-dev] Re: How to setup local disk as gres?

2012-07-02 Thread Bjørn-Helge Mevik
ThreadsPerCore=1 TmpDisk=0 Weight=666 BootTime=2012-06-13T16:20:49 SlurmdStartTime=2012-07-02T16:32:31 -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] problem with sstat -j jobid.batch when #nodes 1

2012-07-03 Thread Bjørn-Helge Mevik
] done with job [2012-07-03T17:11:02] error: stat_jobacct for invalid job_id: 195 [2012-07-03T17:11:02] debug: _rpc_terminate_job, uid = 501 [2012-07-03T17:11:02] debug: task_slurmd_release_resources: 195 Is there something wrong here, or are we doing something wrong? -- Regards, Bjørn-Helge

[slurm-dev] Re: Questions about the task/cgroup plugin

2012-09-05 Thread Bjørn-Helge Mevik
% of the time. I can send it to you if you like. Yes, I'd very much like that! Jobs killed by the memory limit are quite common on our cluster, and users get confused if there is no message telling them why the job died. Thanks for a very informative answer! -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: Questions about the task/cgroup plugin

2012-09-06 Thread Bjørn-Helge Mevik
project on google code called oom-detect.lua. You can browse the code here: Thanks! -- Cheers, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Excessive requeueing of jobs.

2012-10-12 Thread Bjørn-Helge Mevik
8 c11-13 (There is a node-failure in there, and the job failed when it finally got to run long enough.) Apart from a short period around 21:00 the 10., less than 7,000 of the ~ 10,000 cores were used. -- Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: Slurm, RHEL6, cgroups and not constraining memory

2013-01-21 Thread Bjørn-Helge Mevik
Christopher Samuel sam...@unimelb.edu.au writes: On 18/01/13 19:53, Bjørn-Helge Mevik wrote: I don't know if this is the reason in your case, but note that cgroup in slurm constrains _resident_ RAM, not _allocated_ (virtual) RAM. Hmm, as a sysadmin that doesn't seem very useful, Hmm

[slurm-dev] How to emulate qsub's -sync y/-Wblock=true

2013-03-05 Thread Bjørn-Helge Mevik
, and then polls the queue system until the job has finished. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Re: How to emulate qsub's -sync y/-Wblock=true

2013-03-06 Thread Bjørn-Helge Mevik
, then launch the main program, and then perhaps do some cleanup afterwards. Thus one wouldn't want the job script itself to be run in parallel. -- Regards, Bjørn-Helge Mevik, dr. scient, Research Computing Services, University of Oslo

[slurm-dev] Strange value for CUDA_VISIBLE_DEVICES

2013-04-10 Thread Bjørn-Helge Mevik
, CUDA_VISIBLE_DEVICES gets the value 0,1633906540 Is this correct? Are we doing something wrong? (This is slurm 2.4.3, running on Rocks 6.0 based on CentOS 6.2.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Strange value of CUDA_VISIBLE_DEVICES

2013-04-11 Thread Bjørn-Helge Mevik
-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Strange value of CUDA_VISIBLE_DEVICES

2013-04-16 Thread Bjørn-Helge Mevik
Gary Brown gbr...@adaptivecomputing.com writes: FYI, the value 1633906540 in hex is 61636f6c, which is ASCII acol and usually points to some kind of buffer overrun bug. Thanks for the tip! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
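The conversion Gary Brown describes is easy to check. The following is a minimal sketch (not from the thread) that prints the hex digits of the stray value and the four bytes read as ASCII in both byte orders. Read big-endian, the bytes spell "acol"; on a little-endian machine the same integer is laid out in memory as "loca".

```python
# The odd second entry seen in CUDA_VISIBLE_DEVICES.
value = 1633906540

# Hexadecimal digits of the value, without the "0x" prefix.
hex_digits = format(value, "x")

# The same 32-bit value interpreted as four ASCII characters,
# in big-endian and little-endian byte order.
big_endian_text = value.to_bytes(4, "big").decode("ascii")
little_endian_text = value.to_bytes(4, "little").decode("ascii")

print(hex_digits, big_endian_text, little_endian_text)
```

A value that decodes to readable text like this suggests that string data was read where an integer was expected, which fits the buffer overrun suspicion.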

[slurm-dev] Re: Strange value of CUDA_VISIBLE_DEVICES

2013-04-16 Thread Bjørn-Helge Mevik
lines solved the problem. Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: slurm - shutdown process

2013-05-10 Thread Bjørn-Helge Mevik
Kevin Abbey kevin.ab...@rutgers.edu writes: This sounds great. If you can share after testing it would be very much appreciated. Will do. (There will be some parts of it tailored to our site, but that shouldn't be hard to remove/change.) -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: Problems when using sched/backfill

2013-05-21 Thread Bjørn-Helge Mevik
to time out after MessageTimeout/2 seconds, but looking at the code for 2.5.6 this seems to have changed. Keep us posted about what you find. I'm planning to switch to 2.5.6 tomorrow, and have from time to time had problems getting the backfilling to be fast enough. -- Regards, Bjørn-Helge Mevik

[slurm-dev] Question about --switches and max_switch_wait

2013-06-17 Thread Bjørn-Helge Mevik
. So in order to allow users to specify a delay of, say, 12 hours, one must set max_switch_wait in slurm.conf to something as large as 12 hours. Is this the right interpretation? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Question about --switches and max_switch_wait

2013-06-17 Thread Bjørn-Helge Mevik
Moe Jette je...@schedmd.com writes: Yes, that is correct. Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-29 Thread Bjørn-Helge Mevik
ESLURM_INVALID_TIME_LIMIT end -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurm User Group Meeting and New releases: v2.6.1, v13.12.0-pre1

2013-08-20 Thread Bjørn-Helge Mevik
Moe Jette je...@schedmd.com writes: Quoting Bjørn-Helge Mevik b.h.me...@usit.uio.no: Moe Jette je...@schedmd.com writes: * Changes in Slurm 13.12.0pre1 == Just curious: Why the sudden jump in version numbering? year.month? Correct. We're Ubuntu fans

[slurm-dev] Difference between sacct's AllocCPUS and NCPUS?

2013-08-20 Thread Bjørn-Helge Mevik
-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: How does sacct honor the -S and -E option?

2013-08-22 Thread Bjørn-Helge Mevik
jobs that were _running_ (or _started_) in an interval, but I don't think it's there. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: How does sacct honor the -S and -E option?

2013-08-23 Thread Bjørn-Helge Mevik
Danny Auble d...@schedmd.com writes: It would have been nice to have the possibility to select jobs that were _running_ (or _started_) in an interval, but I don't think it's there. Just ask for the state to be 'running'. slaps palm on head / :) -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: How does sacct honor the -S and -E option?

2013-08-23 Thread Bjørn-Helge Mevik
:00). Running is also considered eligible. I totally agree your comment on that sacct lacks on the way to filter jobs that are actually within the time interval. As Danny said: add --state=RUNNING. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University

[slurm-dev] Slurmctld dies after restart: Address already in use

2013-09-20 Thread Bjørn-Helge Mevik
[2013-09-20T04:50:16+02:00] debug: power_save module disabled, SuspendTime 0 [2013-09-20T04:50:16+02:00] error: Error binding slurm stream socket: Address already in use [2013-09-20T04:50:16+02:00] fatal: slurm_init_msg_engine_addrname_port error Address already in use -- Regards, Bjørn-Helge

[slurm-dev] Re: Slurmctld dies after restart: Address already in use

2013-09-23 Thread Bjørn-Helge Mevik
production cluster until we know it's safe. If they are not needed, perhaps it would be a good idea for slurmctld to close them when starting the prologs/epilogs? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Completed jobs stuck in RUNNING state in slurmdbd

2013-10-01 Thread Bjørn-Helge Mevik
] attempting to run epilog [/hpc/sbin/epilog_slurmd] [2013-09-30T04:11:10+02:00] debug: completed epilog for jobid 3371606 [2013-09-30T04:11:10+02:00] debug: Job 3371606: sent epilog complete msg: rc = 0 -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University

[slurm-dev] Re: Memory Usage Fairshare

2014-01-16 Thread Bjørn-Helge Mevik
That would be a very interesting feature. Similarly to what Christopher Samuel wrote, we have «hacked around» the issue for project limits (not fairshare) by converting memory usage to processor-equivalents and using Gold for the accounting. -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] Re: reservation/priority problems

2014-01-23 Thread Bjørn-Helge Mevik
in the right partition both jobs start as they should. Sorry for not checking well enough what I did! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Bjørn-Helge Mevik
Just a short note about terminology. I believe processor equivalents (PE) is a much used term for this. It is at least what Maui and Moab use, if I recall correctly. The resource*time would then be PE seconds (or hours, or whatever). -- Regards, Bjørn-Helge Mevik, dr. scient, Department
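As an illustration of the term, here is a rough sketch of how a PE figure could be computed. It assumes the Maui-style definition (the largest requested fraction of any resource on a node, scaled to that node's processor count) and only considers CPUs and memory. The function name, parameters, and numbers are hypothetical and not taken from the thread.

```python
def processor_equivalents(cpus, mem_gb, node_cpus, node_mem_gb):
    """Return the PE of a request on one node type (assumed definition).

    The request is charged for whichever resource it uses the largest
    fraction of, expressed as a number of processors.
    """
    cpu_fraction = cpus / node_cpus
    mem_fraction = mem_gb / node_mem_gb
    return max(cpu_fraction, mem_fraction) * node_cpus


# Hypothetical node with 16 cores and 64 GB of memory.
# A job asking for 2 CPUs but 32 GB occupies half the node's memory,
# so under this definition it counts as 8 processor equivalents.
pe = processor_equivalents(cpus=2, mem_gb=32, node_cpus=16, node_mem_gb=64)

# Usage over a 3-hour run, in PE hours.
pe_hours = pe * 3
print(pe, pe_hours)
```

Under this assumption a memory-heavy job is charged as if it had used the cores it effectively blocks, which is the point of accounting in PE hours rather than CPU hours.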

[slurm-dev] Re: Customized error messages from job_submit.lua?

2014-08-07 Thread Bjørn-Helge Mevik
Thanks! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] make check fails with 14.03.6

2014-08-15 Thread Bjørn-Helge Mevik
.noarch # rpm -q check check-0.9.8-1.1.el6.x86_64 Is there something else we are missing? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Suggested fixes for slurm test suite

2014-08-26 Thread Bjørn-Helge Mevik
-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Bug in sgather

2014-08-26 Thread Bjørn-Helge Mevik
be nice if the node-global destinations could be configurable, instead of being hard-coded in the script (or at least be set at the top of the script). For instance, on our system, the node-global file systems are /work and /cluster, not /scratch and /home. -- Regards, Bjørn-Helge Mevik, dr. scient

[slurm-dev] UserCPU etc. for subprocesses not registered when a job times out.

2014-09-12 Thread Bjørn-Helge Mevik
00:01:07 01:03.980 00:02.207 01:06.187 43COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403 43.batch COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403 i.e., time spent in subprocesses is reported. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research

[slurm-dev] Override memory limits with --exclusive?

2014-09-18 Thread Bjørn-Helge Mevik
this? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Override memory limits with --exclusive?

2014-09-19 Thread Bjørn-Helge Mevik
Thanks for the tip! We actually already have a setup where srun --ntasks=$SLURM_JOB_NUM_NODES /bin/true is run at the start of every job, so we're definitely going to look into this. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] slurmctld crashes on testsuite in 14.03.[8--10]

2014-11-06 Thread Bjørn-Helge Mevik
. In other core dumps of 14.03.9, the g_qos_count was 241 or 233, while _bitstr_bits(tmp_qos_bitstr) was still 26. Any help in figuring out what goes wrong (or how to fix it :) is appreciated! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] _slurm_cgroup_destroy message?

2014-11-18 Thread Bjørn-Helge Mevik
the .../uid_NNN directories are removed. Does anyone know what these messages mean? Should we just ignore them? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: _slurm_cgroup_destroy message?

2014-11-19 Thread Bjørn-Helge Mevik
running the test suite for versions 14.03.8--14.03.10, we didn't upgrade to .10.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemption, requeue and checkpointing?

2015-01-09 Thread Bjørn-Helge Mevik
I second this wish. :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Nested cgroup messages

2015-03-18 Thread Bjørn-Helge Mevik
Thanks! I'll try to apply that patch. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Nested cgroup messages

2015-03-17 Thread Bjørn-Helge Mevik
the memory jobs can use. Our cgroup.conf contains: CgroupMountpoint=/dev/cgroup CgroupAutomount=yes ConstrainSwapSpace=yes and our slurm.conf contains: TaskPlugin=task/cgroup ProctrackType=proctrack/cgroup SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory -- Regards, Bjørn-Helge

[slurm-dev] Re: Expanding TotalCPU to include child processes

2015-03-04 Thread Bjørn-Helge Mevik
time of child processes.” In my experience, that description might not be accurate. It seems also child processes are included, as long as the job doesn't time out. Here is an email I wrote about it last year: From: Bjørn-Helge Mevik b.h.me...@usit.uio.no Subject: [slurm-dev] UserCPU etc

[slurm-dev] (Custom) warnings from job_submit.lua?

2015-05-11 Thread Bjørn-Helge Mevik
.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: (Custom) warnings from job_submit.lua?

2015-05-12 Thread Bjørn-Helge Mevik
Ok, thanks. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Bjørn-Helge Mevik
openmpi itself? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-17 Thread Bjørn-Helge Mevik
Thanks! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Off-topic: What accounting system do you use?

2015-06-24 Thread Bjørn-Helge Mevik
for accounting. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
. Perhaps this has changed in later versions? Also, nothing is ever easy: we want to account not CPU hours, but PE (processor equivalents) hours. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
Christopher Samuel sam...@unimelb.edu.au writes: http://karaage.readthedocs.org/en/latest/introduction.html Karaage looks interesting for managing projects and users. Can it manage usage limits? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Bjørn-Helge Mevik
makes Gold quite slow, so we have had to add quite a lot of error checking and handling in the prolog and epilog scripts. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Understand GrpCPUMins

2015-06-30 Thread Bjørn-Helge Mevik
LIMIT *** That usually means the job tried to run longer than its --time specification. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: [slurm-devel] update SLURM 2.6.7 to SLURM 15.0.8.4

2015-11-16 Thread Bjørn-Helge Mevik
files? They should only be needed with static libraries, which slurm does _not_ install. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-14 Thread Bjørn-Helge Mevik
Thanks. Nice to know! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Bjørn-Helge Mevik
we activated checkpointing. When slurmctld started, the checkpointing plugin expected some extra data in the job states, which obviously wasn't there, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Regards, Bjørn-Helge Mevik, dr

[slurm-dev] Re: Issues with --switches option

2015-09-03 Thread Bjørn-Helge Mevik
d get either all nodes with IB1 or all with IB2. Search for "Matching OR" in the sbatch man page for details. (We used this on our previous cluster, which had two different IB networks.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-10-05 Thread Bjørn-Helge Mevik
e slurm-devel rpm not to include these files. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-15 Thread Bjørn-Helge Mevik
r job shares the same data between several processes, the shared space will be counted once for each process(!). Cgroups seems to count the shared data only once. So if a process is killed by oom instead of by slurm, it is probably not due to shared data. -- Regards, Bjørn-Helge Mevik

[slurm-dev] Re: Preempting without account limits

2015-12-14 Thread Bjørn-Helge Mevik
the qos is enough. We have the partition because our lowpri jobs are allowed to run on special nodes (like hugemem or accelerator nodes) that normal jobs are not allowed to use.) I hope this made sense to you. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
oups. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Bjørn-Helge Mevik
ush the cache when a process needs more memory instead of killing the process. If I'm correct, oom will _not_ kill a job due to cached data. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preempting without account limits

2015-12-18 Thread Bjørn-Helge Mevik
"Wiegand, Paul" <wieg...@ist.ucf.edu> writes: > This worked. Thank you Bjørn-Helge. You're welcome! :) -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Kill Signals Sent By SLURM

2016-02-26 Thread Bjørn-Helge Mevik
automatically requeue jobs just before they time out. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Bug and suggested fix in testsuite test 14.10

2016-02-25 Thread Bjørn-Helge Mevik
Test 14.10 in the test suite (of slurm 15.08.8, at least) uses $sinfo -tidle -h -o%n to find idle nodes. This only works if NodeHostname == NodeName on the nodes. The following should work regardless of this: $scontrol show hostnames \$($sinfo -tidle -h -o%N) -- Regards, Bjørn-Helge

[slurm-dev] What cluster provisioning system do you use?

2016-03-15 Thread Bjørn-Helge Mevik
ool? - A locally developed solution? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Inconsistent reporting of errors in #SBATCH lines

2016-03-11 Thread Bjørn-Helge Mevik
sbatch: option requires an argument -- 'J' Submitted batch job 14221261 $ A more consistent behaviour would have been nice. My suggestion is: report error and fail to submit the job. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Patch for health check during slurmd start

2016-03-03 Thread Bjørn-Helge Mevik
<hm...@t-hamel.fr> writes: > We are looking for comments and feedback on this proposed behavior [...] > +#define HEALTH_RETRY_DELAY 10 Have you thought about using the health_check_interval instead? Or make it a separate configurable option? -- Regards, Bjørn-Helge Mevik

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-29 Thread Bjørn-Helge Mevik
+1 -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: using gdb to debug slurm-15.08?

2016-04-28 Thread Bjørn-Helge Mevik
be Just a note: I tried this (for a different reason), but found out it didn't have any effect (gather the output to a log file and look at the gcc lines). However, if I did -D '%with_cflags CFLAGS="-O0 -g3"' (i.e., removed the initial "_"), it had the desired effect. -- Regar

[slurm-dev] Slurm no longer optimized by default

2016-04-26 Thread Bjørn-Helge Mevik
I just noticed that as of 14.11.6, optimization is turned off (-O0) by default when building slurm. Is there any reason not to use --disable-debug when building slurm for a production cluster? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurm mail domain?

2017-03-02 Thread Bjørn-Helge Mevik
ing or a default "@localhost". A MailDomain config parameter was added in Slurm 17.02. A different option would be to configure your sendmail to accept domain-less mails (and perhaps add a default domain itself). -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, Universi

[slurm-dev] Re: The canonical way to write to user's output (stderr) log file on end of job

2016-08-30 Thread Bjørn-Helge Mevik
ngs, this sets up a signal handler for the EXIT "signal", which prints out resource usage. As long as users remember to source the setup file, they get the usage statistics in the bottom of their stdout file. Not very elegant, but it works. -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Prolog script (maybe) question?

2016-09-15 Thread Bjørn-Helge Mevik
To me, this sounds like a job for a job submit plugin, for instance job_submit.lua. That way you could reject the job before it gets submitted into the queue. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
Jordan Willis <jwillis0...@gmail.com> writes: >Thank you, >Can you confirm that this will take an update from SLURM 14.11.15 to >current? I never ran 14.11, but in 14.03, you can use GrpCPUs=1000 instead of GrpTRES=cpu=1000. -- Regards, Bjørn-Helge Mevik, dr. sci

[slurm-dev] Re: max submit tasks

2016-11-22 Thread Bjørn-Helge Mevik
s per account; if I have access to more than one account, I can use 1000 cpus in each account. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] No longer possible to use scancel in PrologSlurmctld?

2016-11-21 Thread Bjørn-Helge Mevik
I'd guess it should have been possible to use scancel in PrologSlurmctld also in 15.08.12. Does anyone know if this is an intentional change (and SchedMD just forgot to update the docs) or a bug? (I haven't found anything relevant in the NEWS file or on bugs.schedmd.com.) -- Regards, Bjørn-Helge

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-02 Thread Bjørn-Helge Mevik
There is a plugin under development, that will/might provide those features. It was presented at SLUG 16: http://slurm.schedmd.com/SLUG16/MCS.pdf -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: How to use the EpilogSlurmctld to print job statistics

2016-10-13 Thread Bjørn-Helge Mevik
ry usage, etc) commands to generate > my report. Does this approach make sense or are there better > alternatives. sacct can also give you the submit time, start time, end time and elapsed time. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Submit job with maximum ntasks per node

2016-12-14 Thread Bjørn-Helge Mevik
Check out the thread on this list about a week ago, titled "Unrestricted use of a node". (In short, --exclusive with --mem=0 or --mem-per-cpu=0 might be more or less what you want.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-16 Thread Bjørn-Helge Mevik
use the "Proportional set size" [1] (JobAcctGatherParams=UsePss), which is what cgroup uses (I believe), and sounds like the best estimate to me. [1] https://en.wikipedia.org/wiki/Proportional_set_size -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
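To see why the choice of metric matters, here is a small arithmetic sketch comparing summed resident set sizes with summed proportional set sizes for processes that share memory. The job shape and the numbers are made up for illustration, and the helper names are hypothetical.

```python
def summed_rss_mb(private_mb, shared_mb, nprocs):
    """Sum of per-process RSS: every process counts the shared pages in full."""
    return nprocs * (private_mb + shared_mb)


def summed_pss_mb(private_mb, shared_mb, nprocs):
    """Sum of per-process PSS: shared pages are split evenly among the sharers."""
    return nprocs * (private_mb + shared_mb / nprocs)


# Hypothetical job: 4 processes, each with 100 MB of private memory,
# all mapping the same 1024 MB of shared data.
rss_total = summed_rss_mb(100, 1024, 4)
pss_total = summed_pss_mb(100, 1024, 4)
print(rss_total, pss_total)
```

With these numbers the RSS sum counts the shared region four times, while the PSS sum counts it once, which is closer to the memory the job actually occupies.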

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
mber of cores, I don't know. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Bjørn-Helge Mevik
which cpus it is allowed to use.) This is on 15.08.12. YMMV. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Slurmdbd Perl api?

2017-04-03 Thread Bjørn-Helge Mevik
qw(:all SLURMDB_ADD_USER); $what = SLURMDB_ADD_USER(); just gives the error "SLURMDB_ADD_USER is not a valid Slurmdb macro" -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Job-Specific Working Directory on Local Scratch

2017-03-14 Thread Bjørn-Helge Mevik
file names to a dot file in $SCRATCH) - The Epilog copies any registered files back to the job submit dir (it uses "su - $USER" when doing this). - The epilog deletes the directory -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Rolling maintenance jobs

2017-08-02 Thread Bjørn-Helge Mevik
ir job, remove the "fixme" feature from the node, and then request themselves to be requeued. Prior to submitting the jobs, we add the "fixme" feature to all nodes needing maintenance. (In reality, our setup is a little more complex, since it includes reinstalling the os on the nodes,

[slurm-dev] Re: Rolling maintenance jobs

2017-08-03 Thread Bjørn-Helge Mevik
e to "RESUME"), so we will be looking at this feature again. Thanks for the tip! :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: #SBATCH --time= not always overriding default?

2017-06-30 Thread Bjørn-Helge Mevik
a > manner that's predictable, both for the programmer and for the user. It is by design, because people often need to give arguments or options to their jobscript, e.g., sbatch --time=1-0:0:0 myjob.sh inputfile -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Prolog and sbatch

2017-07-02 Thread Bjørn-Helge Mevik
; creation (return from sbatch), but looks like this assumption is wrong? That is right, that is wrong. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Bjørn-Helge Mevik
ool > https://www.nsc.liu.se/~kent/python-hostlist/ by Kent Engström at NSC. It's > simple to install this as an RPM package, see > https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists For the simple case you show, you could just use $ scontrol show hostnames a[095,097-098] a09

[slurm-dev] Re: How to get pids of a job

2017-05-16 Thread Bjørn-Helge Mevik
show the nodes and the pids. Not that I know of, but it should be possible to script. > And how to parse the nodelist like "cn[11033,11069],gn[1103-1120]" ? scontrol show hostnames cn[11033,11069],gn[1103-1120] -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Com
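When scontrol is not available (for instance in a script running off the cluster), simple node lists of this shape can also be expanded by hand. The following is a minimal sketch that handles only a prefix followed by one bracketed list of numbers and ranges. It is not Slurm's full hostlist grammar, and the function names are hypothetical. On a cluster, scontrol show hostnames remains the safer choice.

```python
import re


def split_top_level(expr):
    """Split a node list on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_hostlist(expr):
    """Expand e.g. 'a[095,097-098]' into ['a095', 'a097', 'a098']."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not match:
            # Plain host name without brackets.
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low
            width = len(low)  # keep zero padding such as 095
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


print(expand_hostlist("cn[11033,11069],gn[1103-1120]"))
```

Keeping the width of the first number in each range preserves zero padding, so a[095,097-098] expands to a095, a097 and a098 rather than a95, a97 and a98.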

[slurm-dev] Re: thoughts on task preemption

2017-05-23 Thread Bjørn-Helge Mevik
efore modifying anything. In our setup, the first option is preferable; just putting it on the queue and let it wait until its turn. But of course, there are other setups where the second option would be best. Could you perhaps make it configurable, so a site can choose? -- Regards, Bjørn-Helge Mevik,

[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-06 Thread Bjørn-Helge Mevik
Life=0 to turn off the decaying of historic usage. Then you can set FairshareWeight to 0 and use the Grp*Mins parameters to set hard limits. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemtion and signals

2017-10-10 Thread Bjørn-Helge Mevik
something wrong with how my partitions are defined? That sounds unlikely, IMO. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Preemtion and signals

2017-10-09 Thread Bjørn-Helge Mevik
s the signal arrives. I got bit by this behaviour trying to do exactly the same that you did. :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
