[slurm-dev] Re: SLURM v14 munge issues auth to SLURM v16 DBD

2016-09-11 Thread Christopher Samuel
about. That said, to my untutored eye that looks more like a munge problem than anything else - you will want to check that your keys are the same and that your clocks are in sync (NTP is your friend). Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-12 Thread Christopher Samuel
uggest benchmarking with your current configuration versus disabling HT and running on real cores only. Basically, whichever gets you better throughput should be your default config. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Compu

[slurm-dev] Re: cpu identifier

2016-09-14 Thread Christopher Samuel
o really grok what it's saying.. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-18 Thread Christopher Samuel
On 18/09/16 03:45, John DeSantis wrote: > Try adding a "DefMemPerCPU" statement in your partition definitions, e.g You can also set that globally. # Global default for jobs - request 2GB per core wanted. DefMemPerCPU=2048 All the best, Chris -- Christopher SamuelS
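
As a rough illustration of what that default means in practice, the sketch below works out how much memory a job is given when it asks for cores but no memory, and how many such jobs then fit on one node. The node size (16 cores, 64000 MB) and the function names are made up for the example, and real scheduling looks at more than cores and memory.

```python
DEF_MEM_PER_CPU_MB = 2048  # the DefMemPerCPU value from the message above


def default_job_memory_mb(cpus, def_mem_per_cpu=DEF_MEM_PER_CPU_MB):
    """Memory implied for a job that requests cores but no memory."""
    return cpus * def_mem_per_cpu


def jobs_per_node(node_cpus, node_mem_mb, job_cpus, job_mem_mb):
    """How many identical jobs fit when both cores and memory are consumed.

    Simplification: ignores GRES, partition limits and everything else.
    """
    return min(node_cpus // job_cpus, node_mem_mb // job_mem_mb)


# Hypothetical node: 16 cores, 64000 MB of memory.
mem = default_job_memory_mb(4)
print(mem, jobs_per_node(16, 64000, 4, mem))
```

If a job were charged the whole node's memory instead (64000 MB here), the same function returns 1, which matches the one-job-per-node symptom in the subject line.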

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-19 Thread Christopher Samuel
al RAM/core ratio on the low memory nodes on one system, it's 1/8th of the low memory nodes on another system so making it lower doesn't buy us much and 2 GB/core means most NAMD jobs will run without issues. All the best, Chris -- Christopher SamuelSenior Systems Administrator V

[slurm-dev] Re: fwd_tree_thread ... failed to forward the message

2016-09-19 Thread Christopher Samuel
udburst and were convinced it was related to that, but this looks like the actual problem. Now why slurmctld doesn't upgrade that information on an upgrade is another matter altogether. Thanks! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Scien

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-19 Thread Christopher Samuel
urm.conf, this one escaped me! There's always new ones there, I swear they're breeding.. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www

[slurm-dev] Re: fwd_tree_thread ... failed to forward the message

2016-09-19 Thread Christopher Samuel
On 19/09/16 22:58, Christopher Samuel wrote: > Thanks so much Ulf, you've just answered a puzzle I've been seeing on an > x86 cluster I'm helping out with! ...and stopping slurmctld & slurmd's (slurmdbd was left going), moving /var/spool/slurm/jobs/node_stat

[slurm-dev] Re: X11 plugin problems

2016-09-20 Thread Christopher Samuel
ce. SSH host based authentication within the cluster helps with that, along with caching SSH keys in /etc/ssh/ssh_known_hosts. Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone:

[slurm-dev] Re: Slurm array scheduling question

2016-09-20 Thread Christopher Samuel
up in a later release (can't remember when sorry!). Hope that helps, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel
re how far back you > can go, but I suspect 14.x talking to a 16.x dbd would be fine. Slurm supports 2 major releases behind. So a 16.05.x slurmdbd should talk to 15.08.x and 14.11.x slurmctld's but *not* 14.03.x. All the best, Chris -- Christopher SamuelSenior Systems Admi
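
The "2 major releases behind" rule can be written down as a small check. This is only a sketch: the release list covers just the four versions named in the message, the function names are invented for the example, and rejecting a slurmctld that is newer than slurmdbd is an assumption based on the usual upgrade order (slurmdbd first).

```python
# Ordered major releases named in the message above (oldest first).
RELEASES = ["14.03", "14.11", "15.08", "16.05"]


def major(version):
    """Reduce a full version such as '16.05.4' to its major release '16.05'."""
    return ".".join(version.split(".")[:2])


def dbd_accepts(dbd_version, ctld_version, behind=2):
    """True if slurmctld is the same major release as slurmdbd,
    or at most `behind` major releases older."""
    dbd = RELEASES.index(major(dbd_version))
    ctld = RELEASES.index(major(ctld_version))
    return 0 <= dbd - ctld <= behind


for ctld in ("15.08.12", "14.11.9", "14.03.0"):
    print(ctld, dbd_accepts("16.05.4", ctld))
```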

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-25 Thread Christopher Samuel
clusters which is around 3GB for 8 million job steps. Neither cause us any issues these days (we used to have a problem when, for complicated historical reasons, slurmdbd was running on a 32-bit VM and could run out of memory). Admittedly we do have beefy database servers. :-) All the best, Chris

[slurm-dev] Bug in node suspend/resume config code with scontrol reconfigure in 16.05.x (bugzilla #3078)

2016-09-25 Thread Christopher Samuel
All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
On 27/09/16 17:40, Philippe wrote: > /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null I think you want to check whether that's really restarting it or just doing an "scontrol reconfigure" which won't (shouldn't) restart it. -- Christoph

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
't added the cluster to slurmdbd with "sacctmgr" yet so I suspect all your accounting info is getting lost. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
estion - you've got the shutdown log from slurmctld and the start log of a slurmd - what happens when slurmctld starts up? That might be your clue about why your jobs are getting killed. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Comput

[slurm-dev] Re: Invalid Protocol Version

2016-09-27 Thread Christopher Samuel
of offline nodes and make a script to restore via scontrol 2) shutdown slurmctld and all slurmds 3) move the node_stat* files out of the way 4) start up slurmd again 5) start up slurmctld 6) run the script created at step 1 Hope that helps! All the best, Chris -- Christopher SamuelSenior Sy
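
For step 1, one way to build the restore script is to generate one "scontrol update" line per offline node. The sketch below assumes you have already collected each node's name, state and reason by hand; the sample node names and reasons are hypothetical, and the function name is invented for the example.

```python
import shlex


def restore_commands(offline_nodes):
    """Build one 'scontrol update' line per (name, state, reason) tuple."""
    lines = []
    for name, state, reason in offline_nodes:
        lines.append(
            "scontrol update NodeName={} State={} Reason={}".format(
                shlex.quote(name), state, shlex.quote(reason)
            )
        )
    return lines


# Hypothetical list of nodes that were offline before the shutdown.
offline = [
    ("node01", "DRAIN", "bad DIMM"),
    ("node07", "DOWN", "PSU replacement"),
]
print("\n".join(restore_commands(offline)))
```

Saving that output to a file before step 2 gives you the script to run at step 6.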

[slurm-dev] Re: CGroups

2016-09-27 Thread Christopher Samuel
On 26/09/16 16:51, Lachlan Musicman wrote: > Does this mean that it's now considered acceptable to run cgroups for > ProcTrackType? We've been running with that on all our x86 clusters since we switched to Slurm, haven't seen an issue yet. All the best, Chris --

[slurm-dev] Re: Invalid Protocol Version

2016-09-27 Thread Christopher Samuel
On 28/09/16 16:25, Barbara Krasovec wrote: > Yes, this worked! Thank you very much for your help! My pleasure! -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 h

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-28 Thread Christopher Samuel
On 29/09/16 01:16, John DeSantis wrote: > We get the same snippet when our logrotate takes action against the > cltdlog: Does your slurmctld restart then too? -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Emai

[slurm-dev] Re: Send notification email

2016-10-04 Thread Christopher Samuel
AP lookup to rewrite users email to the value in LDAP) But really this isn't a Slurm issue, it's a host config issue for Postfix. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimel

[slurm-dev] Re: Best way to control synchronized clocks in cluster?

2016-10-06 Thread Christopher Samuel
s as well and if they're out of step well then GPFS will stop working on the node making Slurm the least of your worries. :-) So just run ntpd. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@

[slurm-dev] Re: Send notification email

2016-10-06 Thread Christopher Samuel
ned up. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel
Login Proper Name Used Energy - --- - --- avoca vlscisamuel Christopher Sa+ 151030 -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +

[slurm-dev] Re: sreport "duplicate" lines

2016-10-20 Thread Christopher Samuel
ct that's what's triggering the different display in sreport, a line per association/partition. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.a

[slurm-dev] Re: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-24 Thread Christopher Samuel
tions aren't getting blocked, and also check that the hostname correctly resolves. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Query number of cores allocated per node for a job

2016-10-25 Thread Christopher Samuel
re already running disparate number of jobs using variable cores, how do I see what cores on what nodes Slurm has allocated my running job? I know I can go and poke around with cgroups, but is there a way to get that out of squeue, sstat or sacct? All the best, Chris -- Christopher Samuel

[slurm-dev] Re: Query number of cores allocated per node for a job

2016-10-26 Thread Christopher Samuel
I guess you will have to query the > cgroup hierarchy. No need, I'm just trying to automate the detection of bad jobs which are spanning nodes but not using the cores on the other nodes and I wanted a way to quantify how many cores were being wasted by the job. Thanks again! Chris -- Ch
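
Once the per-node allocation is known, quantifying the waste is simple arithmetic. The sketch below is illustrative only: it assumes you already have, from whatever source, the cores allocated per node and the cores each node is actually keeping busy, and both dictionaries and the function name are hypothetical.

```python
def wasted_cores(allocated, busy):
    """Return (total wasted cores, nodes where the job uses no cores at all).

    allocated: {node: cores allocated to the job}
    busy:      {node: cores the job is actually using}
    """
    wasted = 0
    idle_nodes = []
    for node, cores in sorted(allocated.items()):
        used = min(busy.get(node, 0), cores)  # never count more than allocated
        wasted += cores - used
        if used == 0:
            idle_nodes.append(node)
    return wasted, idle_nodes


# Hypothetical two-node job that only runs on its first node.
print(wasted_cores({"n1": 16, "n2": 16}, {"n1": 16}))
```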

[slurm-dev] Re: How to restart a job "(launch failed requeued held)"

2016-10-27 Thread Christopher Samuel
On 28/10/16 08:44, Lachlan Musicman wrote: > So I checked the system, noticed that one node was drained, resumed it. > Then I tried both > > scontrol requeue 230591 > scontrol resume 230591 What happens if you "scontrol hold" it first before "scontrol release"

[slurm-dev] Re: Slurm versions 16.05.6 and 17.02.0-pre3 are now available

2016-10-30 Thread Christopher Samuel
All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Passing binding information

2016-10-31 Thread Christopher Samuel
ontact them directly. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Passing binding information

2016-11-02 Thread Christopher Samuel
On 02/11/16 02:01, Riebs, Andy wrote: > Interesting -- thanks for the info Chris. No worries, it's a bit sad I think, but I can understand it. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.

[slurm-dev] Re: sinfo man page

2016-11-07 Thread Christopher Samuel
tion in a partition-oriented format. This is ignored if the --format option is specified. Except it's not being ignored when you use --format (-o). All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation

[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel
any period of time that information will be lost. We build from source and use: StateSaveLocation = /var/spool/slurm/jobs but the decision is yours where exactly to put it. But /tmp is almost certainly the second worst place (after /dev/shm). All the best, Chris -- Christopher Samue

[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-08 Thread Christopher Samuel
cpu=6,mem=4G,node=1mic:1 6449483.extern extern cpu=6,mem=4G,node=1mic:1 All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 htt

[slurm-dev] Re: Re:

2016-11-08 Thread Christopher Samuel
On 09/11/16 09:50, Lachlan Musicman wrote: > I don't know Chris, I think that /dev/null would rate tbh. :) Ah, but that's a file (OK character special device), not a directory. ;-) -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Science

[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-08 Thread Christopher Samuel
is that the batch step is of course only on the first node, but it says it was allocated 2 GRES. I suspect that's just a symptom of Slurm only keeping a total number. I don't think Slurm can give you an uneven GRES allocation, but the SchedMD folks would need to confirm that I'm af

[slurm-dev] Re: How to account how many cpus/gpus per node has been allocated to a specific job?

2016-11-13 Thread Christopher Samuel
ic:1 Reservation=(null) All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Using slurm to control container images?

2016-11-15 Thread Christopher Samuel
Having private containers is on the roadmap for Shifter. Shifter also integrates with Slurm. All the best! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://

[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel
but apparently with Slurm it's not - just tested it out and using: --gres mic results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0 set in its environment. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Comp

[slurm-dev] Re: Gres issue

2016-11-16 Thread Christopher Samuel
On 17/11/16 11:31, Christopher Samuel wrote: > It depends on the library used to pass options, Oops - that should be parse, not pass. Need more caffeine.. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email:

[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-17 Thread Christopher Samuel
ng. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] RE: Wrong behaviour of "--tasks-per-node" flag

2016-11-20 Thread Christopher Samuel
rwise you're at the mercy of what your mpiexec chooses to do. -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: job arrays, fifo queueing not wanted

2016-12-14 Thread Christopher Samuel
ources. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel
JobAcctGatherType=jobacct_gather/cgroup If the former, try the latter and see if it helps get better numbers (we went to the former after suggestions from SchedMD but from highly unreliable memory had to revert due to similar issues to those you are seeing). Best of luck, Chris -- Christopher S

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel
cause of this issue (from memory). -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
ml Hope this helps! All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
I strongly believe that will be necessary, sorry! Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Christopher Samuel
puts between tasks. OK, I'm not sure how Slurm will behave with multiple srun's and cons_res and CR_LLN but it's still worth a shot. Best of luck! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@uni

[slurm-dev] Re: mail job status to user

2017-01-09 Thread Christopher Samuel
rm? All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel
into. You do need PrologFlags=contain for that to ensure that all jobs get an "extern" batch step on job creation for these processes to be adopted into. We use both here with great success. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Vi

[slurm-dev] Re: Prolog behavior with and without srun

2017-01-09 Thread Christopher Samuel
On 10/01/17 10:57, Christopher Samuel wrote: > If you are unlucky enough to have SSH based job launchers then you would > also look at the BYU contributed pam_slurm_adopt Actually this is useful even without that as it allows users to SSH into a node they have a job on and not disturb the

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
he best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
On 10/01/17 18:56, Ole Holm Nielsen wrote: > For the record: Torque will always send mail if a job is aborted It's been a few years since I've used Torque so I don't remember that behaviour. Thanks for the info! -- Christopher SamuelSenior Systems Administrator

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Christopher Samuel
me to their registered email address that's stored in LDAP. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Job temporary directory

2017-01-22 Thread Christopher Samuel
area is a high-performance parallel filesystem shared across all nodes). https://github.com/vlsci/spank-private-tmp All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3

[slurm-dev] Re: New User Creation Issue

2017-01-24 Thread Christopher Samuel
's been because the slurmdbd cannot connect back to slurmctld to send RPCs on the IP address that slurmctld has registered with slurmdbd. What does this say? sacctmgr list clusters cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computatio

[slurm-dev] Re: Daytime Interactive jobs

2017-01-29 Thread Christopher Samuel
ay. :-( Might be a feature request.. cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: 16.05.8 bug with memory handling?

2017-01-29 Thread Christopher Samuel
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:* [...] Best of luck, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: Job priority/cluster utilization help

2017-02-08 Thread Christopher Samuel
t of architectures) individually. Best of luck! Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel
cheers, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel
ck up again. All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

[slurm-dev] Re: New User Creation Issue

2017-02-15 Thread Christopher Samuel
Torque+Moab/Maui here and at VPAC before that - we would always start Moab paused so we could check out what impact any changes had to our queues & priorities before starting jobs running. Measure twice, cut once. cheers! Chris -- Christopher SamuelSenior Systems Administrator VL

[slurm-dev] Re: slurmctld not pinging at regular interval

2017-02-19 Thread Christopher Samuel
ystems having no more than 2500 nodes or the cube root for larger systems. The value may not exceed 65533. If so then I suspect that this is a possible transient DNS failure? All the best, Chris -- Christopher SamuelSenior Systems Administrator VLSCI -

[slurm-dev] Re: Storage of job submission and working directory paths

2017-03-07 Thread Christopher Samuel
be useful to us here too. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: reporting used memory with job Accounting or Completion plugins?

2017-03-12 Thread Christopher Samuel
ps with srun you can also monitor them as the job is going with 'sstat' (rather than just post-mortem with sacct). All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-19 Thread Christopher Samuel
't really blame Slurm for not catering to this. It can use cgroups to partition cores to jobs precisely so it doesn't need to care what the load average is - it knows the kernel is ensuring the cores the jobs want are not being stomped on by other tasks. Best of luck! Chris -- Christop

[slurm-dev] Re: Scheduling jobs according to the CPU load

2017-03-21 Thread Christopher Samuel
e best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Randomly jobs failures

2017-04-11 Thread Christopher Samuel
save/job.830332/environment, No such > file or directory I would suggest that you are looking at transient NFS failures (which may not be logged). Are you using NFSv3 or v4 to talk to the NFS server and what are the OS's you are using for both? cheers, Chris -- Christopher Samue

[slurm-dev] Distinguishing past jobs that waited due to dependencies vs resources?

2017-04-11 Thread Christopher Samuel
ep' of the source code after reading 'man sacct' and not finding anything (also running 'sacct -e' and not seeing anything useful there either) doesn't offer much hope. Anyone else dealing with this? We're on 16.05.x at the moment with slurmdbd. All the best

[slurm-dev] Re: Jobs submitted simultaneously go on the same GPU

2017-04-11 Thread Christopher Samuel
0 NodeAddr=thing-knc[01-03] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2 You'll also need to restart slurmctld & all slurmd's to pick up this new config, I don't think "scontrol reconfigure" will deal with this. Best of luck, Chris --

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Christopher Samuel
ble again. +1 for running your own LDAP. I would seriously look at a cluster toolkit for running nodes, especially if it supports making a single image that your compute nodes then netboot. That way you know everything is consistent. Best of luck, Chris -- Christopher SamuelSenior Syst

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel
so we fell back to using our own LDAP server with Karaage to manage project/account applications, adding people to slurmdbd, etc. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: LDAP required?

2017-04-19 Thread Christopher Samuel
age-Cluster All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: discrepancy between node config and # of cpus found

2017-05-21 Thread Christopher Samuel
a node and then Slurm isn't going to put more jobs there (unless you tell it to ignore memory, which is not likely to end well). All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au

[slurm-dev] Re: Compute nodes going to drained/draining state

2017-05-23 Thread Christopher Samuel
e understand what might be wrong? Anything setting a drain state is meant to also set a reason, what does "scontrol show node $NODE" say for these? Also are there any relevant messages in your slurmctld and slurmd logs? Best of luck, Chris -- Christopher SamuelSenior Systems Ad

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
by > a job have finished at completion? Are you not using cgroups for enforcement? Usually that picks everything up. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Job ends successfully but spawned processes still run?

2017-05-23 Thread Christopher Samuel
MPI launchers (and other naughtiness). Good luck! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: sinfo

2017-05-24 Thread Christopher Samuel
nfo --format="%60N %.15G %.30E %.10A" The reason can be quite long, but there doesn't seem to be a way to just show the status as down/drain/idle/etc. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email

[slurm-dev] Re: Accounting: preventing scheduling after TRES limit reached (permanently)

2017-06-04 Thread Christopher Samuel
e time. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: srun - replacement for --x11?

2017-06-06 Thread Christopher Samuel
On 06/06/17 23:46, Edward Walter wrote: > Doesn't that functionality come from a spank plugin? > https://github.com/hautreux/slurm-spank-x11 Yes, that's the one we use. Works nicely. Provides the --x11 option for srun. All the best, Chris -- Christopher Samuel

[slurm-dev] Re: How to get Qos limits

2017-06-06 Thread Christopher Samuel
R} format=MaxJobsPerUser For a more general view you would do: sacctmgr list user ${USER} withassoc Hope this helps, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-02 Thread Christopher Samuel
limits.html Best of luck! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel
On 07/08/17 14:08, Lachlan Musicman wrote: > In slurm.conf, there is a RebootProgram - does this need to be a direct > link to a bin or can it be a command? We have: RebootProgram = /sbin/reboot Works for us. cheers, Chris -- Christopher SamuelSenior Systems Adminis

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-08 Thread Christopher Samuel
own for that. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Proctrack cgroup; documentation bug

2017-08-13 Thread Christopher Samuel
On 14/08/17 08:55, Lachlan Musicman wrote: > Was it here I read that proctrack/linuxproc was better than > proctrack/cgroup? I think you're thinking of JobAcctGatherType, but even then our experience there was that jobacct_gather/cgroup was more accurate. -- Christopher Samuel

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Christopher Samuel
you are running in for your SSH session and not the job! cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Delete jobs from slurmctld runtime database

2017-08-23 Thread Christopher Samuel
s expiry parameters), and removing them will likely break its statistics and probably do Bad Things(tm). Here be dragons.. -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Fair resource scheduling

2017-08-27 Thread Christopher Samuel
hen put serial jobs at the end of the available nodes rather than using a best fit algorithm. This may reduce resource fragmentation for some workloads. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Me

[slurm-dev] Re: Jobs cancelled "DUE TO TIME LIMIT" long before actual timelimit

2017-08-30 Thread Christopher Samuel
-a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151 cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Exceeded job memory limit problem

2017-09-06 Thread Christopher Samuel
e constrain jobs via cgroups and have found that using the cgroup plugin for this results in jobs not getting killed incorrectly. Using cgroups in Slurm is a definite win for us, so I would suggest looking into it if you've not already done so. All the best, Chris -- Christopher Samuel

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel
ach HT unit a core to run a job on. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-12 Thread Christopher Samuel
pute bound the usual advice is to disable HT in the BIOS, but for I/O bound things you may not be so badly off. Hope that helps! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: On the need for slurm uid/gid consistency

2017-09-13 Thread Christopher Samuel
ent is allowed to decode it. So if the UID's & GID's of the user differ across systems then it appears it will not allow the receiver to validate the message. cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of M

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel
about the actual hardware layout. What does "lscpu" say? cheers, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Christopher Samuel
ple of questions: 1) Have you restarted slurmctld and slurmd everywhere? 2) Can you confirm that slurm.conf is the same everywhere? 3) what does slurmd -C report? cheers! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Emai
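
Question 2 is easy to check mechanically by comparing checksums. The sketch below assumes you have already fetched each host's slurm.conf contents by some means of your own; the host names, the sample contents and the function name are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_checksum(configs):
    """Group hosts by the SHA-256 of their config file contents.

    configs: {hostname: file contents as bytes}
    Returns a list of host lists, largest group first; more than one
    group means the file is not the same everywhere.
    """
    groups = defaultdict(list)
    for host, content in sorted(configs.items()):
        groups[hashlib.sha256(content).hexdigest()].append(host)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical contents: one node has a stale copy.
configs = {
    "ctl": b"ThreadsPerCore=2\n",
    "n1": b"ThreadsPerCore=2\n",
    "n2": b"ThreadsPerCore=1\n",
}
print(group_by_checksum(configs))
```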

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-17 Thread Christopher Samuel
eadsPerCore configured. cheers! Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-19 Thread Christopher Samuel
n into this container. Setting the Contain implicitly sets the Alloc flag. -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

[slurm-dev] Re: Accounting using LDAP ?

2017-09-19 Thread Christopher Samuel
ensure that they can run jobs, but that's a separate issue to whether slurmdbd can resolve users in LDAP. I would hope that Bright would have the ability to do that for you rather than having you handle it manually, but that's a question for Bright. Best of luck, Chris -- Christopher Sa

[slurm-dev] Re: Accounting using LDAP ?

2017-09-19 Thread Christopher Samuel
e and assign/change their target. All the best, Chris -- Christopher SamuelSenior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
