about.
That said, to my untutored eye that looks more like a munge problem than
anything else - you will want to check that your keys are the same and
that your clocks are in sync (NTP is your friend).
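For example, a quick way to check both across the nodes (a sketch only, assuming pdsh is available; node names are placeholders):

  # same munge key everywhere?
  pdsh -w node[01-04] 'md5sum /etc/munge/munge.key' | dshbak -c

  # clocks in sync?
  pdsh -w node[01-04] 'ntpq -pn' | dshbak -c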
Best of luck,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian
uggest benchmarking
with your current configuration versus disabling HT and running on real
cores only.
Basically, whichever gets you better throughput should be your default
config.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Compu
o really grok what it's saying..
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
On 18/09/16 03:45, John DeSantis wrote:
> Try adding a "DefMemPerCPU" statement in your partition definitions, e.g
You can also set that globally.
# Global default for jobs - request 2GB per core wanted.
DefMemPerCPU=2048
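A per-partition version (as John suggests) would look something like this - the partition and node names below are just placeholders:

  PartitionName=batch Nodes=node[01-10] DefMemPerCPU=2048 State=UP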
All the best,
Chris
--
Christopher Samuel - S
al RAM/core ratio on the
low memory nodes on one system, it's 1/8th of the low memory nodes on
another system so making it lower doesn't buy us much and 2 GB/core
means most NAMD jobs will run without issues.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
V
udburst and were convinced it was
related to that, but this looks like the actual problem.
Now why slurmctld doesn't update that information on an upgrade is
another matter altogether.
Thanks!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Scien
urm.conf, this one escaped me!
There are always new ones there, I swear they're breeding..
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www
On 19/09/16 22:58, Christopher Samuel wrote:
> Thanks so much Ulf, you've just answered a puzzle I've been seeing on an
> x86 cluster I'm helping out with!
...and stopping slurmctld & slurmd's (slurmdbd was left going), moving
/var/spool/slurm/jobs/node_stat
ce.
SSH host based authentication within the cluster helps with that, along
with caching SSH keys in /etc/ssh/ssh_known_hosts.
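A rough sketch of the pieces involved (host names are placeholders, see sshd_config(5) and ssh_config(5) for the full story):

  # gather host keys once, from an admin node:
  ssh-keyscan -t ed25519,rsa node01 node02 >> /etc/ssh/ssh_known_hosts

  # on the nodes, in /etc/ssh/sshd_config:
  HostbasedAuthentication yes

  # on the clients, in /etc/ssh/ssh_config:
  HostbasedAuthentication yes
  EnableSSHKeysign yes

  # plus list the cluster hosts in /etc/ssh/shosts.equiv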
Best of luck,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone:
up in a later release (can't remember when sorry!).
Hope that helps,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
re how far back you
> can go, but I suspect 14.x talking to a 16.x dbd would be fine.
Slurm supports 2 major releases behind.
So a 16.05.x slurmdbd should talk to 15.08.x and 14.11.x
slurmctld's but *not* 14.03.x.
All the best,
Chris
--
Christopher Samuel - Senior Systems Admi
clusters which
is around 3GB for 8 million job steps.
Neither causes us any issues these days (we used to have a problem when,
for complicated historical reasons, slurmdbd was running on a 32-bit VM
and could run out of memory).
Admittedly we do have beefy database servers. :-)
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
On 27/09/16 17:40, Philippe wrote:
> /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
I think you want to check whether that's really restarting it or just
doing an "scontrol reconfigure" which won't (shouldn't) restart it.
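A quick way to tell is to compare the slurmctld process start time before and after, e.g.:

  ps -o pid,lstart,cmd -C slurmctld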
--
Christoph
haven't added the cluster to slurmdbd with "sacctmgr" yet
so I suspect all your accounting info is getting lost.
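If that is what has happened then something along these lines should sort it out (the cluster name here is a placeholder - use the ClusterName from your slurm.conf):

  sacctmgr list clusters
  sacctmgr add cluster mycluster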
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
estion - you've got the shutdown log from slurmctld and the
start log of a slurmd - what happens when slurmctld starts up?
That might be your clue about why your jobs are getting killed.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Comput
of offline nodes and make a script to restore via scontrol (see the sketch after this list)
2) shutdown slurmctld and all slurmds
3) move the node_stat* files out of the way
4) start up slurmd again
5) start up slurmctld
6) run the script created at step 1
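For step 1, something along these lines captures the offline nodes and their reasons so you can reapply them at step 6 (a sketch only - review the output before feeding it back to scontrol):

  sinfo -h -N -t drain,down -o "%N %T %E" > offline_nodes.txt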
Hope that helps!
All the best,
Chris
--
Christopher Samuel - Senior Sy
On 26/09/16 16:51, Lachlan Musicman wrote:
> Does this mean that it's now considered acceptable to run cgroups for
> ProcTrackType?
We've been running with that on all our x86 clusters since we switched
to Slurm, haven't seen an issue yet.
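For anyone curious, a minimal sketch of that sort of setup (not an exact config, adjust for your site):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes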
All the best,
Chris
--
On 28/09/16 16:25, Barbara Krasovec wrote:
> Yes, this worked! Thank you very much for your help!
My pleasure!
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
h
On 29/09/16 01:16, John DeSantis wrote:
> We get the same snippet when our logrotate takes action against the
> cltdlog:
Does your slurmctld restart then too?
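If it is the logrotate config doing a restart: newer Slurm daemons will reopen their log files on SIGUSR2 (check the slurm.conf man page for your version), so a postrotate along these lines avoids the restart entirely (the log path is a placeholder):

  /var/log/slurm/slurmctld.log {
      weekly
      postrotate
          pkill -USR2 slurmctld
      endscript
  }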
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Emai
AP lookup to rewrite users email to the value in
LDAP)
But really this isn't a Slurm issue, it's a host config issue for Postfix.
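In Postfix terms that is usually a canonical map backed by LDAP, something like this (file name, server and attributes are placeholders for whatever your schema uses):

  # main.cf
  canonical_maps = ldap:/etc/postfix/ldap-canonical.cf

  # /etc/postfix/ldap-canonical.cf
  server_host = ldap.example.org
  search_base = ou=People,dc=example,dc=org
  query_filter = (uid=%u)
  result_attribute = mail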
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimel
s as well and if they're out of step, then GPFS will stop
working on the node, making Slurm the least of your worries. :-)
So just run ntpd.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@
ned up.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Login Proper Name Used Energy
---------  -----------------  ----------  ----------
avoca vlscisamuel Christopher Sa+ 151030
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +
ct that's what's triggering the different display in sreport, a
line per association/partition.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.a
tions aren't
getting blocked, and also check that the hostname correctly resolves.
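A couple of quick sanity checks along those lines:

  scontrol ping                  # can this host reach the slurmctld?
  getent hosts $(hostname -s)    # does the short hostname resolve?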
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
re already running
disparate number of jobs using variable cores, how do I see what cores
on what nodes Slurm has allocated my running job?
I know I can go and poke around with cgroups, but is there a way to get
that out of squeue, sstat or sacct?
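For what it's worth, the detailed job display does list the allocated CPU IDs per node, though that's scontrol rather than the commands above:

  scontrol -d show job 1234567    # job ID here is just an example
  # look for the "Nodes=... CPU_IDs=... Mem=..." lines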
All the best,
Chris
--
Christopher Samuel
I guess you will have to query the
> cgroup hierarchy.
No need, I'm just trying to automate the detection of bad jobs which are
spanning nodes but not using the cores on the other nodes and I wanted a
way to quantify how many cores were being wasted by the job.
Thanks again!
Chris
--
Ch
On 28/10/16 08:44, Lachlan Musicman wrote:
> So I checked the system, noticed that one node was drained, resumed it.
> Then I tried both
>
> scontrol requeue 230591
> scontrol resume 230591
What happens if you "scontrol hold" it first before "scontrol release"?
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
ontact them directly.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
On 02/11/16 02:01, Riebs, Andy wrote:
> Interesting -- thanks for the info Chris.
No worries, it's a bit sad I think, but I can understand it.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.
tion in a partition-oriented format. This is ignored if
the --format option is specified.
Except it's not being ignored when you use --format (-o).
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation
any period of time that information
will be lost.
We build from source and use:
StateSaveLocation = /var/spool/slurm/jobs
but the decision is yours where exactly to put it.
But /tmp is almost certainly the second worst place (after /dev/shm).
All the best,
Chris
--
Christopher Samue
cpu=6,mem=4G,node=1mic:1
6449483.extern extern cpu=6,mem=4G,node=1mic:1
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
htt
On 09/11/16 09:50, Lachlan Musicman wrote:
> I don't know Chris, I think that /dev/null would rate tbh. :)
Ah, but that's a file (OK character special device), not a directory. ;-)
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Science
is that the batch step is of course
only on the first node, but it says it was allocated 2 GRES.
I suspect that's just a symptom of Slurm only keeping a total
number.
I don't think Slurm can give you an uneven GRES allocation, but
the SchedMD folks would need to confirm that I'm af
ic:1 Reservation=(null)
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Having private containers is on the roadmap for Shifter.
Shifter also integrates with Slurm.
All the best!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://
but apparently with Slurm it's not - just tested it out and using:
--gres mic
results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0
set in its environment.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Comp
On 17/11/16 11:31, Christopher Samuel wrote:
> It depends on the library used to pass options,
Oops - that should be parse, not pass.
Need more caffeine..
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email:
ng.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
rwise you're at the mercy of what your mpiexec chooses to do.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
ources.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
JobAcctGatherType=jobacct_gather/cgroup
If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but from highly
unreliable memory had to revert due to similar issues to those you are
seeing).
Best of luck,
Chris
--
Christopher S
cause of this issue (from memory).
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
ml
Hope this helps!
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
I strongly believe that will be necessary, sorry!
Best of luck,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
puts between tasks.
OK, I'm not sure how Slurm will behave with multiple srun's and cons_res
and CR_LLN but it's still worth a shot.
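For reference, that combination is set along these lines in slurm.conf (a sketch, assuming cons_res is already in use):

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory,CR_LLN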
Best of luck!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@uni
rm?
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
into. You do need PrologFlags=contain for that to ensure that all
jobs get an "extern" batch step on job creation for these processes to
be adopted into.
We use both here with great success.
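The moving parts are roughly as follows (a sketch - see the pam_slurm_adopt documentation for the PAM stack ordering caveats):

  # slurm.conf
  PrologFlags=contain

  # /etc/pam.d/sshd on the compute nodes
  account    required    pam_slurm_adopt.so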
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Vi
On 10/01/17 10:57, Christopher Samuel wrote:
> If you are unlucky enough to have SSH based job launchers then you would
> also look at the BYU contributed pam_slurm_adopt
Actually this is useful even without that as it allows users to SSH into
a node they have a job on and not disturb the
he best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
On 10/01/17 18:56, Ole Holm Nielsen wrote:
> For the record: Torque will always send mail if a job is aborted
It's been a few years since I've used Torque so I don't remember that
behaviour.
Thanks for the info!
--
Christopher Samuel - Senior Systems Administrator
me to their registered email
address that's stored in LDAP.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
area is a high-performance
parallel filesystem shared across all nodes).
https://github.com/vlsci/spank-private-tmp
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3
;s been because the slurmdbd cannot connect
back to slurmctld to send RPCs on the IP address that slurmctld has
registered with slurmdbd.
What does this say?
sacctmgr list clusters
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computatio
ay. :-(
Might be a feature request..
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
[...]
Best of luck,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
t of
architectures) individually.
Best of luck!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
ck up again.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Torque+Moab/Maui here and at VPAC
before that - we would always start Moab paused so we could check out
what impact any changes had to our queues & priorities before starting
jobs running.
Measure twice, cut once.
cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
VL
ystems having no more than 2500 nodes
or the cube root for larger systems. The value may not exceed
65533.
If so then I suspect that this is a possible transient DNS failure?
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI -
be useful to us here too.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
ps with srun you can
also monitor them as the job is going with 'sstat' (rather than just
post-mortem with sacct).
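For example, something like this while the job is running (field names are from the sstat man page, $JOBID being the job in question):

  sstat -j $JOBID --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite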
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
't really
blame Slurm for not catering to this. It can use cgroups to partition
cores to jobs precisely so it doesn't need to care what the load average
is - it knows the kernel is ensuring the cores the jobs want are not
being stomped on by other tasks.
Best of luck!
Chris
--
Christop
e best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
save/job.830332/environment, No such
> file or directory
I would suggest that you are looking at transient NFS failures (which
may not be logged).
Are you using NFSv3 or v4 to talk to the NFS server and what are the
OS's you are using for both?
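A quick way to check what you're actually mounted with (run on the slurmctld host):

  nfsstat -m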
cheers,
Chris
--
Christopher Samue
ep' of the source code after reading 'man sacct' and not
finding anything (also running 'sacct -e' and not seeing anything useful
there either) doesn't offer much hope.
Anyone else dealing with this?
We're on 16.05.x at the moment with slurmdbd.
All the best
0 NodeAddr=thing-knc[01-03]
RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2
You'll also need to restart slurmctld & all slurmd's to pick up
this new config, I don't think "scontrol reconfigure" will deal
with this.
Best of luck,
Chris
--
ble again.
+1 for running your own LDAP.
I would seriously look at a cluster toolkit for running nodes,
especially if it supports making a single image that your compute nodes
then netboot. That way you know everything is consistent.
Best of luck,
Chris
--
Christopher Samuel - Senior Syst
so we fell back to using our own LDAP server with Karaage to manage
project/account applications, adding people to slurmdbd, etc.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
age-Cluster
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
a node and
then Slurm isn't going to put more jobs there (unless you tell it to
ignore memory, which is not likely to end well).
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au
e understand what might be wrong?
Anything setting a drain state is meant to also set a reason, what does
"scontrol show node $NODE" say for these?
Also are there any relevant messages in your slurmctld and slurmd logs?
Best of luck,
Chris
--
Christopher Samuel - Senior Systems Ad
by
> a job have finished at completion?
Are you not using cgroups for enforcement?
Usually that picks everything up.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
MPI launchers (and other naughtiness).
Good luck!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
nfo --format="%60N %.15G %.30E %.10A"
The reason can be quite long, but there doesn't seem to be a way to just
show the status as down/drain/idle/etc.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email
e time.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
On 06/06/17 23:46, Edward Walter wrote:
> Doesn't that functionality come from a spank plugin?
> https://github.com/hautreux/slurm-spank-x11
Yes, that's the one we use. Works nicely.
Provides the --x11 option for srun.
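Typical usage is then something along the lines of:

  srun --x11 xterm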
All the best,
Chris
--
Christopher Samuel
R} format=MaxJobsPerUser
For a more general view you would do:
sacctmgr list user ${USER} withassoc
Hope this helps,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
limits.html
Best of luck!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
On 07/08/17 14:08, Lachlan Musicman wrote:
> In slurm.conf, there is a RebootProgram - does this need to be a direct
> link to a bin or can it be a command?
We have:
RebootProgram = /sbin/reboot
Works for us.
cheers,
Chris
--
Christopher Samuel - Senior Systems Adminis
own for that.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
On 14/08/17 08:55, Lachlan Musicman wrote:
> Was it here I read that proctrack/linuxproc was better than
> proctrack/cgroup?
I think you're thinking of JobAcctGatherType, but even then our
experience there was that jobacct_gather/cgroup was more accurate.
--
Christopher Samuel
you are running in
for your SSH session and not the job!
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
s expiry parameters), and removing them will likely break its
statistics and probably do Bad Things(tm).
Here be dragons..
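For reference, the expiry parameters in question live in slurmdbd.conf - letting slurmdbd do the pruning itself is much safer than hand-editing the database (values below are only examples):

  PurgeEventAfter=12months
  PurgeJobAfter=12months
  PurgeStepAfter=12months
  PurgeSuspendAfter=1month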
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
hen put serial jobs at
the end of the available nodes rather than using a best fit
algorithm. This may reduce resource fragmentation for some workloads.
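For reference, I believe the man page text quoted there is the pack_serial_at_end option, i.e.:

  SchedulerParameters=pack_serial_at_end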
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Me
-a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
e constrain jobs via cgroups and have found that using the cgroup
plugin for this results in jobs not getting killed incorrectly.
Using cgroups in Slurm is a definite win for us, so I would suggest
looking into it if you've not already done so.
All the best,
Chris
--
Christopher Samuel
ach HT unit a core to run a job on.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
pute bound the usual advice
is to disable HT in the BIOS, but for I/O bound things you may not be so
badly off.
Hope that helps!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
ent is allowed to decode it.
So if the UID's & GID's of the user differ across systems then it
appears it will not allow the receiver to validate the message.
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of M
about the
actual hardware layout.
What does "lscpu" say?
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
ple of questions:
1) Have you restarted slurmctld and slurmd everywhere?
2) Can you confirm that slurm.conf is the same everywhere?
3) what does slurmd -C report?
cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Emai
eadsPerCore configured.
cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
n into this container. Setting the Contain flag
implicitly sets the Alloc flag.
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
ensure that they can run jobs, but that's a separate issue to whether
slurmdbd can resolve users in LDAP.
I would hope that Bright would have the ability to do that for you
rather than having you handle it manually, but that's a question for Bright.
Best of luck,
Chris
--
Christopher Sa
e and assign/change their
target.
All the best,
Chris
--
Christopher Samuel - Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545