users can't override it? Or any way to disable --mem and --mem-per-cpu?
I believe you can restrict the amount of memory jobs can use via TRES
functionality:
https://slurm.schedmd.com/tres.html
it's not something we do here though.
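If you do go the TRES route, the usual mechanism is a QOS-level limit set via sacctmgr. A hypothetical sketch (the QOS name and value are made up, not from this thread):

```
# Cap the total memory a single user's running jobs may consume in the
# "normal" QOS (illustrative names/values only):
sacctmgr modify qos normal set MaxTRESPerUser=mem=256G
```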
Best of luck!
Chris
--
Christopher Samuel
you're tied into using AD (which it sounds like you are) then that's
not really an option for you.
All the best,
Chris
--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
sacctmgr list config | fgrep Purge
cheers,
Chris
--
On 14/10/17 00:24, Doug Meyer wrote:
> The job_table.idb and step_table.idb do not clear as part of day-to-day
> slurmdbd.conf
>
> Have slurmdbd.conf set to purge after 8 weeks but this does not appear
> to be working.
Anything in your slurmdbd logs?
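For reference, the purge knobs live in slurmdbd.conf; an eight-week setup might look something like this (a sketch using the standard slurmdbd.conf(5) options, not the poster's actual file; slurmdbd needs a restart to pick up changes):

```
# slurmdbd.conf: prune old accounting records (~8 weeks)
PurgeJobAfter=56days
PurgeStepAfter=56days
PurgeEventAfter=56days
PurgeSuspendAfter=56days
```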
--
On 05/10/17 11:27, Christopher Samuel wrote:
> PMIX v1.2.2: Slurm complains and tells me it wants v2.
I think that was due to a config issue on the system I was helping out
with, after having to install some extra packages (like a C++ compiler)
to get other things working I can no lon
It's Slurm 16.05.8. Do you see the same?
Did you try both having CR_Pack_Nodes *and* specifying this?
-n 17 --ntasks-per-node=4
cheers,
Chris
--
> tried to stop and
> restart it multiple times but still not working. Please see the error below.
Check your slurmctld.log, that should have hints about why it won't start.
cheers!
Chris
--
The documentation for PMIX in Slurm seems pretty much non-existent. :-(
Anyone had any luck with this?
cheers,
Chris
--
packages for Slurm, I'd always
install it centrally (NFS exported to compute nodes) to keep things
simple. That way you decouple your Slurm version from the OS and can
keep it up to date (or keep it on a known working version).
All the best!
Chris
--
the cluster.
We also have in our taskprolog.sh:
echo export BASH_ENV=/etc/profile.d/module.sh
to try and ensure that bash shells have modules set up, just in case. :-)
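As a sketch, the whole task prolog can be that small (the module.sh path is our local convention, not a Slurm requirement); slurmd applies any `export NAME=value` lines the prolog prints to stdout to the task's environment:

```shell
#!/bin/sh
# Minimal TaskProlog sketch: anything echoed as "export NAME=value"
# is injected into the job task's environment by slurmd.
task_prolog() {
    # ensure non-interactive bash shells source the modules setup
    echo "export BASH_ENV=/etc/profile.d/module.sh"
}
task_prolog
```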
--
lose any running jobs.
The on-disk format for spooled jobs may also change between
releases, so you probably want to keep that in mind as well..
--
to sometime far into the future to have
> effectively an infinite period (no reset)?
Basically this is because once a user exceeds something like their
maximum CPU run time limit then they will never be able to run jobs
again unless you either decay or reset usage.
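Those two escape hatches correspond to slurm.conf parameters; a sketch (the values are illustrative, not a recommendation):

```
# slurm.conf: either let historical usage decay away...
PriorityDecayHalfLife=14-0       # half-life of 14 days
# ...or zero it out on a schedule (NONE disables the reset)
PriorityUsageResetPeriod=MONTHLY
```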
--
On 02/10/17 20:51, Sysadmin CAOS wrote:
> I'm execution my MPI program with "mpirun"... Maybe could be this the
> problem? Do I need to execute with "srun"?
I suspect so, try it and see..
--
'll be off-air for quite a while. Good luck!
All the best,
Chris
--
more information from applicants than what it captures by
default, but that's the nice thing, it is modular.
Also includes Shibboleth support.
All the best!
Chris
--
e and assign/change their
target.
All the best,
Chris
--
ensure that they can run jobs, but that's a separate issue to whether
slurmdbd can resolve users in LDAP.
I would hope that Bright would have the ability to do that for you
rather than having you handle it manually, but that's a question for Bright.
Best of luck,
Chris
--
n into this container. Setting the Contain flag
implicitly sets the Alloc flag.
--
ThreadsPerCore configured.
cheers!
Chris
--
A couple of questions:
1) Have you restarted slurmctld and slurmd everywhere?
2) Can you confirm that slurm.conf is the same everywhere?
3) What does slurmd -C report?
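For question 2, one low-tech check is to gather each node's slurm.conf into one directory (e.g. via scp) and compare checksums; a sketch, with made-up helper names:

```shell
# Compare two collected copies of slurm.conf.
configs_match() {
    cmp -s "$1" "$2"        # exit 0 iff the files are byte-identical
}

# How many distinct versions of the file exist among the copies?
count_versions() {
    md5sum "$@" | awk '{print $1}' | sort -u | wc -l
}
```

If count_versions reports anything other than 1 across the collected copies, slurm.conf has drifted somewhere.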
cheers!
Chris
--
about the
actual hardware layout.
What does "lscpu" say?
cheers,
Chris
--
ent is allowed to decode it.
So if the UIDs and GIDs of the user differ across systems then it
appears it will not allow the receiver to validate the message.
cheers,
Chris
--
pute bound the usual advice
is to disable HT in the BIOS, but for I/O bound things you may not be so
badly off.
Hope that helps!
Chris
--
ach HT unit a core to run a job on.
All the best,
Chris
--
We constrain jobs via cgroups and have found that using the cgroup
plugin for this results in jobs not getting killed incorrectly.
Using cgroups in Slurm is a definite win for us, so I would suggest
looking into it if you've not already done so.
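For anyone following along, cgroup-based enforcement is switched on with settings along these lines (a sketch of the standard options, not our exact config):

```
# slurm.conf
ProcTrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```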
All the best,
Chris
--
sacct -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
cheers,
Chris
--
hen put serial jobs at
the end of the available nodes rather than using a best fit
algorithm. This may reduce resource fragmentation for some workloads.
cheers,
Chris
--
s expiry parameters), and removing them will likely break its
statistics and probably do Bad Things(tm).
Here be dragons..
--
you are running in
for your SSH session and not the job!
cheers,
Chris
--
On 14/08/17 08:55, Lachlan Musicman wrote:
> Was it here I read that proctrack/linuxproc was better than
> proctrack/cgroup?
I think you're thinking of JobAcctGatherType, but even then our
experience there was that jobacct_gather/cgroup was more accurate.
--
own for that.
cheers,
Chris
--
On 07/08/17 14:08, Lachlan Musicman wrote:
> In slurm.conf, there is a RebootProgram - does this need to be a direct
> link to a bin or can it be a command?
We have:
RebootProgram = /sbin/reboot
Works for us.
cheers,
Chris
--
limits.html
Best of luck!
Chris
--
R} format=MaxJobsPerUser
For a more general view you would do:
sacctmgr list user ${USER} withassoc
Hope this helps,
Chris
--
On 06/06/17 23:46, Edward Walter wrote:
> Doesn't that functionality come from a spank plugin?
> https://github.com/hautreux/slurm-spank-x11
Yes, that's the one we use. Works nicely.
Provides the --x11 option for srun.
All the best,
Chris
--
e time.
All the best,
Chris
--
sinfo --format="%60N %.15G %.30E %.10A"
The reason can be quite long, but there doesn't seem to be a way to just
show the status as down/drain/idle/etc.
cheers,
Chris
--
MPI launchers (and other naughtiness).
Good luck!
Chris
--
by
> a job have finished at completion?
Are you not using cgroups for enforcement?
Usually that picks everything up.
cheers,
Chris
--
e understand what might be wrong?
Anything setting a drain state is meant to also set a reason; what does
"scontrol show node $NODE" say for these?
Also are there any relevant messages in your slurmctld and slurmd logs?
Best of luck,
Chris
--
a node and
then Slurm isn't going to put more jobs there (unless you tell it to
ignore memory, which is not likely to end well).
All the best,
Chris
--
age-Cluster
All the best,
Chris
--
so we fell back to using our own LDAP server with Karaage to manage
project/account applications, adding people to slurmdbd, etc.
All the best,
Chris
--
ble again.
+1 for running your own LDAP.
I would seriously look at a cluster toolkit for running nodes,
especially if it supports making a single image that your compute nodes
then netboot. That way you know everything is consistent.
Best of luck,
Chris
--
0 NodeAddr=thing-knc[01-03]
RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2
You'll also need to restart slurmctld & all slurmd's to pick up
this new config, I don't think "scontrol reconfigure" will deal
with this.
Best of luck,
Chris
--
A 'grep' of the source code after reading 'man sacct' and not
finding anything (also running 'sacct -e' and not seeing anything useful
there either) doesn't offer much hope.
Anyone else dealing with this?
We're on 16.05.x at the moment with slurmdbd.
All the best
save/job.830332/environment, No such
> file or directory
I would suggest that you are looking at transient NFS failures (which
may not be logged).
Are you using NFSv3 or v4 to talk to the NFS server and what are the
OS's you are using for both?
cheers,
Chris
--
All the best,
Chris
--
't really
blame Slurm for not catering to this. It can use cgroups to partition
cores to jobs precisely so it doesn't need to care what the load average
is - it knows the kernel is ensuring the cores the jobs want are not
being stomped on by other tasks.
Best of luck!
Chris
--
steps with srun you can
also monitor them as the job is going with 'sstat' (rather than just
post-mortem with sacct).
All the best,
Chris
--
be useful to us here too.
All the best,
Chris
--
systems having no more than 2500 nodes
or the cube root for larger systems. The value may not exceed
65533.
If so then I suspect that this is a possible transient DNS failure?
All the best,
Chris
--
Torque+Moab/Maui here and at VPAC
before that - we would always start Moab paused so we could check out
what impact any changes had to our queues & priorities before starting
jobs running.
Measure twice, cut once.
cheers!
Chris
--
ck up again.
All the best,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
cheers,
Chris
--
t of
architectures) individually.
Best of luck!
Chris
--
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
[...]
Best of luck,
Chris
--
ay. :-(
Might be a feature request..
cheers,
Chris
--
's been because the slurmdbd cannot connect
back to slurmctld to send RPCs on the IP address that slurmctld has
registered with slurmdbd.
What does this say?
sacctmgr list clusters
cheers,
Chris
--
area is a high-performance
parallel filesystem shared across all nodes).
https://github.com/vlsci/spank-private-tmp
All the best,
Chris
--
me to their registered email
address that's stored in LDAP.
cheers,
Chris
--
On 10/01/17 18:56, Ole Holm Nielsen wrote:
> For the record: Torque will always send mail if a job is aborted
It's been a few years since I've used Torque so I don't remember that
behaviour.
Thanks for the info!
--
All the best,
Chris
--
On 10/01/17 10:57, Christopher Samuel wrote:
> If you are unlucky enough to have SSH based job launchers then you would
> also look at the BYU contributed pam_slurm_adopt
Actually this is useful even without that as it allows users to SSH into
a node they have a job on and not disturb the
into. You do need PrologFlags=contain for that to ensure that all
jobs get an "extern" batch step on job creation for these processes to
be adopted into.
We use both here with great success.
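A sketch of the two pieces involved (the PAM file path varies by distro, and "required" vs "sufficient" is a local policy choice):

```
# slurm.conf: give every job an "extern" step to adopt SSH sessions into
PrologFlags=contain

# /etc/pam.d/sshd (or your distro's equivalent): deny SSH to nodes where
# the user has no job, and adopt their session into their job's cgroup
account    required    pam_slurm_adopt.so
```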
All the best,
Chris
--
rm?
All the best,
Chris
--
puts between tasks.
OK, I'm not sure how Slurm will behave with multiple srun's and cons_res
and CR_LLN but it's still worth a shot.
Best of luck!
Chris
--
I strongly believe that will be necessary, sorry!
Best of luck,
Chris
--
ml
Hope this helps!
All the best,
Chris
--
cause of this issue (from memory).
--
JobAcctGatherType=jobacct_gather/cgroup
If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but from highly
unreliable memory had to revert due to similar issues to those you are
seeing).
Best of luck,
Chris
--
ources.
All the best,
Chris
--
Otherwise you're at the mercy of what your mpiexec chooses to do.
--
ng.
cheers,
Chris
--
On 17/11/16 11:31, Christopher Samuel wrote:
> It depends on the library used to pass options,
Oops - that should be parse, not pass.
Need more caffeine..
--
but apparently with Slurm it's not - just tested it out and using:
--gres mic
results in my job being scheduled on a Phi node with OFFLOAD_DEVICES=0
set in its environment.
All the best,
Chris
--
Having private containers is on the roadmap for Shifter.
Shifter also integrates with Slurm.
All the best!
Chris
--
ic:1 Reservation=(null)
All the best,
Chris
--
is that the batch step is of course
only on the first node, but it says it was allocated 2 GRES.
I suspect that's just a symptom of Slurm only keeping a total
number.
I don't think Slurm can give you an uneven GRES allocation, but
the SchedMD folks would need to confirm that, I'm afraid.
On 09/11/16 09:50, Lachlan Musicman wrote:
> I don't know Chris, I think that /dev/null would rate tbh. :)
Ah, but that's a file (OK character special device), not a directory. ;-)
--
cpu=6,mem=4G,node=1mic:1
6449483.extern extern cpu=6,mem=4G,node=1mic:1
All the best,
Chris
--
any period of time that information
will be lost.
We build from source and use:
StateSaveLocation = /var/spool/slurm/jobs
but the decision is yours where exactly to put it.
But /tmp is almost certainly the second worst place (after /dev/shm).
All the best,
Chris
--
information in a partition-oriented format. This is ignored if
the --format option is specified.
Except it's not being ignored when you use --format (-o).
All the best,
Chris
--
On 02/11/16 02:01, Riebs, Andy wrote:
> Interesting -- thanks for the info Chris.
No worries, it's a bit sad I think, but I can understand it.
--
contact them directly.
All the best,
Chris
--
All the best,
Chris
--
On 28/10/16 08:44, Lachlan Musicman wrote:
> So I checked the system, noticed that one node was drained, resumed it.
> Then I tried both
>
> scontrol requeue 230591
> scontrol resume 230591
What happens if you "scontrol hold" it first before "scontrol release"?
> I guess you will have to query the
> cgroup hierarchy.
No need, I'm just trying to automate the detection of bad jobs which are
spanning nodes but not using the cores on the other nodes and I wanted a
way to quantify how many cores were being wasted by the job.
Thanks again!
Chris
--
re already running
disparate number of jobs using variable cores, how do I see what cores
on what nodes Slurm has allocated my running job?
I know I can go and poke around with cgroups, but is there a way to get
that out of squeue, sstat or sacct?
All the best,
Chris
--
Check that connections aren't
getting blocked, and also check that the hostname correctly resolves.
--
I suspect that's what's triggering the different display in sreport, a
line per association/partition.
--
Login Proper Name Used Energy
- --- - ---
avoca vlscisamuel Christopher Sa+ 151030
--
ned up.
--
s as well and if they're out of step, well, then GPFS will stop
working on the node, making Slurm the least of your worries. :-)
So just run ntpd.
All the best,
Chris
--
LDAP lookup to rewrite users' email to the value in
LDAP)
But really this isn't a Slurm issue, it's a host config issue for Postfix.
All the best,
Chris
--
On 29/09/16 01:16, John DeSantis wrote:
> We get the same snippet when our logrotate takes action against the
> cltdlog:
Does your slurmctld restart then too?
--
On 28/09/16 16:25, Barbara Krasovec wrote:
> Yes, this worked! Thank you very much for your help!
My pleasure!
--
On 26/09/16 16:51, Lachlan Musicman wrote:
> Does this mean that it's now considered acceptable to run cgroups for
> ProcTrackType?
We've been running with that on all our x86 clusters since we switched
to Slurm, haven't seen an issue yet.
All the best,
Chris
--
of offline nodes and make a script to restore via scontrol
2) shutdown slurmctld and all slurmds
3) move the node_stat* files out of the way
4) start up slurmd again
5) start up slurmctld
6) run the script created at step 1
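Step 1 can be scripted by turning `sinfo -R`-style output into scontrol commands; a sketch (the pipe-separated capture format and the node names in the test data are assumptions for illustration):

```shell
# Read lines of "NODELIST|REASON" (e.g. captured beforehand with
# `sinfo -R -h -o "%N|%E"`) and emit the commands to resume those nodes.
make_restore_script() {
    awk -F'|' '{printf "scontrol update NodeName=%s State=RESUME\n", $1}'
}
```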
Hope that helps!
All the best,
Chris
--
question - you've got the shutdown log from slurmctld and the
start log of a slurmd - what happens when slurmctld starts up?
That might be your clue about why your jobs are getting killed.
--