On 26/09/16 16:51, Lachlan Musicman wrote:
> Does this mean that it's now considered acceptable to run cgroups for
> ProcTrackType?
We've been running with that on all our x86 clusters since we switched
to Slurm, haven't seen an issue yet.
All the best,
Chris
--
Christopher Samuel
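To see which process-tracking plugin a cluster is actually using, one option is to filter the output of "scontrol show config". The sketch below assumes the usual "Key = value" layout of that output. The helper name proctrack_type is hypothetical and not part of Slurm. It also accepts a slurm.conf line such as ProcTrackType=proctrack/cgroup.

```shell
# Hypothetical helper (not part of Slurm): print the configured
# ProcTrackType plugin from "scontrol show config" output or from
# slurm.conf text supplied on stdin.
proctrack_type() {
    # Split on "=" plus any following spaces; the value is field 2.
    awk -F'= *' '/^ProcTrackType/ {print $2}'
}
```

On a node with the Slurm client tools installed, this might be used as "scontrol show config | proctrack_type".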
Hi,
I am setting up a heterogeneous cluster for a graduate class project and
I am using Slurm as the resource manager. The cluster consists of 64
Raspberry Pi 3s, 32 Pine64s, and 12 Nvidia TX1s. I have gotten Slurm to
run across the different nodes, with each architecture type set up as a
On 27/09/16 23:54, Barbara Krasovec wrote:
> The version of the client and server is the same. I guess the problem is
> in the slurmctld state file, where the slurm protocol version of some
> worker nodes must be wrong.
I suspect this is bug 3050 - we hit it for frontend nodes on BlueGene/Q
and
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received
So that's some external process sending one of those two signals to
slurmctld, it's not something it's choosing to do at all. We've never
seen this.
One other question - you've got the
I am surprised how hard I found it to find these as well - especially given
how frequently the question is asked.
This mob have made one, and it looks good, but all development has happened
on .deb systems, and I didn't have sufficient time (or skill) to unpack and
repack for rpm or generic.
On 26/09/16 17:48, Philippe wrote:
> [2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call
> DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to
> accounting yet)
Not related - but it looks like whilst it's been told to talk to
slurmdbd you haven't added the cluster to
On 27/09/16 17:40, Philippe wrote:
> /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null
I think you want to check whether that's really restarting it or just
doing an "scontrol reconfigure" which won't (shouldn't) restart it.
--
Christopher Samuel
Senior Systems
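One way to tell a restart from a plain reconfigure is to compare the daemon's PID before and after the command runs. This is a minimal sketch, assuming pgrep from procps is available. The helper names pid_of and check_restart are hypothetical and not part of Slurm.

```shell
# Hypothetical helpers, not part of Slurm.
pid_of() {
    # PID of the oldest process whose name matches exactly.
    pgrep -o -x "$1"
}

check_restart() {
    daemon=$1
    shift
    before=$(pid_of "$daemon")
    "$@"                        # run the command under test
    after=$(pid_of "$daemon")
    if [ "$before" = "$after" ]; then
        echo "$daemon kept PID $before: not restarted"
    else
        echo "$daemon PID changed ($before -> $after): restarted"
    fi
}
```

For the cron line quoted above, this could be run as "check_restart slurmctld /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig". If the daemon is not running at all, both PIDs are empty and the helper reports no restart, so check that case separately.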
Hi!
I upgraded slurm from 15.08.5 to 16.05.2 and get errors on worker nodes:
[2016-09-27T11:56:38.881] error: Invalid Protocol Version ...
[2016-09-27T11:56:38.881] error: slurm_receive_msg_and_forward:
Incompatible versions of client and server code
[2016-09-27T11:56:38.891] error:
Nice. I might recommend grafana. There is a nice dashboard for that here:
http://giovannitorres.me/graphing-sdiag-with-graphite.html
Then there are other diamond collectors you can use to gather other
statistics:
https://github.com/fasrc/slurm-diamond-collector
Grafana is pretty flexible
Philippe,
> If I can't use logrotate, what must I use ? I just disabled it, I'm
> gonna see if the problem still persist.
You can use logrotate. I'd suggest using a much larger "size" though.
For example, we don't rotate the logs until at least
Hello all. What are the thoughts on a Slurm 'dashboard'? The purpose is to
display cluster status on a large-screen monitor.
I rather liked the look of this, based on dashing.io
https://github.com/julcollas/dashing-slurm/blob/master/README.md
Sadly dashing.io is not being supported, and
Hi,
we are running 15.08.12 and it is working for me. However, I got that error
message when I forgot to specify which cluster I wanted to send the scontrol
command to, and it ended up on another cluster with a different set of valid job IDs.
Of course, if the jobs have already finished running, you would get the
> I'd recommend taking a look at bf_min_prio_resv (16.05 feature).
I did this: it really sped up the backfill scheduler.
Thank you,
Ulf
--
Dr. Ulf Markwardt
Technische Universität Dresden
Center for Information Services and
Hello,
I can confirm that updating multiple jobs works in 16.05.4 (as
documented). As a side note, it's quite handy; we could use similar
functionality across all scontrol update/show commands. Currently it is
possible to show multiple nodes, update them, but it's only possible to
update multiple
Hi,
The update jobs section of the manpage for scontrol 15.08.8 says
JobId=<job_list>
Identify the job(s) to be updated. The job_list may be a comma
separated list of job IDs.
However, trying this, I get the following error:
$ scontrol update jobid=1135541,1135542 timelimit=+1:00:00
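On a version where the comma-separated form is rejected, a possible workaround is to issue one "scontrol update" per job ID. This is a minimal sketch. The update_jobs helper and the SCONTROL override variable are hypothetical names introduced here, not part of Slurm.

```shell
# Hypothetical helper: apply one "scontrol update" per job ID.
# SCONTROL can be overridden (e.g. SCONTROL=echo) for a dry run.
update_jobs() {
    ids=$1
    shift
    # Split the comma-separated list into individual job IDs.
    for id in $(printf '%s' "$ids" | tr ',' ' '); do
        ${SCONTROL:-scontrol} update jobid="$id" "$@" || return 1
    done
}
```

For the example above, this would be called as "update_jobs 1135541,1135542 timelimit=+1:00:00". Setting SCONTROL=echo first prints the commands instead of running them.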
Hi John,
thanks for the reply :)
Yes, I've got logrotate enabled for Slurm:
/var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log
/var/log/slurm/slurmdbd.log {
compress
missingok
nocopytruncate
nocreate
nodelaycompress
nomail
notifempty
noolddir
rotate 12
sharedscripts