[slurm-dev] Re: CGroups

2016-09-27 Thread Christopher Samuel
On 26/09/16 16:51, Lachlan Musicman wrote: > Does this mean that it's now considered acceptable to run cgroups for > ProcTrackType? We've been running with that on all our x86 clusters since we switched to Slurm, haven't seen an issue yet. All the best, Chris -- Christopher Samuel

[slurm-dev] Job Pack with Slurm

2016-09-27 Thread Aaron Young
Hi, I am setting up a heterogeneous cluster for a graduate class project and I am using Slurm as the resource manager. The cluster consists of 64 Raspberry Pi 3s, 32 pine64s, and 12 Nvidia TX1s. I have gotten Slurm to run across the different nodes with each architecture type set up as a

[slurm-dev] Re: Invalid Protocol Version

2016-09-27 Thread Christopher Samuel
On 27/09/16 23:54, Barbara Krasovec wrote: > The version of the client and server is the same. I guess the problem is > in the slurmctld state file, where the slurm protocol version of some > worker nodes must be wrong. I suspect this is bug 3050 - we hit it for frontend nodes on BlueGene/Q and

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
On 26/09/16 17:48, Philippe wrote: > [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) received So that's some external process sending one of those two signals to slurmctld, it's not something it's choosing to do at all. We've never seen this. One other question - you've got the

[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Lachlan Musicman
I am surprised how hard I found it to find these as well - especially given how frequently the question is asked. This mob have made one, and it looks good, but all development has happened on .deb systems, and I didn't have sufficient time (or skill) to unpack and repack for rpm or generic.

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
On 26/09/16 17:48, Philippe wrote: > [2016-09-26T08:01:44.792] debug: slurmdbd: Issue with call > DBD_CLUSTER_CPUS(1407): 4294967295(This cluster hasn't been added to > accounting yet) Not related - but it looks like whilst it's been told to talk to slurmdbd you haven't added the cluster to

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Christopher Samuel
On 27/09/16 17:40, Philippe wrote: > /usr/sbin/invoke-rc.d --quiet slurm-llnl reconfig >/dev/null I think you want to check whether that's really restarting it or just doing an "scontrol reconfigure" which won't (shouldn't) restart it. -- Christopher Samuel Senior Systems

[slurm-dev] Invalid Protocol Version

2016-09-27 Thread Barbara Krasovec
Hi! I upgraded slurm from 15.08.5 to 16.05.2 and get errors on worker nodes: [2016-09-27T11:56:38.881] error: Invalid Protocol Version ... [2016-09-27T11:56:38.881] error: slurm_receive_msg_and_forward: Incompatible versions of client and server code [2016-09-27T11:56:38.891] error:

[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Paul Edmon
Nice. I might recommend grafana. There is a nice dashboard for that here: http://giovannitorres.me/graphing-sdiag-with-graphite.html Then there are other diamond collectors you can use to gather other statistics: https://github.com/fasrc/slurm-diamond-collector Grafana is pretty flexible

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread John DeSantis
Philippe, > If I can't use logrotate, what must I use ? I just disabled it, I'm > gonna see if the problem still persist. You can use logrotate. I'd suggest using a much larger "size" though. For example, we don't rotate the logs until at least

[slurm-dev] Slurm web dashboards

2016-09-27 Thread John Hearns
Hello all. What are the thoughts on a Slurm 'dashboard'? The purpose is to display cluster status on a large-screen monitor. I rather liked the look of this, based on dashing.io: https://github.com/julcollas/dashing-slurm/blob/master/README.md Sadly dashing.io is not being supported, and

[slurm-dev] Re: scontrol: update multiple jobs?

2016-09-27 Thread Pancorbo, Juan
Hi, we are running 15.08.12 and it is working for me. However, I got that error message when I forgot to specify which cluster I wanted to send the scontrol command to, and it ended up on another cluster with a different set of valid job IDs. Of course, if the jobs have finished their run you would get the

[slurm-dev] Re: Backfill scheduler should look at all jobs

2016-09-27 Thread Ulf Markwardt
> I'd recommend taking a look at bf_min_prio_resv (16.05 feature). I did this: it really sped up the backfill scheduler. Thank you, Ulf -- ___ Dr. Ulf Markwardt Technische Universität Dresden Center for Information Services and

[slurm-dev] Re: scontrol: update multiple jobs?

2016-09-27 Thread Maciej Pawlik
Hello, I can confirm that updating multiple jobs works in 16.05.4 (as documented). As a side note, it's quite handy; we could use similar functionality across all scontrol update/show commands. Currently it is possible to show multiple nodes, update them, but it's only possible to update multiple

[slurm-dev] scontrol: update multiple jobs?

2016-09-27 Thread Loris Bennett
Hi, The update jobs section of the manpage for scontrol 15.08.8 says JobId= Identify the job(s) to be updated. The job_list may be a comma separated list of job IDs. However, trying this, I get the following error: $ scontrol update jobid=1135541,1135542 timelimit=+1:00:00
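
If the comma-separated form is rejected on a given installation, one possible fallback is to issue a separate `scontrol update` per job. The following is only a sketch: it assumes the single-job form `scontrol update jobid=<id> timelimit=<delta>` shown in the message above is accepted, the helper name `build_update_commands` is hypothetical, and the script prints the commands rather than running them.

```python
import shlex


def build_update_commands(job_ids, time_delta="+1:00:00"):
    """Return one scontrol argument list per job ID (hypothetical helper).

    Each list has the same shape as the single-job command quoted in the
    message: scontrol update jobid=<id> timelimit=<delta>.
    """
    return [
        ["scontrol", "update", f"jobid={job_id}", f"timelimit={time_delta}"]
        for job_id in job_ids
    ]


if __name__ == "__main__":
    # Dry run: print each command so it can be reviewed before being executed.
    for argv in build_update_commands([1135541, 1135542]):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check that the job IDs belong to the intended cluster before anything is changed.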

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread Philippe
Hi John, thanks for the reply :) Yes I've got logrotate enabled for my slurm : /var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log /var/log/slurm/slurmdbd.log { compress missingok nocopytruncate nocreate nodelaycompress nomail notifempty noolddir rotate 12 sharedscripts