We're using Icinga 2, storing accounting data in InfluxDB for Grafana
dashboards. In terms of monitoring I prefer to check end-user functionality,
so apart from services we also have a plugin that submits a job to the
cluster (to idle nodes, with a deadline of a few minutes); the job simply
creates files on shared
> On Jan 18, 2018, at 4:34 PM, Lachlan Musicman wrote:
>
> On 19 January 2018 at 07:29, Ryan Novosielski wrote:
> Hi all,
>
> Looked back at the mailing list to see if there was a question about this
> already. There was some mention of /using/ Nagios, but no real mention of
> specifics. What
We're moving to Prometheus for lots of our monitoring functions. We've got
Nagios and Ganglia in place, but Prometheus and Grafana make a really nice
combo for monitoring and alerting.
There's even an exporter for Slurm,
https://github.com/vpenso/prometheus-slurm-exporter, that includes node
data
On 19 January 2018 at 07:29, Ryan Novosielski wrote:
> Hi all,
>
> Looked back at the mailing list to see if there was a question about this
> already. There was some mention of /using/ Nagios, but no real mention of
> specifics. What do people monitor with Nagios? We monitor, so far,
> slurmctld
Hi all,
Looked back at the mailing list to see if there was a question about this
already. There was some mention of /using/ Nagios, but no real mention of
specifics. What do people monitor with Nagios? We monitor, so far, slurmctld,
slurmdbd, and MySQL, but there are probably some others. Migh
Hello Arielle,
I don't have a full answer, but here is a start:
Yes, you first need at least
"AccountingStorageEnforce=associations,limits" (and qos if you want to
use it) so that the limits you set are enforced (see
https://slurm.schedmd.com/resource_limits.html).
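
As a quick sanity check on that first step, here is a minimal Python sketch
that reads the text of a slurm.conf and reports which of the required
enforce flags are missing. The function names and the sample config text
(including the cluster name "demo") are my own illustration, not anything
shipped with Slurm, and the parser assumes the setting sits on its own line.

```python
def enforce_flags(conf_text):
    """Return the set of AccountingStorageEnforce flags found in slurm.conf text."""
    flags = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Compare keys case-insensitively; a later occurrence overrides an earlier one.
        if key.strip().lower() == "accountingstorageenforce":
            flags = {f.strip().lower() for f in value.split(",") if f.strip()}
    return flags


def missing_flags(conf_text, required=("associations", "limits")):
    """List the required flags that are absent from the config text."""
    found = enforce_flags(conf_text)
    return [f for f in required if f not in found]


# Hypothetical sample config, for illustration only.
sample = "ClusterName=demo\nAccountingStorageEnforce=associations,limits,qos\n"
print(missing_flags(sample))
```

An empty list means both flags are present; add "qos" to `required` if you
plan to use QOS limits as well.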
Then you can set limits fo
Hi,
Slurm is installed in a minimal configuration on a cluster of
3000 cores/170 nodes. We have 4 partitions, one for each type of node;
each partition is available to all users.
We want to prevent each user from using more than 1000 cores, spread over
at most 50 jobs, across the whole cluster, and I'
So EasyBuild + Lmod seems to be the best solution. I'll try it. :)
Thank you all!
betta
2018-01-17 17:53 GMT+01:00 Christopher Samuel :
> On 18/01/18 03:50, Patrick Goetz wrote:
>
> Can anyone shed some light on the situation? I'm very surprised that
>> a module script isn't just an explicit command that
>Nadav Toledo writes:
>
>> Nadav Toledo writes:
>>
>> Hey everyone,
>>
>> We've just set up a Slurm cluster with a few nodes, each with 16 cores.
>> Is it possible to submit a job for 17 cores or more?
>> If not, is there a workaround?
>>
>> Thanks in advance, Nadav
>>
>>
>> It should be possible. H