[slurm-users] Verbose mode of the 'accel-bind' does not work.

2019-11-26 Thread Uemoto, Tomoki
Hi all,
OS Version: RHEL 7.6
SLURM Version: slurm 18.08.6
I defined the gpu resource as follows:
[test@ohpc137pbsop-c001 ~]$ scontrol show config |grep TaskPlugin
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
[test@ohpc137pbsop-c001 ~]$
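For context, a minimal sketch of the kind of GRES definition and srun invocation the subject line refers to, assuming one NVIDIA GPU per node; node name, device path and binary are purely illustrative:

    # slurm.conf (illustrative)
    GresTypes=gpu
    NodeName=c001 Gres=gpu:1 ...

    # gres.conf on the compute node (illustrative)
    Name=gpu File=/dev/nvidia0

    # request the GPU and ask for verbose accelerator binding ('g' = bind tasks to nearby GPUs, 'v' = verbose)
    srun --gres=gpu:1 --accel-bind=gv ./my_app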

Re: [slurm-users] Filter slurm e-mail notification

2019-11-26 Thread Brian Andrus
I guess you need to decide how to approach it. If you can't educate users on how to use the --mail options appropriately, then you have to assume they will abuse them. In that situation, you need to configure your mail server itself to rate-limit or something similar. That approach depends on the mail

Re: [slurm-users] slurm reporting

2019-11-26 Thread Mark Hahn
> Would Grafana do a similar job as XDMoD? I was wondering whether to pipe up. I work for ComputeCanada, which runs a number of significant clusters. During a major upgrade a few years ago, we looked at XDMoD and decided against it, primarily because we wanted greater flexibility - we have

Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
Once you had added enough to ingest the Slurm logs into Influx or whatever, it could be similar. XDMoD already has the pieces in place to dig through your hierarchy of PIs, users, etc., plus some built-in queries for correlating job size to wait time, for example:

Re: [slurm-users] slurm reporting

2019-11-26 Thread Ricardo Gregorio
Mike, it sounds interesting... In fact I had come across XDMoD this morning while "searching" for further info... Would Grafana do a similar job as XDMoD?

Re: [slurm-users] slurm reporting

2019-11-26 Thread Renfro, Michael
> • Total number of jobs submitted by user (daily/weekly/monthly)
> • Average queue time per user (daily/weekly/monthly)
> • Average job run time per user (daily/weekly/monthly)
Open XDMoD for these three: https://github.com/ubccr/xdmod , plus https://xdmod.ccr.buffalo.edu

[slurm-users] slurm reporting

2019-11-26 Thread Ricardo Gregorio
Hi all, I am new to both HPC and SLURM. I have been trying to run some usage reports (using sreport and sacct), but I cannot find a way to get the following info:
* Total number of jobs submitted by user (daily/weekly/monthly)
* Average queue time per user (daily/weekly/monthly)
* Average job run time per user (daily/weekly/monthly)
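For reference, a rough sketch of how these figures can be pulled out of sacct from the shell, assuming a date range and a user name of 'alice' (both purely illustrative); queue time per job is Start minus Submit, and the averaging is left to whatever post-processing you prefer:

    # number of jobs submitted by a user in a date range (allocations only, no header)
    sacct -u alice -S 2019-11-01 -E 2019-11-30 -X -n -o JobID | wc -l

    # submit, start and elapsed times for the same jobs, pipe-separated for easy parsing
    sacct -u alice -S 2019-11-01 -E 2019-11-30 -X -n --parsable2 -o JobID,Submit,Start,Elapsed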

[slurm-users] scontrol not removing DRAINED

2019-11-26 Thread Rick Van Conant
If a node is marked as DOWN after it has been DRAINED, why is the node still showing DRAINED instead of DOWN? Rick Van Conant Systems Administrator SCD/SCF/HPC Fermi National Accelerator Laboratory 630-840-8747 office www.fnal.gov
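For reference, a few illustrative commands for checking and changing a node's state (the node name is a placeholder):

    # show the node's current state and the recorded reason
    scontrol show node node001
    sinfo -R -n node001

    # explicitly mark the node down, or return it to service
    scontrol update NodeName=node001 State=DOWN Reason="hardware check"
    scontrol update NodeName=node001 State=RESUME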

Re: [slurm-users] good practices

2019-11-26 Thread Eli V
Inline below On Tue, Nov 26, 2019 at 5:50 AM Loris Bennett wrote: > > Hi Nigella, > > Nigella Sanders writes: > > > Thank you all for such interesting replies. > > > > The --dependency option is quite useful but in practice it has some > > inconveniences. Firstly, all 20 jobs are instantly

Re: [slurm-users] good practices

2019-11-26 Thread Loris Bennett
Hi Nigella, Nigella Sanders writes: > Thank you all for such interesting replies. > > The --dependency option is quite useful but in practice it has some > inconveniences. Firstly, all 20 jobs are instantly queued, which some > users may interpret as abusive use of common resources.

Re: [slurm-users] [External] Re: Filter slurm e-mail notification

2019-11-26 Thread Florian Zillner
Hi, I guess you could use a lua script to filter out flags you don't want. I haven't tried it with mail flags, but I'm using a script like the one referenced to enforce accounts/time limits, etc. https://funinit.wordpress.com/2018/06/07/how-to-use-job_submit_lua-with-slurm/ Cheers, Florian
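For reference, the job_submit/lua plugin is enabled in slurm.conf and picks up a script placed next to that file; a minimal sketch, with the usual default paths (adjust to your install):

    # slurm.conf
    JobSubmitPlugins=lua

    # the script itself lives alongside slurm.conf, e.g. /etc/slurm/job_submit.lua;
    # after editing, tell the controller to re-read its configuration
    scontrol reconfigure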

Re: [slurm-users] good practices

2019-11-26 Thread Nigella Sanders
Thank you all for such interesting replies. The --dependency option is quite useful but in practice it has some inconveniences. Firstly, all 20 jobs are *instantly queued*, which some users may interpret as abusive use of common resources. Even worse, if a job fails, the remaining ones will stay
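For reference, one common way to soften the second problem is sketched below, assuming a job script called job.sh and a chain of 20 steps (both illustrative): each step depends on the previous one and is removed automatically if its dependency can never be satisfied.

    # submit 20 chained steps; a failed step takes its dependents out of the queue
    jobid=$(sbatch --parsable job.sh)
    for i in $(seq 2 20); do
        jobid=$(sbatch --parsable --dependency=afterok:${jobid} --kill-on-invalid-dep=yes job.sh)
    done

To avoid having all 20 jobs sitting in the queue at once, another option is to have each job script sbatch the next step as its last action, so only one job is ever queued at a time.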