Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Christopher Benjamin Coffey
Awesome, thanks, I didn't know about that "scontrol -o show assoc_mgr" command!
Thanks guys!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 6/23/22, 10:22 AM, "slurm-users on behalf of Miguel Oliveira" 
 
wrote:

Hi Chris,
We use a python wrapper to do this but the basic command to retrieved 
account minutes is:

'scontrol -o show assoc_mgr | grep "^QOS='+account+'"'

You then have to parse the output for "GrpTRESMins=". The output will be
two numbers: the first is the limit (or N for no limit), while the one in
parentheses is the amount consumed.
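
A minimal shell sketch of that parsing ("myproj" is a placeholder account/QOS
name, and the exact field layout can vary a bit between Slurm versions):

    scontrol -o show assoc_mgr | grep "^QOS=myproj" \
      | grep -o 'GrpTRESMins=[^ ]*' \
      | grep -o 'cpu=[^,]*'
    # prints e.g. cpu=600(123)  ->  limit of 600 minutes, 123 minutes consumed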

You can also report by user with:

'sreport -t minutes -T cpu,gres/gpu -nP cluster AccountUtilizationByUser 
start='+date_start+' end='+date_end+' account='+account+' format=login,used'

If you are willing to accept some rounding errors!

With slight variations, and some oddities, this can also be used to limit
GPU utilisation, as it is in our case, as you can deduce from the previous command.
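
For the GPU side the QOS limit just gains a gres/gpu term, along these lines
(account name and numbers are only illustrative, and gres/gpu has to be listed
in AccountingStorageTRES for the minutes to be tracked):

    sacctmgr -i modify qos where name=myproj set GrpTRESMins=cpu=600,gres/gpu=100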

Best,

Miguel Afonso Oliveira





On 23 Jun 2022, at 17:58, Christopher Benjamin Coffey 
 wrote:

Hi Miguel, 

This is intriguing, as I didn't know about this possibility of dealing with
fairshare and a minutes-limited QOS at the same time. How can you
verify how many minutes have been used of a QOS that has been set up with
GrpTRESMins? Is that possible? Thanks.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 6/23/22, 9:44 AM, "slurm-users on behalf of Miguel Oliveira" 
 
wrote:

Hi Gérard,
It is not exactly true that you have no solution to limit projects. If you
implement each project as an account, then you can create an account QOS with
the NoDecay flag.
This will not affect associations, so priority and fair share are not
impacted.

The way we do it is to create a qos:

sacctmgr -i --quiet create qos "{{ item.account }}" set 
flags=DenyOnLimit,NoDecay GrpTRESMin=cpu=600


And then use this qos when the account (project) is created:

sacctmgr -i --quiet add account "{{ item.account }}" Parent="{{ item.parent 
}}" QOS="{{ item.account }}" Fairshare=1 Description="{{ item.description }}"

We even have a slurm bank implementation to play along with this technique 
and it has not failed us yet too much! :)

Hope that helps,

Miguel Afonso Oliveira



On 23 Jun 2022, at 14:57, gerard@cines.fr wrote:

Hi Ole and B/H,

Thanks for your answers.



You're right, B/H: since I tuned the TRESBillingWeights option to only count
CPUs, in my case the number of reserved cores = "TRES billing cost".
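
In slurm.conf terms that weighting looks roughly like this (the partition name
is just a placeholder):

    PartitionName=mypart Nodes=... TRESBillingWeights="CPU=1.0"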

You're right again: I forgot the PriorityDecayHalfLife parameter, which is
also used by the fairshare multifactor priority.
We use multifactor priority to manage the priority of jobs in the queue,
and we set the values of PriorityDecayHalfLife and PriorityUsageResetPeriod
according to those needs.
So PriorityDecayHalfLife will decay GrpTRESRaw, and GrpTRESMins can't be
used as we want.

Setting the NoDecay flag on a QOS could be an option, but I suppose it also
impacts the fairshare multifactor priority of all jobs using this QOS.

This means I have no solution to limit a project as we want, unless SchedMD
changes its behavior or adds a new feature.

Thanks a lot.

Regards, 
Gérard
<http://www.cines.fr/>




De: "Bjørn-Helge Mevik" 
À: slurm-us...@schedmd.com
Envoyé: Jeudi 23 Juin 2022 12:39:27
Objet: Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage




Ole Holm Nielsen  writes:

Hi Bjørn-Helge,



Hello, Ole! :)

On 6/23/22 09:18, Bjørn-Helge Mevik wrote:

In Slurm the same internal variables are used for fairshare calculations as
for GrpTRESMins (and similar), so when fair share priorities are in use,
Slurm will reduce accumulated GrpTRESMins over time. This means that it
is impossible(*) to use GrpTRESMins limits and fairshare
priorities at the same time.



This is a surprising observation!



I discovered it quite a few years ago, when we wanted to use Slurm to
enforce cpu hour quota limits (instead of using Maui+Gold). Can't
remember anymore if I was surprised or just sad. :D

We use a 14 days HalfLife in slurm.conf:
PriorityDecayHalfLife=14-0

Since our longest running jobs can run only 7 days, maybe our limits
never get reduced as you describe?



The accumulated usage is reduced every 5 minutes (by default; see
PriorityCalcPeriod). The reduction is done by multiplying the
accumulated usage by a number slightly less than 1. The number is
chosen so that the accumulated usage is reduced to 50 % after
PriorityDecayHalfLife (given that you don't run anything more in between, of course).

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Christopher Benjamin Coffey
Hi Miguel, 

This is intriguing, as I didn't know about this possibility of dealing with
fairshare and a minutes-limited QOS at the same time. How can you
verify how many minutes have been used of a QOS that has been set up with
GrpTRESMins? Is that possible? Thanks.

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 6/23/22, 9:44 AM, "slurm-users on behalf of Miguel Oliveira" 
 
wrote:

Hi Gérard,
It is not exactly true that you have no solution to limit projects. If you 
implement each project as an account then you can create an account qos with 
the NoDecay flags.
This will not affect associations so priority and fair share are not 
impacted.

The way we do it is to create a qos:

sacctmgr -i --quiet create qos "{{ item.account }}" set 
flags=DenyOnLimit,NoDecay GrpTRESMin=cpu=600


And then use this qos when the account (project) is created:

sacctmgr -i --quiet add account "{{ item.account }}" Parent="{{ item.parent 
}}" QOS="{{ item.account }}" Fairshare=1 Description="{{ item.description }}”

We even have a slurm bank implementation to play along with this technique 
and it has not failed us yet too much! :)

Hope that helps,

Miguel Afonso Oliveira



On 23 Jun 2022, at 14:57, gerard@cines.fr wrote:

Hi Ole and B/H,

Thanks for your answers.



You're right, B/H: since I tuned the TRESBillingWeights option to only count
CPUs, in my case the number of reserved cores = "TRES billing cost".

You're right again: I forgot the PriorityDecayHalfLife parameter, which is
also used by the fairshare multifactor priority.
We use multifactor priority to manage the priority of jobs in the queue,
and we set the values of PriorityDecayHalfLife and PriorityUsageResetPeriod
according to those needs.
So PriorityDecayHalfLife will decay GrpTRESRaw, and GrpTRESMins can't be
used as we want.

Setting the NoDecay flag on a QOS could be an option, but I suppose it also
impacts the fairshare multifactor priority of all jobs using this QOS.

This means I have no solution to limit a project as we want, unless SchedMD
changes its behavior or adds a new feature.

Thanks a lot.

Regards, 
Gérard
 




De: "Bjørn-Helge Mevik" 
À: slurm-us...@schedmd.com
Envoyé: Jeudi 23 Juin 2022 12:39:27
Objet: Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage




 Ole Holm Nielsen  writes:

 Hi Bjørn-Helge,



 Hello, Ole! :)

 On 6/23/22 09:18, Bjørn-Helge Mevik wrote:

 In Slurm the same internal variables are used for fairshare calculations as
 for GrpTRESMins (and similar), so when fair share priorities are in use,
 Slurm will reduce accumulated GrpTRESMins over time.  This means that it
 is impossible(*) to use GrpTRESMins limits and fairshare
 priorities at the same time.



 This is a surprising observation!



 I discovered it quite a few years ago, when we wanted to use Slurm to
 enforce cpu hour quota limits (instead of using Maui+Gold).  Can't
 remember anymore if I was surprised or just sad. :D

 We use a 14 days HalfLife in slurm.conf:
 PriorityDecayHalfLife=14-0

 Since our longest running jobs can run only 7 days, maybe our limits
 never get reduced as you describe?



 The accumulated usage is reduced every 5 minutes (by default; see
 PriorityCalcPeriod).  The reduction is done by multiplying the
 accumulated usage by a number slightly less than 1.  The number is
 chosen so that the accumulated usage is reduced to 50 % after
 PriorityDecayHalfLife (given that you don't run anything more in
 between, of course).  With a halflife of 14 days and the default calc
 period, that number is very close to 1 (0.9998281 if my calculations are
 correct :).
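
 A quick way to reproduce that factor, assuming the 5-minute default calc
 period (both values below expressed in minutes):

     awk 'BEGIN { printf "%.7f\n", 0.5 ^ (5 / (14 * 24 * 60)) }'
     # -> 0.9998281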

 Note: I read all about these details on the schedmd web pages some years
 ago.  I cannot find them again (the parts about the multiplication with
 a number smaller than 1 to get the half life), so I might be wrong on
 some of the details.

 BTW, I've written a handy script for displaying user limits in a
 readable format:
 https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits



 Nice!

 --
 B/H














Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Christopher Benjamin Coffey
Yup, go to the source code for the specifics lol! Thanks :)

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 2/22/22, 10:10 AM, "slurm-users on behalf of Renfro, Michael" 
 wrote:

For later reference, [1] should be the (current) authoritative source on 
data types for the job_desc values: some strings, some numbers, some booleans.

[1] 
https://github.com/SchedMD/slurm/blob/4c21239d420962246e1ac951eda90476283e7af0/src/plugins/job_submit/lua/job_submit_lua.c#L450

From:
slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Tuesday, February 22, 2022 at 11:02 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Can job submit plugin detect "--exclusive" ?


Hi Greg,

Thank you! The key was to use integer boolean instead of true/false. It 
seems this is inconsistent for job_desc elements as some use true/false. Have a 
great one!

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 2/18/22, 9:09 PM, "slurm-users on behalf of Greg Wickham" 
 
wrote:

Hi Chris,

You mentioned "But trials using this do not seem to be fruitful so 
far." ... why?

In our job_submit.lua there is:


if job_desc.shared == 0 then
  slurm.user_msg("exclusive access is not permitted with GPU jobs.")
  slurm.user_msg("Remove '--exclusive' from your job submission 
script")
  return ESLURM_NOT_SUPPORTED
end

and testing:

$ srun --exclusive --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: error: exclusive access is not permitted with GPU jobs.
srun: error: Remove '--exclusive' from your job submission script
srun: error: Unable to allocate resources: Requested operation is 
presently disabled

In slurm.h the job_descriptor struct has:

uint16_t shared;        /* 2 if the job can only share nodes with other
                         *   jobs owned by that user,
                         * 1 if job can share nodes with other jobs,
                         * 0 if job needs exclusive access to the node,
                         * or NO_VAL to accept the system default.
                         * SHARED_FORCE to eliminate user control. */

If there’s a case where using “.shared” isn’t working please let us 
know.

   -Greg


    From: slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Saturday, 19 February 2022 at 3:17 am
To: slurm-users 
Subject: [EXTERNAL] [slurm-users] Can job submit plugin detect 
"--exclusive" ?

Hello!

The job_submit plugin doesn't appear to have a way to detect whether a 
user requested "--exclusive". Can someone confirm this? Going through the code: 
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. 
Potentially "shared" could be
 possible
 in some way. But trials using this do not seem to be fruitful so far.

If a user requests --exclusive, I'd like to append "--exclude=" 
on to their job request to keep them off of certain nodes. For instance, we 
have our gpu nodes in a default partition with a high priority so that jobs 
don't land on them until last.
 And
 this is the same for our highmem nodes. Normally this works fine, but 
if someone asks for "--exclusive" this will land on these nodes quite often 
unfortunately.

Any ideas? Of course, I could take these nodes out of the partition, 
yet I'd like to see if something like this would be possible.

Thanks! :)

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167










Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Christopher Benjamin Coffey
Hi Greg,

Thank you! The key was to use integer boolean instead of true/false. It seems 
this is inconsistent for job_desc elements as some use true/false. Have a great 
one!
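
For the archives, the detect-and-exclude idea boils down to something like the
sketch below. The job_desc.shared check is the part that worked; the exc_nodes
handling and the node names are illustrative only and still an assumption on
my end:

    -- job_desc.shared == 0 means the job asked for exclusive access
    if job_desc.shared == 0 then
       local keep_off = "cn[100-103],gpu[01-04]"   -- placeholder node list
       -- appending to exc_nodes is the untested part of this sketch
       if job_desc.exc_nodes == nil or job_desc.exc_nodes == "" then
          job_desc.exc_nodes = keep_off
       else
          job_desc.exc_nodes = job_desc.exc_nodes .. "," .. keep_off
       end
    end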

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 2/18/22, 9:09 PM, "slurm-users on behalf of Greg Wickham" 
 
wrote:

Hi Chris,

You mentioned "But trials using this do not seem to be fruitful so far."
... why?

In our job_submit.lua there is:


if job_desc.shared == 0 then
  slurm.user_msg("exclusive access is not permitted with GPU jobs.")
  slurm.user_msg("Remove '--exclusive' from your job submission script")
  return ESLURM_NOT_SUPPORTED
end

and testing:

$ srun --exclusive --time 00:10:00 --gres gpu:1 --pty /bin/bash -i 
srun: error: exclusive access is not permitted with GPU jobs.
srun: error: Remove '--exclusive' from your job submission script
srun: error: Unable to allocate resources: Requested operation is presently 
disabled

In slurm.h the job_descriptor struct has:

uint16_t shared;        /* 2 if the job can only share nodes with other
                         *   jobs owned by that user,
                         * 1 if job can share nodes with other jobs,
                         * 0 if job needs exclusive access to the node,
                         * or NO_VAL to accept the system default.
                         * SHARED_FORCE to eliminate user control. */

If there’s a case where using “.shared” isn’t working please let us know.

   -Greg


From: slurm-users  on behalf of 
Christopher Benjamin Coffey 
Date: Saturday, 19 February 2022 at 3:17 am
To: slurm-users 
Subject: [EXTERNAL] [slurm-users] Can job submit plugin detect 
"--exclusive" ?

Hello!

The job_submit plugin doesn't appear to have a way to detect whether a user 
requested "--exclusive". Can someone confirm this? Going through the code: 
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. 
Potentially "shared" could be possible
 in some way. But trials using this do not seem to be fruitful so far.

If a user requests --exclusive, I'd like to append "--exclude=" on 
to their job request to keep them off of certain nodes. For instance, we have 
our gpu nodes in a default partition with a high priority so that jobs don't 
land on them until last. And
 this is the same for our highmem nodes. Normally this works fine, but if 
someone asks for "--exclusive" this will land on these nodes quite often 
unfortunately.

Any ideas? Of course, I could take these nodes out of the partition, yet 
I'd like to see if something like this would be possible.

Thanks! :)

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167







[slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-18 Thread Christopher Benjamin Coffey
Hello!

The job_submit plugin doesn't appear to have a way to detect whether a user 
requested "--exclusive". Can someone confirm this? Going through the code: 
src/plugins/job_submit/lua/job_submit_lua.c I don't see anything related. 
Potentially "shared" could be possible in some way. But trials using this do 
not seem to be fruitful so far.

If a user requests --exclusive, I'd like to append "--exclude=" on to 
their job request to keep them off of certain nodes. For instance, we have our 
gpu nodes in a default partition with a high priority so that jobs don't land 
on them until last. And this is the same for our highmem nodes. Normally this 
works fine, but if someone asks for "--exclusive" this will land on these nodes 
quite often unfortunately.

Any ideas? Of course, I could take these nodes out of the partition, yet I'd 
like to see if something like this would be possible.

Thanks! :)

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] Issues upgrading db from 20.11.7 -> 21.08.4

2022-02-04 Thread Christopher Benjamin Coffey
Hello!

I figured it out: it was a disk space issue. I thought I had checked this 
already. Please disregard! Thank you!
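
For anyone who hits the same "Lost connection to MySQL server" during a
conversion: the thing I should have checked first was free space in the
MariaDB data directory, e.g.:

    df -h /var/lib/mysql    # default datadir; adjust if yours differs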

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 2/4/22, 11:41 AM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hello!

I'm trying to test an upgrade of our production slurm db on a test cluster. 
Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a 
dump of the production db, and imported as normal. Then firing up slurmdbd to 
perform the conversion. I've verified everything I can think of but I'm 
thinking maybe I'm missing a timeout related mariadb tweak or something to 
prevent the db from "going away" during the conversion.. See the slurmdbd log 
below ... I've tried doing the upgrade both ways, via the systemd start script, 
and manually starting slurmdbd by hand. Anyone run into this before? 

Here are my innodb.conf settings:

[mysqld]
innodb_buffer_pool_size=1M
innodb_log_file_size=64M
innodb_lock_wait_timeout=1
max_allowed_packet=16M
net_read_timeout=1
connect_timeout=1

===
[root@storm mariadb]# time slurmdbd -D -vvv
slurmdbd: WARNING: MessageTimeout is too high for effective fault-tolerance
slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called 
for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL 
server version is: 5.5.5-10.3.28-MariaDB
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_buffer_pool_size: 10737418240
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_log_file_size: 67108864
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_lock_wait_timeout: 1
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: 
pre-converting usage table for monsoon
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_day_table'
alter table "monsoon_usage_day_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_hour_table'
alter table "monsoon_usage_hour_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_month_table'
alter table "monsoon_usage_month_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: 
pre-converting job table for monsoon
slurmdbd: adding column container after consumed_energy in table 
"monsoon_step_table"
slurmdbd: adding column submit_line after req_cpufreq_gov in table 
"monsoon_step_table"
slurmdbd: debug:  Table "monsoon_step_table" has changed.  Updating...
slurmdbd: error: mysql_query failed: 2013 Lost connection to MySQL server 
during query
alter table "monsoon_step_table" modify `job_db_inx` bigint unsigned not 
null, modify `deleted` tinyint default 0 not null, modify `exit_code` int 
default 0 not null, modify `id_step` int not null, modify `step_het_comp` int 
unsigned default 0xfffe not null, modify `kill_requid` int default -1 not 
null, modify `nodelist` text not null, modify `nodes_alloc` int unsigned not 
null, modify `node_inx` text, modify `state` smallint unsigned not null, modify 
`step_name` text not null, modify `task_cnt` int unsigned not null, modify 
`task_dist` int default 0 not null, modify `time_start` bigint unsigned default 
0 not null, modify `time_end` bigint unsigned default 0 not null, modify 
`time_suspended` bigint unsigned default 0 not null, modify `user_sec` bigint 
unsigned default 0 not null, modify `user_usec` int unsigned default 0 not 
null, modify `sys_sec` bigint unsigned default 0 not null, modify `sys_usec` 
int unsigned default 0 not null, modify `act_cpufreq` double unsigned default 
0.0 not null, modify `consumed_energ

[slurm-users] Issues upgrading db from 20.11.7 -> 21.08.4

2022-02-04 Thread Christopher Benjamin Coffey
Hello!

I'm trying to test an upgrade of our production slurm db on a test cluster. 
Specifically I'm trying to verify an update from 20.11.7 to 21.08.4. I have a 
dump of the production db, and imported as normal. Then firing up slurmdbd to 
perform the conversion. I've verified everything I can think of but I'm 
thinking maybe I'm missing a timeout related mariadb tweak or something to 
prevent the db from "going away" during the conversion.. See the slurmdbd log 
below ... I've tried doing the upgrade both ways, via the systemd start script, 
and manually starting slurmdbd by hand. Anyone run into this before? 

Here are my innodb.conf settings:

[mysqld]
innodb_buffer_pool_size=1M
innodb_log_file_size=64M
innodb_lock_wait_timeout=1
max_allowed_packet=16M
net_read_timeout=1
connect_timeout=1

===
[root@storm mariadb]# time slurmdbd -D -vvv
slurmdbd: WARNING: MessageTimeout is too high for effective fault-tolerance
slurmdbd: debug:  Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for 
db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL 
server version is: 5.5.5-10.3.28-MariaDB
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_buffer_pool_size: 10737418240
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_log_file_size: 67108864
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: 
innodb_lock_wait_timeout: 1
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: 
pre-converting usage table for monsoon
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_day_table'
alter table "monsoon_usage_day_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_hour_table'
alter table "monsoon_usage_hour_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 
'monsoon_usage_month_table'
alter table "monsoon_usage_month_table" change resv_secs plan_secs bigint 
unsigned default 0 not null;
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_alter_query: The 
database appears to have been altered by a previous upgrade attempt, continuing 
with upgrade.
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: 
pre-converting job table for monsoon
slurmdbd: adding column container after consumed_energy in table 
"monsoon_step_table"
slurmdbd: adding column submit_line after req_cpufreq_gov in table 
"monsoon_step_table"
slurmdbd: debug:  Table "monsoon_step_table" has changed.  Updating...
slurmdbd: error: mysql_query failed: 2013 Lost connection to MySQL server 
during query
alter table "monsoon_step_table" modify `job_db_inx` bigint unsigned not null, 
modify `deleted` tinyint default 0 not null, modify `exit_code` int default 0 
not null, modify `id_step` int not null, modify `step_het_comp` int unsigned 
default 0xfffe not null, modify `kill_requid` int default -1 not null, 
modify `nodelist` text not null, modify `nodes_alloc` int unsigned not null, 
modify `node_inx` text, modify `state` smallint unsigned not null, modify 
`step_name` text not null, modify `task_cnt` int unsigned not null, modify 
`task_dist` int default 0 not null, modify `time_start` bigint unsigned default 
0 not null, modify `time_end` bigint unsigned default 0 not null, modify 
`time_suspended` bigint unsigned default 0 not null, modify `user_sec` bigint 
unsigned default 0 not null, modify `user_usec` int unsigned default 0 not 
null, modify `sys_sec` bigint unsigned default 0 not null, modify `sys_usec` 
int unsigned default 0 not null, modify `act_cpufreq` double unsigned default 
0.0 not null, modify `consumed_energy` bigint unsigned default 0 not null, add 
`container` text after consumed_energy, modify `req_cpufreq_min` int unsigned 
default 0 not null, modify `req_cpufreq` int unsigned default 0 not null, 
modify `req_cpufreq_gov` int unsigned default 0 not null, add `submit_line` 
text after req_cpufreq_gov, modify `tres_alloc` text not null default '', 
modify `tres_usage_in_ave` text not null default '', modify `tres_usage_in_max` 
text not null default '', modify `tres_usage_in_max_taskid` text not null 
default '', modify `tres_usage_in_max_nodeid` text not null 

[slurm-users] Another batch script archiver solution

2021-10-05 Thread Christopher Benjamin Coffey
Howdy,

With the release of 21.08 series of slurm, we now have the ability to archive 
batch scripts within slurm. Yeah, thanks! This is very cool and handy, yet 
before this feature was added to slurm, we developed another option that may be 
of interest to you. In my opinion, it’s a better one as it does not clutter 
your db with who knows what that users could be submitting in their jobscripts.

Have a look at our job script archiver: https://github.com/nauhpc/job_archive 
as it could be a better solution for you!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] Hidden partition visibility issue

2021-02-03 Thread Christopher Benjamin Coffey
Hah, sigh, yes that would be just fine ... *blushes*. Thanks!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 1/21/21, 11:10 PM, "slurm-users on behalf of Marcus Wagner" 
 
wrote:

Hi Christopher,

doesn't it suffice to use the "-a" option, e.g. "sinfo -s -a" or "squeue 
-a"?
The admins could create an alias for that.

Best
Marcus

Am 21.01.2021 um 19:15 schrieb Christopher Benjamin Coffey:
> Hi,
> 
> It doesn't appear to be possible to hide a partition from all normal 
users, but allow for the slurm admins and condo users to still see. While a 
partition is hidden, it is required to use "sudo" to see the partition even 
from a slurm admin. This behavior is seen while adding the following to the 
partition declaration: AllowAccounts=, and AllowGroups=
> 
> We will be having a number of condo partitions, and would be nice if only 
the slurm admins, and condo owner could see the partition only.
> 
> Anyone have a work around? 
> 
> Best,
> Chris
>   
> 

-- 
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ




[slurm-users] Hidden partition visibility issue

2021-01-21 Thread Christopher Benjamin Coffey
Hi,

It doesn't appear to be possible to hide a partition from all normal users while
still allowing the Slurm admins and condo users to see it. While a partition is
hidden, even a Slurm admin must use "sudo" to see the partition. This behavior is
seen when adding the following to the partition declaration: AllowAccounts=, and
AllowGroups=
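
For concreteness, the sort of declaration being tested (all names are placeholders):

    PartitionName=condo1 Nodes=cn[101-104] Hidden=YES AllowAccounts=condo1 AllowGroups=condogrp1 State=UP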

We will be having a number of condo partitions, and it would be nice if only the
Slurm admins and the condo owner could see the partition.

Anyone have a work around? 

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] Getting --gpus -request in job_submit.lua

2020-11-25 Thread Christopher Benjamin Coffey
Hi Niels,

Have you found a solution? I just noticed this recently as well. We've
traditionally told our users to use --gres=gpu:tesla:# for requesting gpus.
Then, our job submit plugin would detect the gres ask, specifically gpu, and
set a qos and partition accordingly. Unfortunately I started pushing folks
to use -G1, or --gpus=1, for simplicity and just realized our plugin does not
pick up gpu stuff anymore. Looking at the docs here:

https://slurm.schedmd.com/job_submit_plugins.html

The lua portion says that the function "_get_job_req_field()" should
highlight the attributes available. Yet, the gpu request specifics don't appear
to be there in the code:

https://github.com/SchedMD/slurm/blob/master/src/plugins/job_submit/lua/job_submit_lua.c

Here's hoping the Slurm devs can add them, or point to the correct attributes to
use. I did try "gpus_per_task" but that didn't work.
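
In case it helps, the direction I'm experimenting with is to check the other
tres_per_* fields too; whether each field is exposed, and the exact string it
holds, seems to depend on the Slurm version, so treat this purely as a sketch:

    -- with --gres=gpu:N the request shows up in tres_per_node; with --gpus=N
    -- it appears to land in tres_per_job instead (version dependent)
    local function wants_gpu(job_desc)
       local fields = { "tres_per_node", "tres_per_job",
                        "tres_per_socket", "tres_per_task" }
       for _, name in ipairs(fields) do
          local tres = job_desc[name]
          if tres ~= nil and string.find(tres, "gpu", 1, true) then
             return true
          end
       end
       return false
    end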

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 9/30/20, 6:13 AM, "slurm-users on behalf of Niels Carl Hansen" 
 wrote:

I am trying to retrieve the number of requested GPUs in job_submit.lua

If the job is submitted with a --gres -flag, as in "sbatch 
--gres=gpu:2...", I can get the
information in job_submit.lua via the variable 'job_desc.tres_per_node'.

But if the job is submitted with the --gpus -flag, as in "sbatch 
--gpus=2", then 'job_desc.tres_per_node'
is nil.

How can I dig out the number of requested GPUs in job_submit.lua in the 
latter case?
I am running Slurm 20.02.5.

Thanks in advance.

Niels Carl Hansen
Aarhus University, Denmark




Re: [slurm-users] Reserving a GPU (Christopher Benjamin Coffey)

2020-11-02 Thread Christopher Benjamin Coffey
Hi All,

Anyone know if its possible yet to reserve a gpu?  Maybe in 20.02? Thanks!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 5/19/20, 3:04 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hi Lisa,

Im actually referring to the ability to create a reservation that includes 
a gpu resource. It doesn't seem to be possible, which seems strange. This would 
be very helpful for us to have a floating gpu reservation.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 5/19/20, 1:47 PM, "slurm-users on behalf of Lisa Kay Weihl" 
 wrote:


I am a newbie at the Slurm setup but if by reservable you also mean a 
consumable resource I am able to request gpus and I have Slurm 20.02.1 and cuda 
10.2.  I just set this up within the last month.




***
Lisa Weihl Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116   |Fax: (419) 372-8061
lwe...@bgsu.edu

http://www.bgsu.edu/











Message: 1
Date: Tue, 19 May 2020 18:19:26 +
    From: Christopher Benjamin Coffey 
To: Slurm User Community List 
Subject: Re: [slurm-users] Reserving a GPU
Message-ID: <387dee1d-f060-47c3-afb9-0309684c2...@nau.edu>
Content-Type: text/plain; charset="utf-8"

Hi All,

Can anyone confirm that GPU is still not a reservable resource? It 
doesn't seem to be possible still in 19.05.6. I haven't tried 20.02 series.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



?On 11/11/18, 1:19 AM, "slurm-users on behalf of Chris Samuel" 
 wrote:

    On Tuesday, 6 November 2018 5:30:31 AM AEDT Christopher Benjamin 
Coffey wrote:

> Can anyone else confirm that it is not possible to reserve a GPU? 
Seems a
> bit strange.

This looks like the bug that was referred to previously.



https://bugs.schedmd.com/show_bug.cgi?id=5771

Although looking at the manual page for scontrol in the current 
master it only says:

   TRES=
  Comma-separated list of TRES required for the 
reservation. Current
  supported TRES types with reservations are: CPU, 
Node, License and
  BB.

But it's early days yet for that release..

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC











Re: [slurm-users] Reserving a GPU (Christopher Benjamin Coffey)

2020-05-19 Thread Christopher Benjamin Coffey
Hi Lisa,

Im actually referring to the ability to create a reservation that includes a 
gpu resource. It doesn't seem to be possible, which seems strange. This would 
be very helpful for us to have a floating gpu reservation.

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 5/19/20, 1:47 PM, "slurm-users on behalf of Lisa Kay Weihl" 
 wrote:


I am a newbie at the Slurm setup but if by reservable you also mean a 
consumable resource I am able to request gpus and I have Slurm 20.02.1 and cuda 
10.2.  I just set this up within the last month.




***
Lisa Weihl Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116   |Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu











Message: 1
Date: Tue, 19 May 2020 18:19:26 +
From: Christopher Benjamin Coffey 
To: Slurm User Community List 
Subject: Re: [slurm-users] Reserving a GPU
Message-ID: <387dee1d-f060-47c3-afb9-0309684c2...@nau.edu>
Content-Type: text/plain; charset="utf-8"

Hi All,

Can anyone confirm that GPU is still not a reservable resource? It doesn't 
seem to be possible still in 19.05.6. I haven't tried 20.02 series.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



?On 11/11/18, 1:19 AM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On Tuesday, 6 November 2018 5:30:31 AM AEDT Christopher Benjamin Coffey 
wrote:

> Can anyone else confirm that it is not possible to reserve a GPU? 
Seems a
> bit strange.

This looks like the bug that was referred to previously.



https://bugs.schedmd.com/show_bug.cgi?id=5771

Although looking at the manual page for scontrol in the current master 
it only says:

   TRES=
  Comma-separated list of TRES required for the 
reservation. Current
  supported TRES types with reservations are: CPU, Node, 
License and
  BB.

But it's early days yet for that release..

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC










Re: [slurm-users] Meaning of assoc_limit_stop

2020-01-09 Thread Christopher Benjamin Coffey
Hi All,

Thought I'd try this one more time. Anyone have the "assoc_limit_stop" option in
use? Care to try explaining what it does exactly? This doesn't really make a
ton of sense as it is stated in the man page:

assoc_limit_stop
 If set and a job cannot start due to association limits, 
then do not attempt to initiate any lower priority jobs in that partition.
 Setting  this  can  decrease  system throughput and 
utilization, but avoid potentially starving larger jobs by preventing them from
 launching indefinitely.

Seems it should instead say:

assoc_limit_stop
 If set and a job cannot start due to association limits, 
then do not attempt to initiate any lower priority jobs FROM THAT ASSOCIATION 
in that partition.
 Setting  this  can  decrease  system throughput and 
utilization, but avoid potentially starving larger jobs by preventing them from
 launching indefinitely.

Can anyone confirm that the behavior is in line with my modified explanation? 
Otherwise, I don't see why you'd use that option..

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 
 



[slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Christopher Benjamin Coffey
Hi All,

I wonder if any of you have seen these errors in slurmdbd.log

error: persistent connection experienced an error

When we see these errors, we also see job errors related to accounting
in Slurm, like:

slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should 
never happen
slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should 
never happen
srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource 
temporarily unavailable

I haven't been able to figure out what makes the slurmdbd get into this 
condition. The slurm controller, and slurmdbd are on the same box, so it's 
increasingly odd that the slurmdbd can't communicate with slurmctld. While we 
figure this out, we have begun restarting slurmctl and slurmdbd every day to 
try and keep them "in sync". 

Anyone seen this? Any thoughts? Maybe the one port shown by:

sacctmgr list cluster

becomes overwhelmed at times? We have a range of ports for the controller to be
contacted on. Maybe the db should try another port if that's the issue?

SlurmctldPort=6900-6950

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Christopher Benjamin Coffey
Hey Chris,

Thanks! Ya, my qos name is billybob for testing. I believe I was setting it 
right, but not able to confirm it correctly.

sacctmgr update qos name=billybob set maxjobsaccrueperuser=8 -i

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob 
format=MaxJobsAccruePerUser
MaxJobsAccruePU 
--- 
  8

I guess it's getting set right, but I wonder why it's not shown by:

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob
  Name   Priority  GraceTimePreempt   PreemptExemptTime PreemptMode 
   Flags UsageThres UsageFactor   GrpTRES   
GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall   MaxTRES 
MaxTRESPerNode   MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU
 MaxTRESPA MaxJobsPA MaxSubmitPA   MinTRES 
-- -- -- -- --- --- 
 -- --- - 
- - --- - --- - 
-- - --- - - --- 
- - --- - 
  billybob  0   00:00:00 explorato+ cluster 
   1.00 


 

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob 
format=maxjobsaccrueperuser
MaxJobsAccruePU 
--- 
  8

Maybe because that setting is just not included in the default list of settings
shown? That is counterintuitive given this in the man page for sacctmgr:

show  []
  Display information about the specified entity.  By default, all 
entries are displayed, you can narrow results by specifying
  SPECS in your query.  Identical to the list command.

Thoughts? Thanks!
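
For the record, asking for the column explicitly alongside the usual ones does
show it, e.g.:

    sacctmgr show qos where name=billybob format=Name,Priority,MaxJobsPU,MaxSubmitPU,MaxJobsAccruePerUser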

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 10:46 PM, "slurm-users on behalf of Chris Samuel" 
 wrote:

Hi Chris,

On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote:

> What am I missing?

It's just a setting on the QOS, not the user:

csamuel@cori01:~> sacctmgr show qos where name=regular_1 
format=MaxJobsAccruePerUser
MaxJobsAccruePU
---
   2

So any user in that QOS can only have 2 jobs ageing at any one time.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA





Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Hmm, after trying this out I'm confused. I don't see the limit placed on the
qos. In fact, I see that the qos header is missing some other options that are
available in the man page. Maybe I'm missing an option that enables some of
those options.

[ddd@siris /home/ddd]$ sacctmgr update qos name=billybob set 
maxjobsaccrueperuser=8 -i
 Modified qos...
  billybob
[ddd@siris /home/ddd ]$ sacctmgr list qos -p|grep billybob
billybob|0|00:00:00|exploratory,free||cluster|||1.00||

[ddd@siris /home/ddd ]$ sacctmgr list qos -p|grep Name
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|

What am I missing?

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:23 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Ahh hah! Thanks Killian!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti" 
 wrote:

Hi Chris,

On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey
 wrote:
> I believe I heard recently that you could limit the number of users 
jobs that accrue age priority points. Yet, I cannot find this option in the man 
pages. Anyone have an idea? Thank you!

It's the *JobsAccrue* options in 
https://slurm.schedmd.com/sacctmgr.html

Cheers,
-- 
Kilian







Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Ahh hah! Thanks Killian!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti" 
 wrote:

Hi Chris,

On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey
 wrote:
> I believe I heard recently that you could limit the number of users jobs 
that accrue age priority points. Yet, I cannot find this option in the man 
pages. Anyone have an idea? Thank you!

It's the *JobsAccrue* options in 
https://slurm.schedmd.com/sacctmgr.html

Cheers,
-- 
Kilian





[slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Hi,

I believe I heard recently that you could limit the number of a user's jobs that
accrue age priority points. Yet, I cannot find this option in the man pages.
Anyone have an idea? Thank you!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-29 Thread Christopher Benjamin Coffey
Brian, I've actually just started attempting to build slurm 19 on centos 8 
yesterday. As you say, there are packages missing now from repos like:

rpmbuild -ta slurm-19.05.3-2.tar.bz2 --define '%_with_lua 1' --define 
'%_with_x11 1'
warning: Macro expanded in comment on line 22: %_prefix pathinstall 
path for commands, libraries, etc.

warning: Macro expanded in comment on line 29: %_with_lua path  build 
Slurm lua bindings

warning: Macro expanded in comment on line 161: %define 
_unpackaged_files_terminate_build  0

error: Failed build dependencies:
pkgconfig(lua) >= 5.1.0 is needed by slurm-19.05.3-2.el8.x86_64
python is needed by slurm-19.05.3-2.el8.x86_64

python (they have python2, and python3 now)
pkgconfig(lua)

I was thinking of putting work into getting the spec file up to snuff, or
building from source without the spec file.
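
What I'm trying first, as assumptions to verify rather than a recipe: lua-devel
should satisfy pkgconfig(lua) on EL8, and the unversioned "python" BuildRequires
probably has to be pointed at python3 in the spec, or side-stepped with --nodeps
once python3 is installed:

    dnf install lua-devel python36
    # then either patch "BuildRequires: python" in slurm.spec, or:
    rpmbuild -ta slurm-19.05.3-2.tar.bz2 --define '%_with_lua 1' --define '%_with_x11 1' --nodeps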

Brian, what route are you taking? Building rpms, or just building with 
configure, make/make install?

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 10/28/19, 12:18 PM, "slurm-users on behalf of Brian Andrus" 
 wrote:

I spoke too soon.
While I can successfully build/run slurmctld, slurmd is failing because ALL 
of the SelectType libraries are missing symbols.
Example from select_cons_tres.so:
# slurmd
slurmd: error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/select_cons_tres.so): 
/usr/lib64/slurm/select_cons_tres.so:
undefined symbol: powercap_get_cluster_current_cap
slurmd: error: Couldn't load specified plugin name for select/cons_tres: 
Dlopen of plugin file failed
slurmd: fatal: Can't find plugin for select/cons_tres


# nm -D /usr/lib64/slurm/libslurmfull.so|grep powercap_
0010f7b8 T slurm_free_powercap_info_msg
00060060 T slurm_print_powercap_info_msg

So, sure enough powercap_get_cluster_current_cap is not in there.
Methinks the linking needs examined.

Brian Andrus


On 10/28/2019 2:32 AM, Benjamin Redling wrote:


On 28/10/2019 08.26, Bjørn-Helge Mevik wrote:
Taras Shapovalov  
 writes:

Do I understand correctly that Slurm19 is not compatible with rhel8? It is
not in the list https://slurm.schedmd.com/platforms.html 


It says

"RedHat Enterprise Linux 7 (RHEL7), CentOS 7, Scientific Linux 7 (and 
newer)"

Perhaps that includes RHEL8, and CentOS 8, not only Scientific Linux 8?


AFAIK there won't be a Scientific Linux 8 (by Fermilab):

https://listserv.fnal.gov/scripts/wa.exe?A2=SCIENTIFIC-LINUX-ANNOUNCE;11d6001.1904
 


So it seems if there aren't any other maintainers taking care of a
potential SL8 and "and newer" was written intentionally it has to be
RHEL or CentOS 8.

Regards,
Benjamin






Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-29 Thread Christopher Benjamin Coffey
Hi Marcus, yes we are talking about the jobacct_gather/cgroup plugin. Yes, if 
you want cgroups you need:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

But that doesn't mean you have to run the jobacct_gather/cgroup plugin, you 
have the option to use jobacct_gather/linux instead. Per the devs at SLUG19, 
they say the linux version has less overhead than the cgroup variant. I made a 
switch in production and haven't really seen any differences except for seeing 
extra info on the extern step which is a bonus I guess.
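
For reference, the relevant combination in our slurm.conf looks roughly like
this (an excerpt, not a complete config):

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    JobAcctGatherType=jobacct_gather/linux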

It would be nice to have some more clarification from other sites, or devs on 
this.

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 10/25/19, 6:15 AM, "slurm-users on behalf of Marcus Wagner" 
 
wrote:

Aren't we here talking about the job accounting gather plugin?

I would say you have to use cgroups, together with
ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup, if you want to
use the jobacct_gather/cgroup plugin, because otherwise Slurm does not
pack the jobs into cgroups.

Best
Marcus

On 10/25/19 1:48 AM, Brian Andrus wrote:
> IIRC, the big difference is if you want to use cgroups on the nodes. 
> You must use the cgroup plugin.
>
> Brian Andrus
>
> On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote:
>> Hi Juergen,
>>
>>  From what I see so far, there is nothing missing from the 
>> jobacct_gather/linux plugin vs the cgroup version. In fact, the 
>> extern step now has data where as it is empty when using the cgroup 
>> version.
>>
>> Anyone know the differences?
>>
>> Best,
>> Chris
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de

www.itc.rwth-aachen.de






Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Christopher Benjamin Coffey
Hi Juergen,

From what I see so far, there is nothing missing from the jobacct_gather/linux
plugin vs the cgroup version. In fact, the extern step now has data, whereas it
is empty when using the cgroup version.

Anyone know the differences?

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 10/22/19, 10:52 AM, "slurm-users on behalf of Juergen Salk" 
 
wrote:

Dear Chris,

I could not find this warning in the slurm.conf man page. So I googled
it and found a reference in the Slurm developers documentation: 


https://slurm.schedmd.com/jobacct_gatherplugins.html

However, this web page says in its footer: "Last modified 27 March 2015". 
So maybe (means: hopefully) this caveat is somewhat outdated today. 

I have also `JobAcctGatherType=jobacct_gather/cgroup´ in my slurm.conf 
but for no deeper reason than that we also use cgroups for
process tracking (i.e. ProctrackType=proctrack/cgroup) and to limit 
resources used by users. So it just felt more consistent to me to 
use cgroups for jobacct_gather plugin as well - even though SchedMD 
recommends jobacct_gather/linux (according to the slurm.conf man page)

That said, I'd also be interested in the pros and cons of 
jobacct_gather/cgroup 
versus jobacct_gather/linux and also why jobacct_gather/linux is the 
recommended
one.

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471

    

    

    * Christopher Benjamin Coffey  [191022 16:26]:
> Hi,
> 
> We've been using jobacct_gather/cgroup for quite some time and haven't 
had any issues (I think). We do see some lengthy job cleanup times when there 
are lots of small jobs completing at once, maybe that is due to the cgroup 
plugin. At SLUG19 a slurm dev presented information that the 
jobacct_gather/cgroup plugin has quite the performance hit and that 
jobacct_gather/linux should be set instead. 
> 
> Can someone help me with the difference between these two gather plugins? 
If one were to switch to jobacct_gather/linux, what are the cons? Do you lose 
some job resource usage information?
> 
> Checking out the docs again on schedmd site regarding the jobacct_gather 
plugins I see:
> 
> cgroup — Gathers information from Linux cgroup infrastructure and adds 
this information to the standard rusage information also gathered for each job. 
(Experimental, not to be used in production.)
> 
> I don't believe I saw that before: "Experimental" ! Hah.
> 
> Thanks!
> 
> Best,
> Chris
>  
> -- 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
>  
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A





Re: [slurm-users] One time override to force run job

2019-09-04 Thread Christopher Benjamin Coffey
Hi Tina,

I think you could just have a qos called "override" that has no limits, or 
maybe just high limits. Then, just modify the job's qos to be "override" with 
scontrol. Based on your setup, you may also have to update the jobs account to 
an "override" type account with no limits.

We do this from time to time.
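
Roughly (names and the job id are illustrative only):

    # one-time setup: a QOS with no, or very high, limits
    sacctmgr add qos override
    # then, for the job that is held back:
    scontrol update jobid=12345 qos=override
    # and, depending on your limits setup, possibly also:
    # scontrol update jobid=12345 account=override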

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 9/2/19, 12:47 PM, "slurm-users on behalf of Tina Fora" 
 wrote:

Hello,

Is there a way to force a job to run that is being held back for
QOSGrpCpuLimit? This is coming from QOS that we have in place. For the
most part it works great but every once in a while we have free nodes that
are idle and I'd like to force the job to run.

Tina






Re: [slurm-users] Slurm Feature Poll

2019-08-28 Thread Christopher Benjamin Coffey
Hi Paul,

I submitted the poll - thanks! For bug #7609, while I'd be happier with a
built-in Slurm solution, you may find that our jobscript archiver implementation
would work nicely for you. It is very high-performing and has no effect on
scheduler or db performance.

The solution is a multithreaded c++ program which starts 1 thread for each 
/var/spool/slurm/hash.N directory. Each thread subscribes to inotify filesystem 
change events and when a new job directory shows up under hash.N, the program 
copies the jobscript file, and environment file to a local archive directory, 
at the same time creating user based ACLs on the files/dirs for security. We 
then have a cron that moves the jobscripts to a NFS share from which users can 
grab their jobscripts if desired. For our model, we wanted only the admins, and 
the user that submitted the script to have access to the jobscripts. Thus the 
reason for the ACLs on the files/dirs. 
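
Purely as a conceptual sketch of the idea (the real implementation is the
multithreaded C++ program in the repo below; paths, file names and the copy
logic here are illustrative and untested):

    # requires inotify-tools; watches the slurmctld state-save hash dirs
    inotifywait -m -e create --format '%w%f' /var/spool/slurm/hash.* |
    while read -r jobdir; do
        dest=/archive/jobscripts/$(basename "$jobdir")
        mkdir -p "$dest"
        sleep 1   # give slurmctld a moment to write the files
        cp -p "$jobdir"/script "$jobdir"/environment "$dest"/ 2>/dev/null
    done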

We tried a slurmctld_prolog solution initially to archive jobs, but it impacted
scheduler performance dramatically.

We have been very happy with it. Check it out, if you find it useful let me 
know!

https://github.com/nauhpc/job_archive

If you have any questions, please let me know!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/28/19, 7:25 AM, "slurm-users on behalf of Paul Edmon" 
 
wrote:

We have several pending feature requests to SchedMD regarding different 
features we would like to see, as I am sure many other groups have.  We 
were curious if anyone else in the community is interested in these 
features and if your group would be interested in talking with us 
(Harvard FAS Research Computing) about getting these implemented 
(possibly through some coalition or collaboration).  Please let us know 
which features you are interested in the poll below and then please list 
your University or organization and contact person.  If you don't want 
to send your preferred contact info just post your organization and we 
will reach out to you.  Thanks in advance.


https://forms.gle/LGWLuu9b3bRcihHs7

-Paul Edmon-

FAS Research Computing

Harvard University






[slurm-users] Node resource is under-allocated

2019-08-27 Thread Christopher Benjamin Coffey
Hi,

Can someone help me understand what this error is?

select/cons_res: node cn95 memory is under-allocated (125000-135000) for 
JobId=23544043

We get a lot of these from time to time and I don't understand what it's about.

Looking at the code it doesn't make sense for this to be happening on running 
jobs.

plugins/select/cons_res/select_cons_res.c

/*
 * deallocate resources previously allocated to the given job
 * - subtract 'struct job_resources' resources from 'struct part_res_record'
 * - subtract job's memory requirements from 'struct node_res_record'
 *
 * if action = 0 then subtract cores, memory + GRES (running job was terminated)
 * if action = 1 then subtract memory + GRES (suspended job was terminated)
 * if action = 2 then only subtract cores (job is suspended)
 */
static int _rm_job_from_res(struct part_res_record *part_record_ptr,
                            struct node_use_record *node_usage,
                            struct job_record *job_ptr, int action)

...
    if (action != 2) {
        if (node_usage[i].alloc_memory < job->memory_allocated[n]) {
            error("%s: node %s memory is under-allocated (%"PRIu64"-%"PRIu64") for %pJ",
                  plugin_type, node_ptr->name,
                  node_usage[i].alloc_memory,
                  job->memory_allocated[n],
                  job_ptr);
            node_usage[i].alloc_memory = 0;
        } else
            node_usage[i].alloc_memory -= job->memory_allocated[n];
    }
...

It appears to me that the function should only be called when a job has basically 
ended or been suspended. Yet, these errors are being printed for running jobs. Is 
slurm actually deallocating resources for that job? And thus there is more 
memory that could be used for other jobs? I don't think that is the case.

Anyone have a thought here?

My initial feeling is .. Who cares if the node is under-allocated? Yes, it 
would be great if the user actually comes close to using the memory/resource 
they asked for so that it is not wasted, but this typically doesn't happen. Is 
this error there to let sysadmins know that maybe you should overprovision the 
memory? Or maybe there is a config issue on our side? I don't think the latter 
is the case.

Thanks!

Best,
Chris


—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] exclusive or not exclusive, that is the question

2019-08-21 Thread Christopher Benjamin Coffey
Marcus, maybe you can try playing with --mem instead? We recommend our users 
use --mem rather than --mem-per-cpu/task, as it makes it easier for users to 
request the right amount of memory for the job. --mem is the amount of memory 
for the whole job, so there is no multiplying of memory * cpu involved. 
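
For example (values illustrative), these two requests end up with the same total 
memory, but the first is usually easier to reason about:

    #SBATCH -c 4
    #SBATCH --mem=8G           # 8G for the whole job

    #SBATCH -c 4
    #SBATCH --mem-per-cpu=2G   # 4 x 2G = 8G for the whole job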

Strange that the cgroup has more memory than possible though.
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/20/19, 11:27 PM, "slurm-users on behalf of Marcus Wagner" 
 
wrote:

One thing, I forgot.

On 8/20/19 4:58 PM, Christopher Benjamin Coffey wrote:
> Hi Marcus,
>
> What is the reason to add "--mem-per-cpu" when the job already has 
exclusive access to the node?
The user (normally) does not set --exclusive directly. We have several 
accounts, whose jobs by default should run exclusively, so we set that 
in the job_submit plugin.
>   Your job has access to all of the memory, and all of the cores on the 
system already. Also note, for non-mpi code like single core job, or shared 
memory threaded job, you want to ask for number of cpus with --cpus-per-task, 
or -c. Unless you are running mpi code, where you will want to use -n, and 
--ntasks instead to launch n copies of the code on n cores. In this case, 
because you asked for -n2, and also specified a mem-per-cpu request, the 
scheduler is doling out the memory as requested (2 x tasks), likely due to 
having SelectTypeParameters=CR_Core_Memory in slurm.conf.
I must say, we would be much happier with a --mem-per-task option 
instead. I still do not know, why one should ask for mem-per-cpu 
(logically), since in a shared memory job, you start one process, the 
threads share the memory.
With an hybrid MPI-code (mpi code with openmp parallelization on the 
tasks), it makes even less sense. If I know, how much memory my tasks 
needs, e.g. 10 GB, I still have to divide that through the number of 
threads (-c) to get the right memory request. For me as an 
administrator, an openmp job is a special hybrid job with only one 
requested task. So it is the same for a shared memory job. I always have 
to divide the really needed memory through the number of threads (or 
cpus-per-task).

Is there anyone, who can enlighten me?
Why does one have to ask for memory per smallest scheduleable (is that 
word right?) unit? Isn't it better to ask for memory per task/process?


Best
Marcus
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>   
>
> On 8/20/19, 1:37 AM, "slurm-users on behalf of Marcus Wagner" 
 
wrote:
>
>  Just made another test.
>  
>  
>  Thank god, the exclusivity is not "destroyed" completely; only one job
>  can run on the node when the job is exclusive. Nonetheless, this is
>  somewhat unintuitive.
>  I wonder, if that also has an influence on the cgroups and the 
process
>  affinity/binding.
>  
>  I will do some more tests.
>  
>  
>  Best
>  Marcus
>  
>  On 8/20/19 9:47 AM, Marcus Wagner wrote:
>  > Hi Folks,
>  >
>  >
>  > I think, I've stumbled over a BUG in Slurm regarding the
>  > exclusiveness. Might also, I've misinterpreted something. I would 
be
>  > happy, if someone could explain that to me in the latter case.
>  >
>  > To the background. I have set PriorityFlags=MAX_TRES
>  > The TRESBillingWeights are "CPU=1.0,Mem=0.1875G" for a partition 
with
>  > 48 core nodes and RealMemory 187200.
>  >
>  > ---
>  >
>  > I have two jobs:
>  >
>  > job 1:
>  > #SBATCH --exclusive
>  > #SBATCH --ntasks=2
>  > #SBATCH --nodes=1
>  >
>  > scontrol show  =>
>  >NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>  >TRES=cpu=48,mem=187200M,node=1,billing=48
>  >
>  > exactly, what I expected, I got 48 CPUs and therefore the billing 
is 48.
>  >
>  > ---
>  >
>  > job 2 (just added mem-per-cpu):
>  > #SBATCH --exclusive
>  > #SBATCH --ntasks=2
>  > #SBATCH --nodes=1
>  > #SBATCH --mem-per-cpu=5000
>  >
>  > scontrol show  =>
>  >NumNodes=1-1 NumCPUs=2 Nu

Re: [slurm-users] exclusive or not exclusive, that is the question

2019-08-20 Thread Christopher Benjamin Coffey
Hi Marcus,

What is the reason to add "--mem-per-cpu" when the job already has exclusive 
access to the node? Your job has access to all of the memory, and all of the 
cores on the system already. Also note, for non-mpi code like single core job, 
or shared memory threaded job, you want to ask for number of cpus with 
--cpus-per-task, or -c. Unless you are running mpi code, where you will want to 
use -n, and --ntasks instead to launch n copies of the code on n cores. In this 
case, because you asked for -n2, and also specified a mem-per-cpu request, the 
scheduler is doling out the memory as requested (2 x tasks), likely due to 
having SelectTypeParameters=CR_Core_Memory in slurm.conf.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/20/19, 1:37 AM, "slurm-users on behalf of Marcus Wagner" 
 
wrote:

Just made another test.


Thank god, the exclusivity is not "destroyed" completely; only one job 
can run on the node when the job is exclusive. Nonetheless, this is 
somewhat unintuitive.
I wonder, if that also has an influence on the cgroups and the process 
affinity/binding.

I will do some more tests.


Best
Marcus

On 8/20/19 9:47 AM, Marcus Wagner wrote:
> Hi Folks,
>
>
> I think, I've stumbled over a BUG in Slurm regarding the 
> exclusiveness. Might also, I've misinterpreted something. I would be 
> happy, if someone could explain that to me in the latter case.
>
> To the background. I have set PriorityFlags=MAX_TRES
> The TRESBillingWeights are "CPU=1.0,Mem=0.1875G" for a partition with 
> 48 core nodes and RealMemory 187200.
>
> ---
>
> I have two jobs:
>
> job 1:
> #SBATCH --exclusive
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
>
> scontrol show  =>
>NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=48,mem=187200M,node=1,billing=48
>
> exactly, what I expected, I got 48 CPUs and therefore the billing is 48.
>
> ---
>
> job 2 (just added mem-per-cpu):
> #SBATCH --exclusive
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --mem-per-cpu=5000
>
> scontrol show  =>
>NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=2,mem=1M,node=1,billing=2
>
> Why "destroys" '--mem-per-cpu' exclusivity?
>
>
>
> Best
> Marcus
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de

www.itc.rwth-aachen.de






Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Benjamin Coffey
Ya, I saw that it was almost removed before 19.05. I didn't know about the NEWS 
file! Yep, it's right there, mea culpa; I'll check that in the future!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/15/19, 11:08 AM, "slurm-users on behalf of Christopher Samuel" 
 wrote:

On 8/15/19 11:02 AM, Mark Hahn wrote:

> it's in NEWS, if that counts.  also, I note that at least in this commit,
> --chdir is added but --workdir is not removed from option parsing.

It went away here:

commit 9118a41e13c2dfb347c19b607bcce91dae70f8c6
Author: Tim Wickberg 
Date:   Tue Mar 12 23:20:27 2019 -0600

 Remove --workdir option from sbatch. Deprecated since before 17.11.

-- 
   Chris Samuel  :  
http://www.csamuel.org/
  :  Berkeley, CA, USA





Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Benjamin Coffey
Looks like the commit is here:

https://github.com/SchedMD/slurm/commit/fddc98533c1f3753e5e43ad6a16407c5cb8c8de8

Yet, no change log on it. Very frustrating.

Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/14/19, 1:30 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hmm it seems that a job submit plugin fix will not be possible due to the 
attribute being removed from the api

Am I missing something here?

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/14/19, 12:40 PM, "slurm-users on behalf of Christopher Benjamin 
Coffey"  wrote:

Hi,

It seems that --workdir= is no longer a valid option in batch jobs and 
srun in 19.05, and has been replaced by --chdir. I didn't see a change log 
about this, did I miss it? Going through the man pages it seems it hasn't 
existed for some time now actually! Maybe not since before 17.11 series. When 
did this happen?! I guess I'll have to write a jobsubmit rule to overwrite this 
in the meantime till we get users trained differently.

Anyone else notice this? I can't find this mentioned as a bug, or 
anything on the nets.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 







Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-14 Thread Christopher Benjamin Coffey
Hmm it seems that a job submit plugin fix will not be possible due to the 
attribute being removed from the api

Am I missing something here?

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/14/19, 12:40 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hi,

It seems that --workdir= is no longer a valid option in batch jobs and srun 
in 19.05, and has been replaced by --chdir. I didn't see a change log about 
this, did I miss it? Going through the man pages it seems it hasn't existed for 
some time now actually! Maybe not since before 17.11 series. When did this 
happen?! I guess I'll have to write a jobsubmit rule to overwrite this in the 
meantime till we get users trained differently.

Anyone else notice this? I can't find this mentioned as a bug, or anything 
on the nets.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 





[slurm-users] Slurm 19.05 --workdir non existent?

2019-08-14 Thread Christopher Benjamin Coffey
Hi,

It seems that --workdir= is no longer a valid option in batch jobs and srun in 
19.05, and has been replaced by --chdir. I didn't see a change log about this, 
did I miss it? Going through the man pages it seems it hasn't existed for some 
time now actually! Maybe not since before 17.11 series. When did this happen?! 
I guess I'll have to write a jobsubmit rule to overwrite this in the meantime 
till we get users trained differently.

Anyone else notice this? I can't find this mentioned as a bug, or anything on 
the nets.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] 19.05 and GPUs vs GRES

2019-08-13 Thread Christopher Benjamin Coffey
Thanks for that Chris! :)

Sounds like, other than the new GPU-specific request options, things should just 
work when upgrading to 19.05, since slurm is likely backwards compatible with the 
previous gres setup.
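
For anyone else making the same move, my rough understanding of the before/after 
(take it as a sketch until we've actually tested it here):

    # pre-19.05 style (still accepted)
    #SBATCH --gres=gpu:2

    # 19.05 style
    #SBATCH --gpus=2

    # gres.conf on the GPU node, if slurm was built against NVML
    AutoDetect=nvml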

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/12/19, 10:28 PM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On Monday, 12 August 2019 11:42:48 AM PDT Christopher Benjamin Coffey wrote:

> Excuse me if this has been explained somewhere, I did some searching. With
> 19.05, is there any reason to have gres.conf on the GPU nodes? Is slurm
> smart enough to enumerate the /dev/nvidia* devices? We are moving to 19.05
> shortly, any gotchas with GRES and GPUs? Also, I'm guessing now, there is
> no reason for users to request "--gres:gpu" type stuff anymore and instead
> use: --gpus=n ?

We do have 19.05 on our GPU nodes, but I've not had time to experiment with 
the new request syntax just yet.

Regarding configuration it does appear to be that you still need to set 
them 
up, but if you link Slurm against the nvidia NVML library at compile time 
then 
there is support for autodetection.


https://slurm.schedmd.com/gres.html

# In the case of GPUs, if AutoDetect=nvml in gres.conf and the NVML library
# is installed on the node and was present during Slurm configuration, the
# missing configuration details will be automatically gathered using the
# NVML library. Configuration information about all other generic resource
# must explicitly be described in the gres.conf file. 

All the best,
Chris
-- 
  Chris Samuel  :  
http://www.csamuel.org/
  :  Berkeley, CA, USA








[slurm-users] 19.05 and GPUs vs GRES

2019-08-12 Thread Christopher Benjamin Coffey
Hi,

Excuse me if this has been explained somewhere, I did some searching. With 
19.05, is there any reason to have gres.conf on the GPU nodes? Is slurm smart 
enough to enumerate the /dev/nvidia* devices? We are moving to 19.05 shortly, 
any gotchas with GRES and GPUs? Also, I'm guessing now, there is no reason for 
users to request "--gres:gpu" type stuff anymore and instead use: --gpus=n ?

Thank you!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



[slurm-users] Sshare -l segfaults

2019-07-12 Thread Christopher Benjamin Coffey
Hi All,

Has anyone had issues with sshare segfaulting? Specifically with "sshare -l"? 
Any suggestions on how to figure this one out? Maybe there is something obvious 
I'm not seeing. This has been happening for many slurm versions, I can't recall 
when it started. For the last couple versions I've hoped that the bug would be 
patched but it hasn't.

We are currently running slurm version 18.08.7 with Fair_Tree enabled on Centos 
6.10.

See below output from a "sshare -l":

 snipped ...
  dickson  jja81280.5042050.30  
0.98   0.391061   0.51
cpu=0,mem=0,energy=0,node=0,b+ 
  dickson  lz621280.50   00.00  
0.02   0.392924 2.1169e+05
cpu=0,mem=0,energy=0,node=0,b+
*** glibc detected *** sshare: free(): invalid next size (fast): 
0x013512f0 ***
=== Backtrace: =
/lib64/libc.so.6[0x3ddba75e5e]
/lib64/libc.so.6[0x3ddba78cf0]
/usr/lib64/slurm/libslurmfull.so(slurm_xfree+0x1d)[0x3f5a75f65f]
/usr/lib64/slurm/libslurmfull.so(print_fields_double+0x209)[0x3f5a6b1c3b]
sshare(process+0x55c)[0x40266d]
sshare[0x402abb]
sshare(main+0xa1f)[0x4034f8]
/lib64/libc.so.6(__libc_start_main+0x100)[0x3ddba1ed20]
sshare[0x401e99]
=== Memory map: 
0040-00405000 r-xp  fd:04 4710   
/usr/bin/sshare
00605000-00608000 rw-p 5000 fd:04 4710   
/usr/bin/sshare
012ab000-01372000 rw-p  00:00 0  [heap]
3ddb60-3ddb62 r-xp  fd:00 196
/lib64/ld-2.12.so
3ddb82-3ddb821000 r--p 0002 fd:00 196
/lib64/ld-2.12.so
3ddb821000-3ddb822000 rw-p 00021000 fd:00 196
/lib64/ld-2.12.so
3ddb822000-3ddb823000 rw-p  00:00 0 
3ddba0-3ddbb8b000 r-xp  fd:00 306
/lib64/libc-2.12.so
3ddbb8b000-3ddbd8a000 ---p 0018b000 fd:00 306
/lib64/libc-2.12.so
3ddbd8a000-3ddbd8e000 r--p 0018a000 fd:00 306
/lib64/libc-2.12.so
3ddbd8e000-3ddbd9 rw-p 0018e000 fd:00 306
/lib64/libc-2.12.so
3ddbd9-3ddbd94000 rw-p  00:00 0 
3ddbe0-3ddbe83000 r-xp  fd:00 5655   
/lib64/libm-2.12.so
3ddbe83000-3ddc082000 ---p 00083000 fd:00 5655   
/lib64/libm-2.12.so
3ddc082000-3ddc083000 r--p 00082000 fd:00 5655   
/lib64/libm-2.12.so
3ddc083000-3ddc084000 rw-p 00083000 fd:00 5655   
/lib64/libm-2.12.so
3ddc20-3ddc202000 r-xp  fd:00 2425   
/lib64/libdl-2.12.so
3ddc202000-3ddc402000 ---p 2000 fd:00 2425   
/lib64/libdl-2.12.so
3ddc402000-3ddc403000 r--p 2000 fd:00 2425   
/lib64/libdl-2.12.so
3ddc403000-3ddc404000 rw-p 3000 fd:00 2425   
/lib64/libdl-2.12.so
3ddc60-3ddc617000 r-xp  fd:00 1865   
/lib64/libpthread-2.12.so
3ddc617000-3ddc817000 ---p 00017000 fd:00 1865   
/lib64/libpthread-2.12.so
3ddc817000-3ddc818000 r--p 00017000 fd:00 1865   
/lib64/libpthread-2.12.so
3ddc818000-3ddc819000 rw-p 00018000 fd:00 1865   
/lib64/libpthread-2.12.so
3ddc819000-3ddc81d000 rw-p  00:00 0 
3ddca0-3ddca08000 r-xp  fd:04 160337 
/usr/lib64/libhistory.so.6.0
3ddca08000-3ddcc07000 ---p 8000 fd:04 160337 
/usr/lib64/libhistory.so.6.0
3ddcc07000-3ddcc08000 rw-p 7000 fd:04 160337 
/usr/lib64/libhistory.so.6.0
3ddfa0-3ddfa16000 r-xp  fd:00 5664   
/lib64/libgcc_s-4.4.7-20120601.so.1
3ddfa16000-3ddfc15000 ---p 00016000 fd:00 5664   
/lib64/libgcc_s-4.4.7-20120601.so.1
3ddfc15000-3ddfc16000 rw-p 00015000 fd:00 5664   
/lib64/libgcc_s-4.4.7-20120601.so.1
3de260-3de263a000 r-xp  fd:00 3438   
/lib64/libreadline.so.6.0
3de263a000-3de283a000 ---p 0003a000 fd:00 3438   
/lib64/libreadline.so.6.0
3de283a000-3de2842000 rw-p 0003a000 fd:00 3438   
/lib64/libreadline.so.6.0
3de2842000-3de2843000 rw-p  00:00 0 
3de720-3de721d000 r-xp  fd:00 2401   
/lib64/libtinfo.so.5.7
3de721d000-3de741c000 ---p 0001d000 fd:00 2401   
/lib64/libtinfo.so.5.7
3de741c000-3de742 rw-p 0001c000 fd:00 2401   
/lib64/libtinfo.so.5.7
3de742-3de7421000 rw-p  00:00 0 
3de9e0-3de9e22000 r-xp  fd:00 534
/lib64/libncurses.so.5.7
3de9e22000-3dea021000 

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-20 Thread Christopher Benjamin Coffey
Hi Kevin,

We fixed the issue on github. Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/17/19, 8:56 AM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Thanks Kevin, we'll put a fix in for that.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/17/19, 12:04 AM, "Kevin Buckley"  wrote:

On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:

> Feel free to try it out and let us know how it works for you!
> 
> 
https://github.com/nauhpc/job_archive

So Chris,

testing it out quickly, and dirtily, using an sbatch with a here 
document, vis:

$ sbatch -p testq  <

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
Thanks Kevin, we'll put a fix in for that.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/17/19, 12:04 AM, "Kevin Buckley"  wrote:

On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:

> Feel free to try it out and let us know how it works for you!
> 
> 
https://github.com/nauhpc/job_archive

So Chris,

testing it out quickly, and dirtily, using an sbatch with a here document, 
vis:

$ sbatch -p testq  <

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
Hi Lech,

I'm glad that it is working out well with the modifications you've put in 
place! Yes, there can be a huge volume of jobscripts out there. That’s a pretty 
good way of keeping it organized! We've backed up 1.1M jobscripts since the 
archiver's inception 1.5 months ago and aren't too worried yet about the inode/space 
usage. We haven't settled on what we will do to keep the archive clean yet. 
My thought was:

- keep two months (directories) of jobscripts for each user, leaving the 
jobscripts intact for easy user access
- tar up the month directories that are older than two months
- keep four tarred months

That way there would be 6 months of jobscript archive to match our 6 month job 
accounting retention in the slurm db.
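
If it's useful, one way that cleanup could be scripted (the paths and retention 
windows below are illustrative, not something we've settled on):

    #!/bin/bash
    # run monthly from cron against the archive root
    archive=/archive/jobscripts
    cd "$archive" || exit 1
    # tar up per-user month directories older than ~2 months
    find . -mindepth 2 -maxdepth 2 -type d -mtime +62 | while read -r d; do
        tar -czf "$d.tar.gz" "$d" && rm -rf "$d"
    done
    # drop tarballs once they are older than ~6 months
    find . -name '*.tar.gz' -mtime +183 -delete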

I'd be interested in your version however, please do send it along! And please 
keep in touch with how everything goes!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/14/19, 2:22 AM, "slurm-users on behalf of Lech Nieroda" 
 
wrote:

Hello Chris,

we’ve tried out your archiver and adapted it to our needs, it works quite 
well.
The changes:
- we get lots of jobs per day, ca. 3k-5k, so storing them as individual 
files would waste too much inodes and 4k-blocks. Instead everything is written 
into two log files (job_script.log and job_env.log) with the prefix 
„  “ in each line. In this way one can easily grep and 
cut the corresponding job script or environment. Long term storage and 
compression is handled by logrotate, with standard compression settings
- the parsing part can fail to produce a username, thus we have introduced 
a customized environment variable that stores the username and can be read 
directly by the archiver 
- most of the program’s output, including debug output, is handled by the 
logger and stored in a jobarchive.log file with an appropriate timestamp
- the logger uses a va_list to make multi-argument log-oneliners possible
- signal handling is reduced to the debug-level incease/decrease
- file handling is mostly relegated to HelperFn, directory trees are now 
created automatically
- the binary header of the env-file and the binary footer of the 
script-file are filtered, thus the resulting files are recognized as ascii files

If you are interested in our modified version, let me know.

Kind regards,
Lech


> Am 09.05.2019 um 17:37 schrieb Christopher Benjamin Coffey 
:
> 
> Hi All,
> 
> We created a slurm job script archiver which you may find handy. We 
initially attempted to do this through slurm with a slurmctld prolog but it 
really bogged the scheduler down. This new solution is a custom c++ program 
that uses inotify to watch for job scripts and environment files to show up out 
in /var/spool/slurm/hash.* on the head node. When they do, the program copies 
the jobscript and environment out to a local archive directory. The program is 
multithreaded and has a dedicated thread watching each hash directory. The 
program is super-fast and lightweight and has no side effects on the scheduler. 
The program by default will apply ACLs to the archived job scripts so that only 
the owner of the jobscript can read the files. Feel free to try it out and let 
us know how it works for you!
> 
> 
https://github.com/nauhpc/job_archive
> 
> Best,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 






Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Christopher Benjamin Coffey
Hi, you may want to look into increasing the sssd cache length on the nodes, 
and improving the network connectivity to your ldap directory. I recall when 
playing with sssd in the past that it wasn't actually caching. Verify with 
tcpdump, and "ls -l" through a directory. Once the uid/gid is resolved, it 
shouldn't be hitting the directory anymore till the cache expires.
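
A quick way to check whether the cache is actually being hit (the ports are the 
usual LDAP/LDAPS ones; adjust for your environment):

    # on a compute node, watch for LDAP traffic while resolving names
    tcpdump -n -i any port 389 or port 636 &
    ls -l /home
    # with a warm sssd cache, a repeated "ls -l" should generate no LDAP packets;
    # the cache lifetime is entry_cache_timeout in the [domain/...] section of sssd.conf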

Do the nodes NAT through the head node?

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik" 
 
wrote:

Another possible cause (we currently see it on one of our clusters):
delays in ldap lookups.

We have sssd on the machines, and occasionally, when sssd contacts the
ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
answer.  If that happens because slurmctld is trying to look up some
user or group, etc, client commands depending on it will hang.  The
default message timeout is 10 seconds, so if the delay is more than
that, you get the timeout error.

We don't know why the delays are happening, but while we are debugging
it, we've increased the MessageTimeout, which seems to have reduced the
problem a bit.  We're also experimenting with GroupUpdateForce and
GroupUpdateTime to reduce the number of times slurmctld needs to ask
about groups, but I'm unsure how much that helps.

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




[slurm-users] Slurm Jobscript Archiver

2019-05-09 Thread Christopher Benjamin Coffey
Hi All,

We created a slurm job script archiver which you may find handy. We initially 
attempted to do this through slurm with a slurmctld prolog but it really bogged 
the scheduler down. This new solution is a custom c++ program that uses inotify 
to watch for job scripts and environment files to show up out in 
/var/spool/slurm/hash.* on the head node. When they do, the program copies the 
jobscript and environment out to a local archive directory. The program is 
multithreaded and has a dedicated thread watching each hash directory. The 
program is super-fast and lightweight and has no side effects on the scheduler. 
The program by default will apply ACLs to the archived job scripts so that only 
the owner of the jobscript can read the files. Feel free to try it out and let 
us know how it works for you!

https://github.com/nauhpc/job_archive

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Slurm --mail-to setup help

2019-03-11 Thread Christopher Benjamin Coffey
Chad,

Hah! Just reread the man page.

If you use this: 

MailDomain
Domain name to qualify usernames if email address is not explicitly given with 
the "--mail-user" option. If unset, the local MTA will need to qualify local 
address itself.

Shouldn't need to worry about the .forward stuff if you set that to your 
campus/lab email domain. You'd still need most of the postfix config though. 
Don't think I saw that when I set this up a while back.
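
In slurm.conf that would be something like (the domain is a placeholder):

    MailDomain=example.edu
    # mail for jobs without an explicit --mail-user then goes to <username>@example.edu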

Sorry!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 3/11/19, 3:04 PM, "slurm-users on behalf of Julius, Chad" 
 
wrote:

All, 
 
I am new to Slurm and was just wondering if someone has a link or info on 
getting Slurm to send mail to users.  Are you using sendmail, postfix or ???.  
I have been asked to get the --mail-user option working but I am not sure how 
Slurm
 ties into mail.  Does mail have to listen for messages from Slurm?  What 
changes have to be made to slurm.conf? 

 
Any thoughts, help out there?
 
Thanks,
 
Chad 
 





Re: [slurm-users] Slurm --mail-to setup help

2019-03-11 Thread Christopher Benjamin Coffey
Hi Chad,

My memory is a little hazy on how this was setup but ...

man slurm.conf
MailProg
Fully qualified pathname to the program used to send email per user request. 
The default value is "/bin/mail" (or "/usr/bin/mail" if "/bin/mail" does not 
exist but "/usr/bin/mail" does exist).

Slurm is calling /bin/mail which is mailx on our system and sending email 
addressed to slurmu...@headnode.fqdn to the mail server running locally on the 
system. We are using postfix. The postfix server then delivers the email 
locally to users since that’s how its addressed. We have a .forward in each 
user's /home which points to their real email address that is configured when 
the user was created on our system. The .forward file is just a one liner with 
the email address of the user. The email is then forwarded to the users real 
email address on a mail server on campus. 

Our postfix conf is pretty default, but with a few extras that I can't recall 
if they were essential:

inet_interfaces # interface to listen on, likely just localhost
inet_protocols  # possibly just ipv4
relayhost   # your relay smtp server on campus 
myhostname  # your head node's fqdn
mydestination   # domains to accept mail from, likely just the default
relay_domains   # your local domain,  don't recall if that was needed
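
As a hedged example of what I mean (every hostname/domain below is a placeholder, 
not our real config):

    # /etc/postfix/main.cf (fragment)
    inet_interfaces = localhost
    inet_protocols  = ipv4
    myhostname      = head.example.edu
    relayhost       = [smtp.example.edu]
    mydestination   = $myhostname, localhost.$mydomain, localhost

    # ~/.forward for each user (a single line)
    abc123@example.edu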

Hope that helps! Again, my memory is a little hazy on how its setup. Someone 
with a fresher mail setup config may have better details.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 3/11/19, 3:04 PM, "slurm-users on behalf of Julius, Chad" 
 
wrote:

All, 
 
I am new to Slurm and was just wondering if someone has a link or info on 
getting Slurm to send mail to users.  Are you using sendmail, postfix or ???.  
I have been asked to get the --mail-user option working but I am not sure how 
Slurm
 ties into mail.  Does mail have to listen for messages from Slurm?  What 
changes have to be made to slurm.conf? 

 
Any thoughts, help out there?
 
Thanks,
 
Chad 
 





Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-03-04 Thread Christopher Benjamin Coffey
You are welcome Loris!

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 2/26/19, 8:16 AM, "slurm-users on behalf of Loris Bennett" 
 
wrote:

Hi Chris,

I had

  JobAcctGatherType=jobacct_gather/linux
  TaskPlugin=task/affinity
  ProctrackType=proctrack/cgroup

ProctrackType was actually unset but cgroup is the default.

I have now changed the settings to 

  JobAcctGatherType=jobacct_gather/cgroup
  TaskPlugin=task/affinity,task/cgroup
  ProctrackType=proctrack/cgroup

and added 

  TaskAffinity=no
  ConstrainCores=yes
  ConstrainRAMSpace=yes

For at least one job this gives me the following for a running job:

  $ seff -d 4896
  Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes 
Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
  Slurm data: 4896  loris sc RUNNING curta 8 2 2 2097152 0 0 33 
3.6028797018964e+16 0

  Job ID: 4896
  Cluster: curta
  User/Group: loris/sc
  State: RUNNING
  Nodes: 2
  Cores per node: 4
  CPU Utilized: 00:00:00
  CPU Efficiency: 0.00% of 00:04:24 core-walltime
  Job Wall-clock time: 00:00:33
  Memory Utilized: 32.00 EB (estimated maximum)
  Memory Efficiency: 1717986918400.00% of 2.00 GB (256.00 MB/core)
  WARNING: Efficiency statistics may be misleading for RUNNING jobs.

and this at completion:

  $ seff -d 4896
  Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes 
Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
  Slurm data: 4896  loris sc COMPLETED curta 8 2 2 2097152 0 0 61 59400 0

  Job ID: 4896
  Cluster: curta
  User/Group: loris/sc
  State: COMPLETED (exit code 0)
  Nodes: 2
  Cores per node: 4
  CPU Utilized: 00:00:00
  CPU Efficiency: 0.00% of 00:08:08 core-walltime
  Job Wall-clock time: 00:01:01
  Memory Utilized: 58.01 MB (estimated maximum)
  Memory Efficiency: 2.83% of 2.00 GB (256.00 MB/core)

which looks good.  I'll see how it goes with longer running job.

Thanks for the input,

Loris

Christopher Benjamin Coffey  writes:

> Hi Loris,
>
> Odd, we never saw that issue with memory efficiency being out of whack, 
just the cpu efficiency. We are running 18.08.5-2 and here is a 512 core job 
run last night:
>
> Job ID: 18096693
> Array Job ID: 18096693_5
> Cluster: monsoon
> User/Group: abc123/cluster
> State: COMPLETED (exit code 0)
> Nodes: 60
> Cores per node: 8
> CPU Utilized: 01:34:06
> CPU Efficiency: 58.04% of 02:42:08 core-walltime
> Job Wall-clock time: 00:00:19
> Memory Utilized: 36.04 GB (estimated maximum)
> Memory Efficiency: 30.76% of 117.19 GB (1.95 GB/node
>
> What job collection, task, and proc track plugin are you using I'm 
curious? We are using:
>
> JobAcctGatherType=jobacct_gather/cgroup
> TaskPlugin=task/cgroup,task/affinity
> ProctrackType=proctrack/cgroup
>
> Also cgroup.conf:
>
> ConstrainCores=yes
> ConstrainRAMSpace=yes
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
>
> On 2/26/19, 2:15 AM, "slurm-users on behalf of Loris Bennett" 
 
wrote:
>
> Hi,
> 
> With seff 18.08.5-2 we have been getting spurious results regarding
> memory usage:
> 
>   $ seff 1230_27
>   Job ID: 1234
>   Array Job ID: 1230_27
>   Cluster: curta
>   User/Group: x/x
>   State: COMPLETED (exit code 0)
>   Nodes: 4
>   Cores per node: 25
>   CPU Utilized: 9-16:49:18
>   CPU Efficiency: 30.90% of 31-09:35:00 core-walltime
>   Job Wall-clock time: 07:32:09
>   Memory Utilized: 48.00 EB (estimated maximum)
>   Memory Efficiency: 26388279066.62% of 195.31 GB (1.95 GB/core)
> 
> It seems that the more cores are involved the worse the overcalulation
> is, but not linearly.
> 
> Has anyone else seen this?
> 
> Cheers,
> 
> Loris
> 
> -- 
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email 
loris.benn...@fu-berlin.de
> 
> 
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de





Re: [slurm-users] TotalCPU: sacct reporting inexplicable high values

2019-02-01 Thread Christopher Benjamin Coffey
Nico, yep that’s a very annoying bug as we do the same here with job 
efficiency. It was patched in 18.08.05. However the db still needs to be 
cleaned up. We are working on a script to fix this. When we are done, we'll 
offer it up to the list.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 2/1/19, 8:47 AM, "slurm-users on behalf of nico.faer...@id.unibe.ch" 
 
wrote:

Hi


While doing some statistics on efficient CPU usage, I realized that sacct 
is reporting inexplicable (at least for me) high values for TotalCPU, UserCPU 
and SystemCPU. Here is a simple example (each job step is a infinite while 
loop): 


sacct -j 64338003 
--format=jobid,elapsed,ncpus,cputime,totalcpu,usercpu,systemcpu,nodelist
   JobIDElapsed  NCPUSCPUTime   TotalCPUUserCPU  
SystemCPUNodeList
 -- -- -- -- -- 
-- ---
64338003   00:02:29   4  00:09:5613:19:41 13:19:36  
  00:05.054  anode033
64338003.ba+   00:02:314  00:10:0400:09.01700:04.003  
00:05.014  anode033
64338003.ex+   00:02:304  00:10:0000:00.00100:00:00
00:00.001  anode033
64338003.0 00:02:32  1  00:02:3203:19:52 
03:19:5200:00.013  anode033
64338003.1 00:02:32  1  00:02:3203:19:54 
03:19:5400:00.008  anode033
64338003.2 00:02:32  1  00:02:3203:19:53 03:19:53   
 00:00.010  anode033
64338003.3 00:02:32  1  00:02:3203:19:52 
03:19:5200:00.007  anode033


I would expect CPUTime to be the upper limit for TotalCPU.


Looking at cpuacct.stat for job step3:


cat /cgroup/cpuacct/slurm/uid_6994/job_64338003/step_3/cpuacct.stat
user 14902   (~149 = 00:02:29)  
system 0


This value corresponds to the expected CPU usage of a single job step.


We are running Slurm 18.08.4 with
JobAcctGatherType=jobacct_gather/cgroup



Does anyone have an explanation for those high values reported by sacct?





Best,
Nico


Universitaet BernAbt. Informatikdienste

Nico Färber
High Performance Computing


Gesellschaftsstrasse 6
CH-3012 Bern
Raum 104
Tel. +41 (0)31 631 51 89










Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
To be more clear, the jobs aren't starting due to the group being at their 
limit, which is normal. But slurm is spamming that error to the log file for 
every job that is at a particular GrpTRESRunLimit which is not normal. 

Other than the log being littered with incorrect error messages, things appear 
to work as normal.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/31/19, 9:23 AM, "slurm-users on behalf of Christopher Samuel" 
 wrote:

On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote:

> This seems to be related to jobs that can't start due to in our case:
> 
> AssocGrpMemRunMinutes, and AssocGrpCPURunMinutesLimit
> 
> Must be a bug relating to GrpTRESRunLimit it seems.

Do you mean can't start due to not enough time, or can't start due to a 
bug related to those limits?

That doesn't sound good..

cheers,
Chris





Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Hi All,

This seems to be related to jobs that can't start due to in our case: 

AssocGrpMemRunMinutes, and AssocGrpCPURunMinutesLimit

Must be a bug relating to GrpTRESRunLimit it seems.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/31/19, 8:30 AM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the 
slurmctld logs:

[2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() 
for JobId=16599048 with not NULL job resources
[2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() 
for JobId=16597556 with not NULL job resources
[2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() 
for JobId=16597557 with not NULL job resources

Any ideas what this is about? It doesn't make sense to me. This is how job 
16597557 looks;

JobId=16597576 JobName=cred10_5_5_ci2_eu3_eu4_ciK_a
   UserId=abc123(3760) GroupId=cluster(3301) MCS_label=N/A
   Priority=123577 Nice=0 Account=afghah QOS=prof1
   JobState=PENDING Reason=AssocGrpMemRunMinutes Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-12:00:00 TimeMin=N/A
   SubmitTime=2019-01-30T17:17:53 EligibleTime=2019-01-30T17:19:54
   AccrueTime=2019-01-30T17:19:54
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-01-31T08:26:00
   Partition=all AllocNode:Sid=wind:43691
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn14
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=18400M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=18400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/abc123/class/credit_V2_ch_12/10_5_5_ci2_eu3_eu4_ciK_a.sh
   WorkDir=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/
   
StdErr=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/output.txt
   StdIn=/dev/null
   
StdOut=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/output.txt
   Power=

Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 





[slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the 
slurmctld logs:

[2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() for 
JobId=16599048 with not NULL job resources
[2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() for 
JobId=16597556 with not NULL job resources
[2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() for 
JobId=16597557 with not NULL job resources

Any ideas what this is about? It doesn't make sense to me. This is how job 
16597557 looks;

JobId=16597576 JobName=cred10_5_5_ci2_eu3_eu4_ciK_a
   UserId=abc123(3760) GroupId=cluster(3301) MCS_label=N/A
   Priority=123577 Nice=0 Account=afghah QOS=prof1
   JobState=PENDING Reason=AssocGrpMemRunMinutes Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-12:00:00 TimeMin=N/A
   SubmitTime=2019-01-30T17:17:53 EligibleTime=2019-01-30T17:19:54
   AccrueTime=2019-01-30T17:19:54
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-01-31T08:26:00
   Partition=all AllocNode:Sid=wind:43691
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn14
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=18400M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=18400M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/abc123/class/credit_V2_ch_12/10_5_5_ci2_eu3_eu4_ciK_a.sh
   WorkDir=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/
   StdErr=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/output.txt
   StdIn=/dev/null
   StdOut=/scratch/abc123/class/credit_V2_12/10_5_5_ci2_eu3_eu4_ciK_a/output.txt
   Power=

Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-14 Thread Christopher Benjamin Coffey
Hi David,

You are welcome. I'm surprised that srun does not work for you. We advise our 
users to use srun on every type of job, not just MPI. This in our opinion keeps 
it simple, and it just works. What is your MpiDefault set to in slurm.conf? Is 
your openmpi built with slurm support? I believe it’s the default, so it 
should. As you probably know, mpi implementations have to be recompiled when 
slurm is upgraded between major versions. FWIW, this is how we have openmpi 
configured on our cluster:

./configure --prefix=/packages/openmpi/3.1.3-gcc-6.2.0 
--with-ucx=/packages/ucx/1.4.0 --with-slurm --with-verbs --with-lustre 
--with-pmi

What happens when srun "doesn't work"?
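
A couple of generic things worth checking on your side (nothing site-specific here):

    # what the cluster-wide default MPI plugin is
    scontrol show config | grep -i MpiDefault

    # inside a job script or allocation, a plain launch
    srun -n "$SLURM_NTASKS" ./mpi_hello

    # or force a PMI type if the default is "none"
    srun --mpi=pmi2 -n "$SLURM_NTASKS" ./mpi_hello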

I'm unaware of a way to suppress CR_Pack_Nodes in jobs.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/11/19, 10:07 AM, "slurm-users on behalf of Baker D.J." 
 
wrote:

Hi Chris,


Thank you for your comments. Yesterday I experimented with increasing the 
PriorityWeightJobSize and that does appear to have quite a profound effect on 
the job mix
 executing at any one time. Larger jobs (needing 5 nodes or above) are now 
getting a decent share of the nodes in the cluster. I've been running test jobs 
in between other bits of work and things are looking much better. I expected 
the change to be a little
 too aggressive, but the job mix is now very good overall. 


Thank you for your suggested changes to the slurm.conf...


SelectTypeParameters=CR_Pack_Nodes
SchedulerParameters=pack_serial_at_end,bf_busy_nodes


I especially like the idea of using "CR_Pack_Nodes" since the same node 
packing policy is in operation on our Moab cluster. On the other hand we advise 
launching OpenMPI jobs using mpirun (it does work and it does detect the 
resources requested in the job).
 In fact despite installing OpenMPI with the pmi devices srun does not work 
for some reason! If you use mpirun, do you know if there is a valid way for 
users to suppress  CR_Pack_Nodes  in their jobs?


Best regards,
David


From: slurm-users  on behalf of 
Skouson, Gary 
Sent: 11 January 2019 16:53
To: Slurm User Community List
Subject: Re: [slurm-users] Larger jobs tend to get starved out on our 
cluster 

You should be able to turn on some backfill debug info from slurmctl, You 
can have slurm output the backfill info.  Take a look at DebugFlags settings 
using
 Backfill and BackfillMap.
 
Your bf_window is set to 3600 or 2.5 days, if the start time of the large 
job is out further than that, it won’t get any nodes reserved.

You may also want to take a look at the bf_window_linear parameter.  By 
default the backfill window search starts at 30 seconds and doubles at each 
iteration. 
 Thus jobs that will need to wait a couple of days to gather the required 
resources will have a resolution in the backfill reservation that’s more than a 
day wide.  Even if nodes will be available 2 days from now, the “reservation” 
may be out 3 days, allowing
 2-day jobs to sneak in before the large job.  The result is that small 
jobs that last 1-2 days can delay the start of a large job for weeks.
 
You can turn on bf_window_linear and it’ll keep that from happening.  
Unfortunately, it means that there are more backfill iterations required to 
search out
 multiple days into the future.   If you have relatively few jobs, that may 
not matter.  If you have lots of jobs, it’ll slow things down a bit.  You’ll 
have to do some testing to see if that’ll work for you.
 
-
Gary Skouson
 
 
From: slurm-users 
On Behalf Of Baker D.J.
Sent: Wednesday, January 9, 2019 11:40 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Larger jobs tend to get starved out on our cluster


 
Hello,

 

A colleague intimated that he thought that larger jobs were tending to get 
starved out on our slurm cluster. It's not a busy time at the moment so it's 
difficult to test this
 properly. Back in November it was not completely unusual for a larger job 
to have to wait up to a week to start. 

 

I've extracted the key scheduling configuration out of the slurm.conf and I 
would appreciate your comments, please. Even at the busiest of times we notice 
many single compute
 jobs executing on the cluster -- starting either via the scheduler or by 
backfill.

 

Looking at the scheduling configuration do you think that I'm favouring 
small jobs too much? That is, for example, should I increase the 
PriorityWeightJobSize to encourage larger
 jobs to run? 

 

I was very keen not to starve out small/medium jobs, however perhaps there 
is too much emphasis on small/medium jobs in our setup. 

 

My 

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-10 Thread Christopher Benjamin Coffey
We've attempted setting JobAcctGatherFrequency=task=0 and there is no change. 
We have settings:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Odd ... wonder why we don't see it help.

Here is how we verify:

===
#!/bin/bash
#SBATCH --job-name=lazy # the name of your job
#SBATCH --output=/scratch/blah/lazy.txt# this is the file your output and 
errors go to
#SBATCH --time=20:00# max time 
#SBATCH --workdir=/scratch/blah# your work directory
#SBATCH --mem=7000 # total mem
#SBATCH -c4 # 4 cpus

# use 500MB of memory and 1 cpu thread 
#srun stress -m 1 --vm-bytes 500M --timeout 65s

# use 500MB of memory and 3 cpu threads, 1 memory thread
srun stress -c 3 -m 1 --vm-bytes 500M --timeout 65s
===

Still have jobs with usercpu way too high.

[cbc@head-dev ~ ]$ jobstats
JobID JobName   ReqMemMaxRSSReqCPUS   UserCPU Timelimit   
ElapsedState   JobEff  
=
7957  lazy  9.77G 0.0M  4 00:00:0000:20:00
00:00:00   FAILED  -  
7958  lazy  6.84G 0.0M  4 00:00.018   00:20:00
00:00:01   FAILED  -  
7959  lazy  6.84G 480M  4 01:51.269   00:20:00
00:01:06   COMPLETED   18.17   
7960  lazy  6.84G 499M  4 02:01.275   00:20:00
00:01:06   COMPLETED   19.53   
7961  lazy  6.84G 499M  4 01:55.259   00:20:00
00:01:06   COMPLETED   18.76   
7962  lazy  6.84G 499M  4 01:58.307   00:20:00
00:01:06   COMPLETED   19.15   
7963  lazy  6.84G 491M  4 02:01.267   00:20:00
00:01:06   COMPLETED   19.49   
7964  lazy  6.84G 499M  4 02:01.270   00:20:00
00:01:05   COMPLETED   19.73   
7965  lazy  6.84G 500M  4 02:04.336   00:20:00
00:01:05   COMPLETED   20.13   
7966  lazy  6.84G 468M  4 04:58:5600:20:00
00:01:05   COMPLETED   2303.53   
7967  lazy  6.84G 464M  4 04:40:3900:20:00
00:01:05   COMPLETED   2162.87   
7968  lazy  6.84G 440M  4 05:20:2200:20:00
00:01:05   COMPLETED   2468.26   
7969  lazy  6.84G 500M  4 05:14:3700:20:00
00:01:05   COMPLETED   2424.32   
7970  lazy  6.84G 278M  4 02:56:3900:20:00
00:01:06   COMPLETED   1341.42   
7971  lazy  6.84G 265M  4 02:57:1800:20:00
00:01:06   COMPLETED   1346.28   
7972  lazy  6.84G 500M  4 02:54:3800:20:00
00:01:06   COMPLETED   1327.2   
7973  lazy  6.84G 426M  4 02:29:5000:20:00
00:01:06   COMPLETED   1138.96   
=

Requested Memory: 06.49%
Requested Cores : 2906.81%
Time Limit  : 05.47%

Efficiency Score: 972.92



—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/9/19, 7:24 AM, "slurm-users on behalf of Paddy Doyle" 
 wrote:

On Wed, Jan 09, 2019 at 12:44:03PM +0100, Bj?rn-Helge Mevik wrote:

> Paddy Doyle  writes:
> 
> > Looking back through the mailing list, it seems that from 2015 onwards 
the
> > recommendation from Danny was to use 'jobacct_gather/linux' instead of
> > 'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept 
with
> > the cgroup version.
> >
> > Is anyone else still using jobacct_gather/cgroup and are you seeing this
> > same issue?
> 
> Just a side note: In last year's SLUG, Tim recommended the following
> settings:
> 
> proctrack/cgroup, task/cgroup, jobacct_gather/cgroup
> 
> So the recommendation for jobacct_gather might have changed -- or Danny
> and Tim might just have different opinions. :)

Interesting... the cgroups documentation page still says the performance of
jobacct_gather/cgroup is worse than jobacct_gather/linux. Although
according to the git commits of doc/html/cgroups.shtml, that was added to
the page in Jan 2015, so yeah maybe things have changed again. :)


https://slurm.schedmd.com/cgroups.html

In that case, either set 'JobAcctGatherFrequency=task=0' or wait for the

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-10 Thread Christopher Benjamin Coffey
Hi D.J.,

I noticed you have:

PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE

I'm pretty sure it does not make sense to have depth oblivious and fair tree 
set at the same time. You'll want to choose one of them. That’s not going to be 
the reason for the issue, however, but you are likely not running the fairshare 
algorithm that was intended.


"My colleague is from a Moab background, and in that respect he was
> surprised not to see nodes being reserved for jobs, but it could be
> that Slurm works in a different way to try to make efficient use of
> the cluster by backfilling more aggressively than Moab."

Slurm unfortunately does not indicate when nodes are being put aside for large 
jobs. I wish that it did. Nodes will instead be in "idle" state when prepping 
for a large job.

To increase the possibility of more whole nodes being available for large MPI 
jobs to get them started faster, you might consider the following parameters:

SelectTypeParameters=CR_Pack_Nodes

And 

SchedulerParameters=pack_serial_at_end,bf_busy_nodes

Also, as Loris pointed out, bf_window will need to be set to the max wall time 
in minutes.
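
Putting those together, a sketch of the relevant slurm.conf lines (the bf_window 
value assumes a 7-day maximum walltime; adjust it to yours):

    SelectTypeParameters=CR_Core,CR_Pack_Nodes
    SchedulerParameters=pack_serial_at_end,bf_busy_nodes,bf_window=10080,bf_resolution=180,bf_max_job_user=4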

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/9/19, 11:52 PM, "slurm-users on behalf of Loris Bennett" 
 
wrote:

Hi David,

If your maximum run-time is more than the 2 1/2 days (3600 minutes) you
have set for bf_window, you might need to increase bf_window
accordingly.   See the description here:


https://slurm.schedmd.com/sched_config.html

Cheers,

Loris

Baker D.J.  writes:

> Hello,
>
> A colleague intimated that he thought that larger jobs were tending to
> get starved out on our slurm cluster. It's not a busy time at the
> moment so it's difficult to test this properly. Back in November it
> was not completely unusual for a larger job to have to wait up to a
> week to start.
>
> I've extracted the key scheduling configuration out of the slurm.conf
> and I would appreciate your comments, please. Even at the busiest of
> times we notice many single compute jobs executing on the cluster --
> starting either via the scheduler or by backfill.
>
> Looking at the scheduling configuration do you think that I'm
> favouring small jobs too much? That is, for example, should I increase
> the PriorityWeightJobSize to encourage larger jobs to run?
>
> I was very keen not to starve out small/medium jobs, however perhaps
> there is too much emphasis on small/medium jobs in our setup.
>
> My colleague is from a Moab background, and in that respect he was
> surprised not to see nodes being reserved for jobs, but it could be
> that Slurm works in a different way to try to make efficient use of
> the cluster by backfilling more aggressively than Moab. Certainly we
> see a great deal of activity from backfill.
>
> In this respect does anyone understand the mechanism used to reserve
> nodes/resources for jobs in slurm or potentially where to look for
> that type of information.
>
> Best regards,
> David
>
> SchedulerType=sched/backfill
> SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> PriorityFavorSmall=NO
> PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
>
> PriorityWeightFairshare=100
> PriorityWeightAge=10
> PriorityWeightPartition=0
> PriorityWeightJobSize=10
> PriorityWeightQOS=1
> PriorityMaxAge=7-0
>
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de





Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu values

2019-01-09 Thread Christopher Benjamin Coffey
Thanks... looks like the bug should get some attention now that a paying site 
is complaining:

https://bugs.schedmd.com/show_bug.cgi?id=6332

Thanks Jurij!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/9/19, 7:24 AM, "slurm-users on behalf of Paddy Doyle" 
 wrote:

On Wed, Jan 09, 2019 at 12:44:03PM +0100, Bj?rn-Helge Mevik wrote:

> Paddy Doyle  writes:
> 
> > Looking back through the mailing list, it seems that from 2015 onwards 
the
> > recommendation from Danny was to use 'jobacct_gather/linux' instead of
> > 'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept 
with
> > the cgroup version.
> >
> > Is anyone else still using jobacct_gather/cgroup and are you seeing this
> > same issue?
> 
> Just a side note: In last year's SLUG, Tim recommended the following
> settings:
> 
> proctrack/cgroup, task/cgroup, jobacct_gather/cgroup
> 
> So the recommendation for jobacct_gather might have changed -- or Danny
> and Tim might just have different opinions. :)

Interesting... the cgroups documentation page still says the performance of
jobacct_gather/cgroup is worse than jobacct_gather/linux. Although
according to the git commits of doc/html/cgroups.shtml, that was added to
the page in Jan 2015, so yeah maybe things have changed again. :)


https://slurm.schedmd.com/cgroups.html

In that case, either set 'JobAcctGatherFrequency=task=0' or wait for the
bug to be fixed.

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725

http://www.tchpc.tcd.ie/





Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-08 Thread Christopher Benjamin Coffey
" Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version."

Ahh, hmm I need to dig up that recommendation as I didn't see that myself. 
We'll look into this.

Thanks Paddy!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/8/19, 8:04 AM, "slurm-users on behalf of Paddy Doyle" 
 wrote:

A small addition: I forgot to mention our JobAcct params:

JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/cgroup

I've done a small bit of playing around on a test cluster. Changing to
'JobAcctGatherFrequency=0' (i.e. only gather at job end) seems to then give
correct values for the job via sacct/seff.

Alternatively, setting the following also works:

JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux

Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version.

Is anyone else still using jobacct_gather/cgroup and are you seeing this
same issue?

Just to note: there's a big warning in the man page not to adjust the
value of JobAcctGatherType while there are any running job steps. I'm not
sure if that means just on that node, or any jobs. Probably safest to
schedule a downtime to change it.

Paddy

On Fri, Jan 04, 2019 at 10:43:54PM +, Christopher Benjamin Coffey wrote:

> Actually we double checked and are seeing it in normal jobs too.
> 
> ???
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
> 
> ???On 1/4/19, 9:24 AM, "slurm-users on behalf of Paddy Doyle" 
 wrote:
> 
> Hi Chris,
> 
> We're seeing it on 18.08.3, so I was hoping that it was fixed in 
18.08.4
> (recently upgraded from 17.02 to 18.08.3). Note that we're seeing it 
in
> regular jobs (haven't tested job arrays).
> 
> I think it's cgroups-related; there's a similar bug here:
> 
> https://bugs.schedmd.com/show_bug.cgi?id=6095
> 
> I was hoping that this note in the 18.08.4 NEWS might have been 
related:
> 
> -- Fix jobacct_gather/cgroup to work correctly when more than one 
task is
>started on a node.
    > 
    >     Thanks,
> Paddy
> 
> On Fri, Jan 04, 2019 at 03:19:18PM +, Christopher Benjamin Coffey 
wrote:
> 
> > I'm surprised no one else is seeing this issue? I wonder if you 
have 18.08 you can take a moment and run jobeff on a job in one of your users 
job arrays. I'm guessing jobeff will show the same issue as we are seeing. The 
issue is that usercpu is incorrect, and off by many orders of magnitude.
> > 
> > Best,
> > Chris
> > 
> > ???
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >  
> > 
> > ???On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" 
 wrote:
> > 
> > So this issue is occurring only with job arrays.
> > 
> > ???
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >  
> > 
> > On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce 
Carl Nelson"  wrote:
> > 
> > Hi folks,
> > 
> > 
> > calling sacct with the usercpu flag enabled seems to 
provide cpu times far above expected values for job array indices. This is also 
reported by seff. For example, executing the following job script:
> > 
> > 
> > 
> > #!/b

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-04 Thread Christopher Benjamin Coffey
Actually we double checked and are seeing it in normal jobs too.

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/4/19, 9:24 AM, "slurm-users on behalf of Paddy Doyle" 
 wrote:

Hi Chris,

We're seeing it on 18.08.3, so I was hoping that it was fixed in 18.08.4
(recently upgraded from 17.02 to 18.08.3). Note that we're seeing it in
regular jobs (haven't tested job arrays).

I think it's cgroups-related; there's a similar bug here:


https://bugs.schedmd.com/show_bug.cgi?id=6095

I was hoping that this note in the 18.08.4 NEWS might have been related:

-- Fix jobacct_gather/cgroup to work correctly when more than one task is
   started on a node.

Thanks,
Paddy

On Fri, Jan 04, 2019 at 03:19:18PM +, Christopher Benjamin Coffey wrote:

> I'm surprised no one else is seeing this issue? I wonder if you have 
18.08 you can take a moment and run jobeff on a job in one of your users job 
arrays. I'm guessing jobeff will show the same issue as we are seeing. The 
issue is that usercpu is incorrect, and off by many orders of magnitude.
> 
> Best,
> Chris
> 
> ???
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
> 
> ???On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" 
 wrote:
> 
> So this issue is occurring only with job arrays.
> 
> ???
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
> 
> On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl 
Nelson"  wrote:
> 
> Hi folks,
> 
> 
> calling sacct with the usercpu flag enabled seems to provide cpu 
times far above expected values for job array indices. This is also reported by 
seff. For example, executing the following job script:
> 
> 
> 
> #!/bin/bash
> #SBATCH --job-name=array_test   
> #SBATCH --workdir=/scratch/cbn35/bigdata  
> #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
> #SBATCH --time=20:00  
> #SBATCH --array=1-5
> #SBATCH -c2
> 
> 
> srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
> 
> 
> 
> 
> 
> 
> ...results in the following stats:
> 
> 
> 
> 
>JobID  ReqCPUSUserCPU  TimelimitElapsed 
>   -- -- -- 
> 15730924_5  2   02:30:14   00:20:00   00:01:08 
> 15730924_5.+2  00:00.004  00:01:08 
> 15730924_5.+2   00:00:00  00:01:09 
> 15730924_5.02   02:30:14  00:01:05 
> 15730924_1  2   02:30:48   00:20:00   00:01:08 
> 15730924_1.+2  00:00.013  00:01:08 
> 15730924_1.+2   00:00:00  00:01:09 
> 15730924_1.02   02:30:48  00:01:05 
> 15730924_2  2   02:15:52   00:20:00   00:01:07 
> 15730924_2.+2  00:00.007  00:01:07 
> 15730924_2.+2   00:00:00  00:01:07 
> 15730924_2.02   02:15:52  00:01:06 
> 15730924_3  2   02:30:20   00:20:00   00:01:08 
> 15730924_3.+2  00:00.010  00:01:08 
> 15730924_3.+2   00:00:00  00:01:09 
> 15730924_3.02   02:30:20  00:01:05 
> 15730924_4  2   02:30:26   00:20:00   00:01:08 
> 15730924_4.+2  00:00.006  00:01:08 
> 15730924_4.+2   00:00:00  00:01:09 
> 15730924_4.0   

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2018-12-21 Thread Christopher Benjamin Coffey
So this issue is occurring only with job arrays.

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson" 
 
wrote:

Hi folks,


calling sacct with the usercpu flag enabled seems to provide cpu times far 
above expected values for job array indices. This is also reported by seff. For 
example, executing the following job script:



#!/bin/bash
#SBATCH --job-name=array_test   
#SBATCH --workdir=/scratch/cbn35/bigdata  
#SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
#SBATCH --time=20:00  
#SBATCH --array=1-5
#SBATCH -c2


srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s






...results in the following stats:




   JobID  ReqCPUSUserCPU  TimelimitElapsed 
  -- -- -- 
15730924_5  2   02:30:14   00:20:00   00:01:08 
15730924_5.+2  00:00.004  00:01:08 
15730924_5.+2   00:00:00  00:01:09 
15730924_5.02   02:30:14  00:01:05 
15730924_1  2   02:30:48   00:20:00   00:01:08 
15730924_1.+2  00:00.013  00:01:08 
15730924_1.+2   00:00:00  00:01:09 
15730924_1.02   02:30:48  00:01:05 
15730924_2  2   02:15:52   00:20:00   00:01:07 
15730924_2.+2  00:00.007  00:01:07 
15730924_2.+2   00:00:00  00:01:07 
15730924_2.02   02:15:52  00:01:06 
15730924_3  2   02:30:20   00:20:00   00:01:08 
15730924_3.+2  00:00.010  00:01:08 
15730924_3.+2   00:00:00  00:01:09 
15730924_3.02   02:30:20  00:01:05 
15730924_4  2   02:30:26   00:20:00   00:01:08 
15730924_4.+2  00:00.006  00:01:08 
15730924_4.+2   00:00:00  00:01:09 
15730924_4.02   02:30:25  00:01:05 






This is also reported by seff, with several errors to boot:




Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 
130,  line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 
130,  line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 
130,  line 624.
Job ID: 15730924
Array Job ID: 15730924_5
Cluster: monsoon
User/Group: cbn35/clusterstu
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 03:19:15
CPU Efficiency: 8790.44% of 00:02:16 core-walltime
Job Wall-clock time: 00:01:08
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)









As far as I can tell, I don't think a two core job with an elapsed time of 
around one minute would have a cpu time of two hours. Could this be a 
configuration issue, or is it a possible bug? 


More info is available on request, and any help is appreciated!








[slurm-users] Slurm mysql 8.0

2018-12-14 Thread Christopher Benjamin Coffey
Hi Guys,

It appears that slurm currently doesn't support mysql 8.0. After upgrading from 
5.7 to 8.0 slurm commands that hit the db result in:

sacct: error: slurmdbd: "Unknown error 1064"

This is at least true for version 17.11.12. I wonder what the plan is for 
slurm to support mariadb, and mysql when they diverge in compatibility at mysql 
8.0?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-04 Thread Christopher Benjamin Coffey
Interesting! I'll have a look - thanks!

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 11/30/18, 1:41 AM, "slurm-users on behalf of John Hearns" 
 
wrote:

Chris, I have delved deep into the OOM killer code and interaction with 
cpusets in the past (*).
That experience is not really relevant!
However I always recommend looking at this sysctl parameter   
min_free_kbytes

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-tunables
 

I think of this as the 'wriggle room' the system has when it is playing 
Tetris with memory pages.
In the past the min_free_kbytes value was by default ridiculously low. 
Look at what value you have and consider increasing it by a big factor.
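
For instance (the 1 GB figure below is purely a hypothetical starting point; size 
it to the node's RAM):

# check the current value (in kB)
sysctl vm.min_free_kbytes
# raise it for the running system
sysctl -w vm.min_free_kbytes=1048576
# make the change persistent across reboots
echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.conf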




(*) A long time ago, when managing a large NUMA machine, I had an 
application which would have a memory leak and trigger the OOM killer.
The application was being run in a cgroup with cpuset and memory location 
limits. Once the OOM killer kicked in I saw that the system went 'off the air' 
for two minutes, which was quite serious. It was finally traced to the OOM killer 
inspecting every page in memory on the system, independent of the cgroup, to see 
if it had been touched by this process.


On Fri, 30 Nov 2018 at 09:31, Ole Holm Nielsen  
wrote:


On 29-11-2018 19:27, Christopher Benjamin Coffey wrote:
> We've been noticing an issue with nodes from time to time that become 
"wedged", or unusable. This is a state where ps, and w hang. We've been looking 
into this for a while when we get time and finally put some more effort into it 
yesterday. We came across
 this blog which describes almost the exact scenario:
> 
> 
https://rachelbythebay.com/w/2014/10/27/ps/ 
> 
> It has nothing to do with Slurm, but it does have to do with cgroups 
which we have enabled. It appears that processes that have hit their ceiling 
for memory and should be killed by oom-killer, and are in D state at the same 
time, cause the system to become
 wedged. For each node wedged, I've found a job out in:
> 
> /cgroup/memory/slurm/uid_3665/job_15363106/step_batch
> - memory.max_usage_in_bytes
> - memory.limit_in_bytes
> 
> The two files are the same bytes, which I'd think would be a candidate 
for oom-killer. But memory.oom_control says:
> 
> oom_kill_disable 0
> under_oom 0
> 
> My feeling is that the process was in D state, the oom-killer tried to be 
invoked, but then didn't and the system became wedged.
> 
> Has anyone run into this? If so, whats the fix? Apologies if this has 
been discussed before, I haven't noticed it on the group.
> 
> I wonder if it’s a bug in the oom-killer? Maybe it's been patched in a 
more recent kernel but looking at the kernels in the 6.10 series it doesn't 
look like a newer one would have a patch for a oom-killer bug.
> 
> Our setup is:
> 
> Centos 6.10
> 2.6.32-642.6.2.el6.x86_64
> Slurm 17.11.12

As far as I remember, Cgroups underwent a major upgrade with RHEL/CentOS 
7.  Maybe you should try to upgrade to CentOS 7.5 :-)

/Ole








[slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-29 Thread Christopher Benjamin Coffey
Hi,

We've been noticing an issue with nodes from time to time that become "wedged", 
or unusable. This is a state where ps, and w hang. We've been looking into this 
for a while when we get time and finally put some more effort into it 
yesterday. We came across this blog which describes almost the exact scenario:

https://rachelbythebay.com/w/2014/10/27/ps/

It has nothing to do with Slurm, but it does have to do with cgroups which we 
have enabled. It appears that processes that have hit their ceiling for memory 
and should be killed by oom-killer, and are in D state at the same time, cause 
the system to become wedged. For each node wedged, I've found a job out in:

/cgroup/memory/slurm/uid_3665/job_15363106/step_batch
- memory.max_usage_in_bytes
- memory.limit_in_bytes

The two files show the same byte value, which I'd think would make this a 
candidate for the oom-killer. But memory.oom_control says:

oom_kill_disable 0
under_oom 0

My feeling is that the process was in D state, the oom-killer tried to be 
invoked, but then didn't and the system became wedged.

Has anyone run into this? If so, what's the fix? Apologies if this has been 
discussed before, I haven't noticed it on the group.

I wonder if it’s a bug in the oom-killer? Maybe it's been patched in a more 
recent kernel, but looking at the kernels in the 6.10 series it doesn't look 
like a newer one would have a patch for an oom-killer bug.

Our setup is:

Centos 6.10
2.6.32-642.6.2.el6.x86_64
Slurm 17.11.12

And /etc/slurm/cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Cheers,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] About x11 support

2018-11-20 Thread Christopher Benjamin Coffey
Hi Chris,

Are you using the built in slurm x11 support? Or that spank plugin? We haven't 
been able to get the right combo of things in place to get the built in x11 to 
work.

Best,
Chris


—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 11/15/18, 5:00 AM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On Thursday, 15 November 2018 9:36:08 PM AEDT Mahmood Naderan wrote:

> Is there any update about native support of x11 in slurm v18?

It works here...

$ srun --x11 xdpyinfo   
srun: job 1744869 queued and waiting for resources
srun: job 1744869 has been allocated resources
name of display:localhost:57.0
version number:11.0
vendor string:The X.Org Foundation
vendor release number:11906000
X.Org version: 1.19.6
[...]

Remember that the internal version in Slurm uses libssh2 and so that 
imposes 
some restrictions on the types of keys that can be used, i.e. they have to 
be 
RSA keys.

Extra info here: 
https://slurm.schedmd.com/faq.html#x11

You can (apparently) still use the external plugin if you build Slurm 
without 
its internal X11 support.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC








[slurm-users] Slurm Job Efficiency Tools

2018-11-19 Thread Christopher Benjamin Coffey
Hi,

I gave a presentation at SC in the slurm booth on some slurm job efficiency 
tools, and web app that we developed. I figured that maybe others in this group 
could be interested too. If you'd like to see the short presentation, and the 
tools, and links to them, please see this presentation:

https://rcdata.nau.edu/hpcpub/share/slurm_presentation_sc18.pdf

Hope they are useful!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Reserving a GPU

2018-11-05 Thread Christopher Benjamin Coffey
Can anyone else confirm that it is not possible to reserve a GPU? Seems a bit 
strange.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 10/22/18, 10:01 AM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Thanks for confirmation Paul.

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 10/22/18, 9:29 AM, "slurm-users on behalf of R. Paul Wiegand" 
 wrote:

I had the same question and put in a support ticket.  I believe the 
answer is that you cannot.

On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey 
 wrote:


Hi,

I can't figure out how one would create a reservation to reserve a gres 
unit, such as a gpu. The man page doesn't really say that gres is supported for 
a reservation, but it does say tres is supported. Yet, I can't seem to figure 
out how one could specify a
 gpu with tres. I've tried:


scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 tres=gpu/tesla=1
scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 gres=gpu:tesla:1

Is it not possible to reserve a single gpu? Our gpus are requested 
normally in jobs like " --gres=gpu:tesla:1". I'd rather not reserve an entire 
node and thus reserve all of the gpus in it. Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167











Re: [slurm-users] Reserving a GPU

2018-10-22 Thread Christopher Benjamin Coffey
Thanks for confirmation Paul.

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 10/22/18, 9:29 AM, "slurm-users on behalf of R. Paul Wiegand" 
 wrote:

I had the same question and put in a support ticket.  I believe the answer 
is that you cannot.

On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey 
 wrote:


Hi,

I can't figure out how one would create a reservation to reserve a gres 
unit, such as a gpu. The man page doesn't really say that gres is supported for 
a reservation, but it does say tres is supported. Yet, I can't seem to figure 
out how one could specify a
 gpu with tres. I've tried:


scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 tres=gpu/tesla=1
scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 gres=gpu:tesla:1

Is it not possible to reserve a single gpu? Our gpus are requested normally 
in jobs like " --gres=gpu:tesla:1". I'd rather not reserve an entire node and 
thus reserve all of the gpus in it. Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167









[slurm-users] Meaning of assoc_limit_stop

2018-10-22 Thread Christopher Benjamin Coffey
Hi,

My question is in regard to the scheduling parameter: assoc_limit_stop

"If set and a job cannot start due to association limits, then do not attempt 
to initiate any lower priority jobs in that partition. Setting this can 
decrease system throughput and utilization, but avoid  potentially  starving 
larger jobs by preventing them from launching indefinitely."

Does this mean that if some group is at their assoc limit, and this parameter 
is in place, then no other lower priority jobs from other groups in the 
partition will be candidates to be scheduled? This wouldn't make sense to me 
for a site to ever want to do this.

Or does the parameter really mean this:

"If set and a job cannot start due to association limits, then do not attempt 
to initiate any lower priority jobs FROM THAT GROUP in that partition. Setting 
this can decrease system throughput and utilization, but avoid  potentially  
starving larger jobs by preventing them from launching indefinitely."

If the meaning is the latter, then I don't see how it can decrease system 
throughput and utilization. If it means the former, we'd want to do 
this so the scheduler isn't worrying about potentially many thousands of jobs from 
a group that is at its assoc limit, thus potentially increasing 
responsiveness.

Anyone have this parameter in production that can answer this?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



[slurm-users] Reserving a GPU

2018-10-22 Thread Christopher Benjamin Coffey
Hi,

I can't figure out how one would create a reservation to reserve a gres unit, 
such as a gpu. The man page doesn't really say that gres is supported for a 
reservation, but it does say tres is supported. Yet, I can't seem to figure out 
how one could specify a gpu with tres. I've tried:


scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 tres=gpu/tesla=1
scontrol create reservation starttime=2018-11-10T08:00:00 user=root 
duration=14-00:00:00 gres=gpu:tesla:1

Is it not possible to reserve a single gpu? Our gpus are requested normally in 
jobs like " --gres=gpu:tesla:1". I'd rather not reserve an entire node and thus 
reserve all of the gpus in it. Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Documentation for creating a login node for a SLURM cluster

2018-10-12 Thread Christopher Benjamin Coffey
In addition, fwiw, this login node will of course have a second network connection 
for campus, with the firewall set up to only allow ssh (and other essential 
services) from campus. Also, you may consider having a script developed to prevent 
folks from abusing the login node instead of using Slurm for their 
computations. We have a policy that allows user processes to consume only 
30 min of cpu time before they are killed, with an exception list of course. 
Also, our login node has lots of devel packages to allow folks to compile 
software, etc. You may want to have a motd on the login node to announce 
changes or offer tips. Finally, the login node doesn't need to have the slurm 
service running, but it does need the slurm software, the current slurm.conf, and 
the munge key.
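
A minimal sketch of the sort of policing script mentioned above (the 30-minute 
limit, the exception list, and the reliance on procps-ng's "cputimes" field are 
all assumptions; a production version would also want logging and user notification):

#!/bin/bash
# Kill login-node processes that have accumulated more than 30 minutes of CPU time.
LIMIT=1800                      # CPU seconds allowed per process
EXCEPTIONS="root munge slurm"   # accounts that are never touched
ps -eo pid=,user=,cputimes= | while read -r pid user cpusecs; do
    echo "$EXCEPTIONS" | grep -qw "$user" && continue
    if [ "$cpusecs" -gt "$LIMIT" ]; then
        kill -9 "$pid"
    fi
done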

Just some things off the top of my head.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 10/12/18, 6:33 AM, "slurm-users on behalf of Michael Gutteridge" 
 wrote:

I'm unaware of specific docs, but I tend to think of these simply as daemon 
nodes that aren't listed in slurm.conf.  We use Ubuntu and the packages we 
install are munge, slurm-wlm, and slurm-client (which
 drags in libslurmXX and slurm-wlm-basic-plugins).  


Then the setup is very similar to slurmd nodes- you need matching UIDs and 
the same munge key.  Access to the same file systems at the same locations as 
on daemon nodes is also advisable.
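
By way of illustration only, on Ubuntu that boils down to something like the 
following (the hostname is a placeholder, and the munge key copy assumes root 
ssh access to the controller):

apt-get install munge slurm-client
scp controller:/etc/munge/munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
systemctl restart munge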


Hope this helps.


Michael

On Fri, Oct 12, 2018 at 3:38 AM Aravindh Sampathkumar 
 wrote:


Hello.



I built a simple 2 node SLURM cluster for the first time, and I'm able to 
run jobs on the lone compute node from the node that runs Slurmctl. 

However, I'd like to setup a "login node" which only allows users to submit 
jobs to the SLURM cluster and not act as SLURM controller or as a compute node. 
I'm struggling to find documentation about what needs to be installed and
 configured on this login node to be able to submit jobs to the cluster. 



console(login node)  >  slurm controller(runs slurmctld and slurmdbd) 
> compute node (runs slurmd)



Can anybody point me to any relevant docs to configure the login node?



Thanks,


--

  Aravindh Sampathkumar

  
aravi...@fastmail.com 















Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Christopher Benjamin Coffey
That is interesting. It is disabled in 17.11.10:

static bool _enable_pack_steps(void)
{
bool enabled = false;
char *sched_params = slurm_get_sched_params();

if (sched_params && strstr(sched_params, "disable_hetero_steps"))
enabled = false;
else if (sched_params && strstr(sched_params, "enable_hetero_steps"))
enabled = true;
else if (mpi_type && strstr(mpi_type, "none"))
enabled = true;
xfree(sched_params);
return enabled;
}

I wonder if it is ill-advised to enable it!? I suppose I could try it. Thanks 
Chris!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 10/10/18, 12:11 AM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On 10/10/18 05:07, Christopher Benjamin Coffey wrote:

> Yet, we get an error: " srun: fatal: Job steps that span multiple
> components of a heterogeneous job are not currently supported". But
> the docs seem to indicate it should work?

Which version of Slurm are you on?  It was disabled by default in
17.11.x (and I'm not even sure it works if you enable it there) and
seems to be enabled by default in 18.08.x.

To see check the _enable_pack_steps() function src/srun/srun.c

All the best,
Chris (currently away in the UK)
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





[slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-09 Thread Christopher Benjamin Coffey
Hi,

I have a user trying to set up a heterogeneous job with one MPI_COMM_WORLD with 
the following:

==
#!/bin/bash
#SBATCH --job-name=hetero  
#SBATCH --output=/scratch/cbc/hetero.txt
#SBATCH --time=2:00
#SBATCH --workdir=/scratch/cbc  
#SBATCH --cpus-per-task=1 --mem-per-cpu=2g --ntasks=1 -C sb
#SBATCH packjob
#SBATCH --cpus-per-task=1 --mem-per-cpu=1g  --ntasks=1 -C sl
#SBATCH --mail-type=START,END

module load openmpi/3.1.2-gcc-6.2.0

srun --pack-group=0,1 ~/hellompi 
===


Yet, we get an error: " srun: fatal: Job steps that span multiple components of 
a heterogeneous job are not currently supported". But the docs seem to indicate 
it should work?

IMPORTANT: The ability to execute a single application across more than one job 
allocation does not work with all MPI implementations or Slurm MPI plugins. 
Slurm's ability to execute such an application can be disabled on the entire 
cluster by adding "disable_hetero_steps" to Slurm's SchedulerParameters 
configuration parameter.

By default, the applications launched by a single execution of the srun command 
(even for different components of the heterogeneous job) are combined into one 
MPI_COMM_WORLD with non-overlapping task IDs.

Does this not work with openmpi? If not, which mpi/slurm config will work? We 
have slurm.conf MpiDefault=pmi2 currently. I've tried a modern openmpi, and 
also mpich, and mvapich2.

Any help would be appreciated, thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-27 Thread Christopher Benjamin Coffey
Hi David,

I'd recommend the following, learned from bad experiences upgrading 
between previous major versions.

1. Consider upgrading to mysql-server 5.5 or greater

2. Purge/archive unneeded jobs/steps before the upgrade, to make the upgrade as 
quick as possible:

slurmdbd.conf:

ArchiveDir=/common/adm/slurmdb_archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveSteps=no
ArchiveResvs=no
ArchiveSuspend=no
PurgeEventAfter=1month
PurgeJobAfter=6months
PurgeResvAfter=2month
PurgeStepAfter=6months
PurgeSuspendAfter=2month


3. Take a fresh mysql dump after the archives occur:

mysqldump --all-databases > slurm_db.sql


4. Testing the update on another machine, or vm that has a representation of 
your environment (same rpms, configs, etc). Just take your newly created dump 
from production and load it into the test system:

mysql -u root < slurm_db.sql


Once you take care of any connection issues in mysql (allowing a different host 
to connect), you can fire up slurmdbd to perform the upgrade and see how 
long it takes and what hiccups you run into. Now you know, and can plan 
your maintenance window accordingly.
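
One way to watch and time the conversion on the test box (slurmdbd's -D keeps it 
in the foreground and -vvv turns up the log verbosity):

time slurmdbd -D -vvv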

Hope that helps! Good luck!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 9/26/18, 8:57 AM, "slurm-users on behalf of Baker D.J." 
 
wrote:

Thank you for your reply. You're correct, the systemd commands aren't 
invoked, however upgrading the slurm rpm effectively pulls the rug from under 
/usr/sbin/slurmctld. The v17.02 slurm rpm provides /usr/sbin/slurmctld,
 but from v17.11 that executable is provided by the slurm-slurmctld rpm. 


In other words, doing a minimal install of just the slurm and the slurmdbd 
rpms deletes the slurmctld executable. I haven't explicitly tested this, 
however I tested the upgrade on a compute node and experimented with
 the slurmd -- the logic should be the same. 


I guess that the question that comes to mind is.. Is it a really big deal 
if the slurmctld process is down whilst the slurmdbd is being upgraded? Bearing 
in mind that I will probably opt to suspend all run jobs and stop
 the partitions during the upgrade.


Best regards,
David


From: slurm-users  on behalf of 
Chris Samuel 
Sent: 26 September 2018 11:26
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08 

On Tuesday, 25 September 2018 11:54:31 PM AEST Baker D. J.  wrote:

> That will certainly work, however the slurmctld (or in the case of my test
> node, the slurmd) will be killed. The logic is that at v17.02 the slurm 
rpm
> provides slurmctld and slurmd. So upgrading that rpm will destroy/kill the
> existing slurmctld or slurmd processes.

If you do that with the --noscripts then will it really kill the process?  
Nothing should invoke the systemd commands with that, should it?  Or do you 
mean taking the libraries, etc, away out underneath of the running process 
will cause it to crash?

Might be worth testing that on on a VM to see if it will happen.

Best of luck!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC












Re: [slurm-users] Slurm strigger configuration

2018-09-19 Thread Christopher Benjamin Coffey
Killian, thank you very much! Never noticed the perm flag!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 9/19/18, 10:01 AM, "slurm-users on behalf of Kilian Cavalotti" 
 wrote:

On Wed, Sep 19, 2018 at 9:21 AM Christopher Benjamin Coffey
 wrote:
> The only thing that I've gotten working so far is this:
> sudo -u slurm bash -c "strigger --set -D -n cn15 -p 
/common/adm/slurm/triggers/nodestatus"
>
> So, that will run the nodestatus script which emails when the node cn15 
gets set into drain state. What I'd like to do, which I haven't put time into 
figuring out, is how to setup a persistent trigger that can run when ANY node 
goes into drain state. Let me know if you figure that out. As you can see 
above, the trigger has to be setup by the slurm user.

strigger takes a "--flags=perm" option, which makes the trigger
permanent and doesn't purge it after the event happened. So it doesn't
need tb re-armed in the trigger script for the next event.

Also, not specifying any job id nor node name in the strigger command
will make the trigger apply to all nodes. That's how we set our
default "down" trigger for all nodes on our cluster:
# su -s /bin/bash -c "strigger --set --down
--program=/share/admin/scripts/slurm/triggers/down.sh --flags=perm"
slurm

Also note that the script given to "--program" needs to be executable
by user Slurm on the controller node(s).

Cheers,
-- 
Kilian





Re: [slurm-users] Slurm strigger configuration

2018-09-19 Thread Christopher Benjamin Coffey
Hi Jodie,

The only thing that I've gotten working so far is this:

sudo -u slurm bash -c "strigger --set -D -n cn15 -p 
/common/adm/slurm/triggers/nodestatus"

So, that will run the nodestatus script which emails when the node cn15 gets 
set into drain state. What I'd like to do, which I haven't put time into 
figuring out, is how to setup a persistent trigger that can run when ANY node 
goes into drain state. Let me know if you figure that out. As you can see 
above, the trigger has to be setup by the slurm user.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 9/19/18, 8:48 AM, "slurm-users on behalf of Jodie H. Sprouse" 
 wrote:




Good morning. 
I’m struggling with getting strigger working correctly. 
My end goal sounds fairly simple: to get a mail notification if a node gets 
set into ‘drain’ mode. 


The man page for strigger states it must be run by the set slurmuser which 
is slurm:
#  scontrol show config | grep SlurmUser
SlurmUser   = slurm(990)



# grep slurm /etc/passwd
slurm:x:990:984:SLURM resource manager:/etc/slurm:/sbin/nologin



I created the file per the man page (I’m first trying to get it to work if 
a node goes down after receiving “option —drain does not exist”):
# cat /usr/sbin/slurm_admin_notify


#!/bin/bash
# Submit trigger for next event
strigger --set --node --down \
         --program=/usr/sbin/slurm_admin_notify
# Notify administrator by e-mail
/bin/mail oursitead...@ouremailserver.edu -s NodesDown:$*

———
If I run manually, I receive:
slurm_set_trigger: Access/permission denied


If I add “runuser -l slurm -c” in front of the strigger command, I 
receive: 
This account is currently not available.


The man page also states: “Trigger events
 are not processed instantly, but a check is performed for trigger events 
on a periodic basis (currently every 15 seconds). “
This leads me to believe I am possibly missing something in my install: where 
is that 15-second interval set?


Any suggestions would be greatly appreciated! How are folks accomplishing 
this?
Thank you!
Jodie






Re: [slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-24 Thread Christopher Benjamin Coffey
Thanks Paul!

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/24/18, 12:26 PM, "slurm-users on behalf of Paul Edmon" 
 
wrote:

There was a bug filed on this here:


https://bugs.schedmd.com/show_bug.cgi?id=5615

It will be fixed in 17.11.10.

-Paul Edmon-


On 08/24/2018 02:55 PM, Christopher Benjamin Coffey wrote:
> Odd that no one has this issue. Must be a site issue then? If so, can't 
think of what that would be. I suppose we may wait for .10 to be released where 
it looks like this may be corrected.
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>   
>
> On 8/20/18, 1:21 PM, "slurm-users on behalf of Christopher Benjamin 
Coffey"  wrote:
>
>  Hi,
>  
>  We've just recently installed slurm 17.11.9 and noticed an issue 
with sshare:
>  
>  sshare: error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/priority_multifactor.so): 
/usr/lib64/slurm/priority_multifactor.so: undefined symbol: sort_part_tier
>  sshare: error: Couldn't load specified plugin name for 
priority/multifactor: Dlopen of plugin file failed
>  sshare: error: cannot create priority context for 
priority/multifactor
>  
>  Seems to be caused by new line added to:
>  
>  src/plugins/priority/multifactor/priority_multifactor.c
>  
>  list_sort(job_ptr->part_ptr_list, sort_part_tier);
>  
>  
>  Looks like maybe this is a bug fixed in next slurm 17.11.10 ?
>  
>  
>  https://github.com/SchedMD/slurm/commit/67a82c369a7530ce7838e6294973af0082d8905b#diff-34750311c0168c98230ff16d01ea5e9b
>  
>  
>  Anyone else run into this?
>  
>  Best,
>  Chris
>  
>  —
>  Christopher Coffey
>  High-Performance Computing
>  Northern Arizona University
>  928-523-1167
>   
>  
>  
>






Re: [slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-24 Thread Christopher Benjamin Coffey
Odd that no one has this issue. Must be a site issue then? If so, can't think 
of what that would be. I suppose we may wait for .10 to be released where it 
looks like this may be corrected.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 8/20/18, 1:21 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hi,

We've just recently installed slurm 17.11.9 and noticed an issue with 
sshare:

sshare: error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/priority_multifactor.so): 
/usr/lib64/slurm/priority_multifactor.so: undefined symbol: sort_part_tier
sshare: error: Couldn't load specified plugin name for 
priority/multifactor: Dlopen of plugin file failed
sshare: error: cannot create priority context for priority/multifactor

Seems to be caused by new line added to:

src/plugins/priority/multifactor/priority_multifactor.c

list_sort(job_ptr->part_ptr_list, sort_part_tier);


Looks like maybe this is a bug fixed in next slurm 17.11.10 ?


    https://github.com/SchedMD/slurm/commit/67a82c369a7530ce7838e6294973af0082d8905b#diff-34750311c0168c98230ff16d01ea5e9b


Anyone else run into this?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 





[slurm-users] Slurm 17.11.9, sshare undefined symbol

2018-08-20 Thread Christopher Benjamin Coffey
Hi,

We've just recently installed slurm 17.11.9 and noticed an issue with sshare:

sshare: error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/priority_multifactor.so): 
/usr/lib64/slurm/priority_multifactor.so: undefined symbol: sort_part_tier
sshare: error: Couldn't load specified plugin name for priority/multifactor: 
Dlopen of plugin file failed
sshare: error: cannot create priority context for priority/multifactor

Seems to be caused by new line added to:

src/plugins/priority/multifactor/priority_multifactor.c

list_sort(job_ptr->part_ptr_list, sort_part_tier);


Looks like maybe this is a bug fixed in next slurm 17.11.10 ?

https://github.com/SchedMD/slurm/commit/67a82c369a7530ce7838e6294973af0082d8905b#diff-34750311c0168c98230ff16d01ea5e9b


Anyone else run into this?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



[slurm-users] Slurmstepd sleep processes

2018-08-03 Thread Christopher Benjamin Coffey
Hello,

Has anyone observed "sleep 1" processes on their compute nodes? They 
seem to be tied to the slurmstepd extern process in slurm:

4 S root 136777  1  0  80   0 - 73218 do_wai 05:48 ?00:00:01 
slurmstepd: [13220317.extern]
0 S root 136782 136777  0  80   0 - 25229 hrtime 05:48 ?00:00:00  
\_ sleep 1
4 S root 136784  1  0  80   0 - 73280 do_wai 05:48 ?00:00:02 
slurmstepd: [13220317.batch]
4 S tes87136789 136784  0  80   0 - 26520 do_wai 05:48 ?00:00:00  
\_ /bin/bash /var/spool/slurm/slurmd/job13220317/slurm_script
4 S root 136807  1  0  80   0 - 107157 do_wai 05:48 ?   00:00:01 
slurmstepd: [13220317.1]

I'm not exactly sure what the extern piece is for. Anyone know what this is all 
about? Is this normal? We just saw this the other day while investigating some 
issues. Sleeping for 3.17 years seems strange. Any help would be appreciated, 
thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



[slurm-users] Jobs blocking scheduling progress

2018-07-03 Thread Christopher Benjamin Coffey
Hello!

We are having an issue with high priority gpu jobs blocking low priority cpu 
only jobs.

Our cluster is set up with one partition, "all". All nodes reside in this 
partition. In this "all" partition we have four generations of compute nodes, 
including gpu nodes. We do this to make use of those unused cores on the gpu 
nodes for compute only jobs. We schedule the various different generations, and 
gpu nodes by the user specifying a constraint (if they care), and a --qos=gpu / 
--gres=gpu:tesla:1 for gpu nodes. The gpu qos will give the jobs the highest 
priority in the queue, so that they can get scheduled sooner onto the limited 
resource that we have in gpu's. So this has worked out real nice for quite some 
time. But lately we've noticed that the gpu jobs are blocking the cpu only 
jobs. Yes, the gpu jobs have higher priority, yet, the gpu jobs can only run on 
a very small subset of nodes compared to the cpu only jobs. But it appears that 
slurm isn't taking into consideration the limited set of nodes the gpu job can 
run on. That's the only explanation I can see for the gpu jobs blocking the cpu 
only jobs. I'm not sure if this is due to a recent slurm change, or if we just 
never noticed, but it's definitely happening.

For example, the behavior happens in the following scenario

- 15 compute nodes (no gpus) are idle
- All of the gpus are occupied
- 1000's of low priority compute only jobs in the pending queue
- 100's of highest priority gpu jobs in the pending queue

In the above scenario, the above low priority jobs are not backfilled, or 
started, yet compute only nodes remain idle. If I hold the gpu jobs, the lower 
priority compute only jobs are then started.

Anyone seen this? Am I thinking about this wrong? I would think that slurm 
should not be considering the nodes with no gpus to fulfill the gpu jobs.

I have an idea how to fix this scenario, but I think our current config should 
work. The fix I am mulling over is to create a gpu partition, and place the gpu 
nodes into that partition. Then, use the all_partitions job submit plugin to 
schedule compute only jobs into both partitions. The gpu jobs would then only 
land in the gpu partition. I'd think that would definitely fix the issue, but 
maybe there is a down side. Yet, I think how we have it should be working!?
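
A rough slurm.conf sketch of that idea, with entirely hypothetical node names, just 
to make the layout concrete:

NodeName=cn[001-099] ...                    # cpu-only nodes
NodeName=gpu[01-08] Gres=gpu:tesla:4 ...    # gpu nodes
PartitionName=all Nodes=cn[001-099] Default=YES
PartitionName=gpu Nodes=gpu[01-08]
JobSubmitPlugins=all_partitions             # cpu-only jobs get queued in every partition

The gpu jobs would then submit with -p gpu plus the usual --gres, while everything 
else rides the plugin into both partitions.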

Thanks for your advice!

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-11 Thread Christopher Benjamin Coffey
Hi Hadrian,

Thank you, unfortunately that is not the issue. We can connect to the nodes 
outside of slurm and have the X11 stuff work properly.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/7/18, 6:49 PM, "slurm-users on behalf of Hadrian Djohari" 
 wrote:

Hi,


I do not remember whether we had the same error message.
But, if the user's known_host has an old entry of the node he is trying to 
connect, the x11 won't connect properly.
Once the known_host entry has been deleted, the x11 connects just fine.


Hadrian


On Thu, Jun 7, 2018 at 6:26 PM, Christopher Benjamin Coffey
 wrote:

Hi,

I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the 
login node and get xeyes to work, etc. However, srun --x11 xeyes results in:

[cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
X11 connection rejected because of wrong authentication.
Error: Can't open display: localhost:60.0
srun: error: cn100: task 0: Exited with exit code 1

On the node in slurmd.log it says:

[2018-06-07T15:04:29.932] _run_prolog: run job script took usec=1
[2018-06-07T15:04:29.932] _run_prolog: prolog with lock for job 11806306 
ran for 0 seconds
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: 
/slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: 
/slurm/uid_3301/job_11806306/step_extern: alloc=1000MB mem.limit=1000MB 
memsw.limit=1000MB
[2018-06-07T15:04:30.138] [11806306.extern] X11 forwarding established on 
DISPLAY=cn100:60.0
[2018-06-07T15:04:30.239] launch task 11806306.0 request from 
3301.3302@172.16.3.21 (port 32453)
[2018-06-07T15:04:30.240] lllp_distribution jobid [11806306] implicit auto 
binding: cores,one_thread, dist 1
[2018-06-07T15:04:30.240] _task_layout_lllp_cyclic 
[2018-06-07T15:04:30.240] _lllp_generate_cpu_bind jobid [11806306]: 
mask_cpu,one_thread, 0x001
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: 
/slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: 
/slurm/uid_3301/job_11806306/step_0: alloc=1000MB mem.limit=1000MB 
memsw.limit=1000MB
[2018-06-07T15:04:30.303] [11806306.0] task_p_pre_launch: Using 
sched_affinity for tasks
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: remote 
disconnected
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: exiting 
thread
[2018-06-07T15:04:30.376] [11806306.0] done with job
[2018-06-07T15:04:30.413] [11806306.extern] x11 forwarding shutdown complete
[2018-06-07T15:04:30.443] [11806306.extern] _oom_event_monitor: oom-kill 
event count: 1
[2018-06-07T15:04:30.508] [11806306.extern] done with job

It seems like its close, as srun, and the node can agree on the port to 
connect on, but there is a difference between slurmd specifying the node name 
and port, where srun is trying to connect via localhost and the same port. 
Maybe I have an ssh setting wrong
 somewhere? I've tried all combinations I believe in ssh_config, and 
sshd_config. No issues with /home either, it’s a shared filesystem that each 
node mounts, and we even tried no_root_squash so root can write to the 
.Xauthority file like some folks have suggested.

Also, xauth list shows that there was no magic cookie written for host 
cn100:

[cbc@wind ~ ]$ xauth list
wind.hpc.nau.edu/unix:14  MIT-MAGIC-COOKIE-1  ac4a0f1bfe9589806f81dd45306ee33d

Something preventing root from writing the magic cookie? The file is 
definitely writeable:

[root@cn100 ~]# touch /home/cbc/.Xauthority 
[root@cn100 ~]#

Anyone have any ideas? Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167









-- 
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490












[slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-07 Thread Christopher Benjamin Coffey
Hi,

I've compiled slurm 17.11.7 with x11 support. We can ssh to a node from the 
login node and get xeyes to work, etc. However, srun --x11 xeyes results in:

[cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
X11 connection rejected because of wrong authentication.
Error: Can't open display: localhost:60.0
srun: error: cn100: task 0: Exited with exit code 1

On the node in slurmd.log it says:

[2018-06-07T15:04:29.932] _run_prolog: run job script took usec=1
[2018-06-07T15:04:29.932] _run_prolog: prolog with lock for job 11806306 ran 
for 0 seconds
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: 
/slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: 
/slurm/uid_3301/job_11806306/step_extern: alloc=1000MB mem.limit=1000MB 
memsw.limit=1000MB
[2018-06-07T15:04:30.138] [11806306.extern] X11 forwarding established on 
DISPLAY=cn100:60.0
[2018-06-07T15:04:30.239] launch task 11806306.0 request from 
3301.3302@172.16.3.21 (port 32453)
[2018-06-07T15:04:30.240] lllp_distribution jobid [11806306] implicit auto 
binding: cores,one_thread, dist 1
[2018-06-07T15:04:30.240] _task_layout_lllp_cyclic 
[2018-06-07T15:04:30.240] _lllp_generate_cpu_bind jobid [11806306]: 
mask_cpu,one_thread, 0x001
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: 
/slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: 
/slurm/uid_3301/job_11806306/step_0: alloc=1000MB mem.limit=1000MB 
memsw.limit=1000MB
[2018-06-07T15:04:30.303] [11806306.0] task_p_pre_launch: Using sched_affinity 
for tasks
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: remote 
disconnected
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: exiting 
thread
[2018-06-07T15:04:30.376] [11806306.0] done with job
[2018-06-07T15:04:30.413] [11806306.extern] x11 forwarding shutdown complete
[2018-06-07T15:04:30.443] [11806306.extern] _oom_event_monitor: oom-kill event 
count: 1
[2018-06-07T15:04:30.508] [11806306.extern] done with job

It seems like its close, as srun, and the node can agree on the port to connect 
on, but there is a difference between slurmd specifying the node name and port, 
where srun is trying to connect via localhost and the same port. Maybe I have 
an ssh setting wrong somewhere? I've tried all combinations I believe in 
ssh_config, and sshd_config. No issues with /home either, it’s a shared 
filesystem that each node mounts, and we even tried no_root_squash so root can 
write to the .Xauthority file like some folks have suggested.

Also, xauth list shows that there was no magic cookie written for host cn100:

[cbc@wind ~ ]$ xauth list
wind.hpc.nau.edu/unix:14  MIT-MAGIC-COOKIE-1  ac4a0f1bfe9589806f81dd45306ee33d

Something preventing root from writing the magic cookie? The file is definitely 
writeable:

[root@cn100 ~]# touch /home/cbc/.Xauthority 
[root@cn100 ~]#

Anyone have any ideas? Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Splitting mpi rank output

2018-05-14 Thread Christopher Benjamin Coffey
Thanks Chris! :)

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 5/10/18, 12:42 AM, "slurm-users on behalf of Chris Samuel" 
<slurm-users-boun...@lists.schedmd.com on behalf of ch...@csamuel.org> wrote:

On Thursday, 10 May 2018 2:25:49 AM AEST Christopher Benjamin Coffey wrote:

> I have a user trying to use %t to split the mpi rank outputs into 
different
> files and it's not working. I verified this too. Any idea why this might
> be? This is the first that I've heard of a user trying to do this.

I think they want to use that as an argument to srun, not sbatch.

I don't know why it doesn't work for sbatch, I'm guessing it doesn't get 
passed on in the environment?  From the look of the srun manual page it 
probably should set SLURM_STDOUTMODE.  But then you'd get both the batch 
output and rank 0 going to the first one.  Seems like a bug to me.

However, I can confirm that it works if you pass it to srun instead.

[csamuel@farnarkle1 tmp]$ cat test-rank.sh
#!/bin/bash
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=1

srun -o foo-%t.out hostname

[csamuel@farnarkle1 tmp]$ ls -ltr
total 264
-rw-rw-r-- 1 csamuel hpcadmin 89 May 10 17:34 test-rank.sh
-rw-rw-r-- 1 csamuel hpcadmin  0 May 10 17:34 slurm-127420.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-9.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-8.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-7.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-6.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-5.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-4.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-3.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-2.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-1.out
-rw-rw-r-- 1 csamuel hpcadmin  7 May 10 17:34 foo-0.out


[csamuel@farnarkle1 tmp]$ more foo-*
::
foo-0.out
::
john37
::
foo-1.out
::
john38
::
foo-2.out
::
john39
::
foo-3.out
::
john40
::
foo-4.out
::
john41
::
foo-5.out
::
john42
::
foo-6.out
::
john43
::
foo-7.out
::
john44
::
foo-8.out
::
john45
::
foo-9.out
::
john46

Hope that helps,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC






[slurm-users] --uid , --gid option is root only now :'(

2018-05-10 Thread Christopher Benjamin Coffey
Hi,

We noticed that the --uid and --gid functionality recently changed: previously, 
a user in the slurm administrators group could launch jobs with --uid and 
--gid, allowing them to submit jobs as another user. Now, in order to use 
--uid and --gid, you have to be the root user.

What was the reasoning behind making this change? Do people not trust the folks 
in the slurm administrators group to allow this behavior? It seems odd.

This bit us a while back when upgrading from 16.05.6 to slurm 17.11, which 
includes this --uid/--gid change. We've just recently gotten time to look into 
it. We've patched slurm (a very small change) to remove the check, as we need 
this functionality. I'd imagine there wouldn't be any consequences from such a 
minor change, but I would like to hear, if possible, why the change was made 
and whether this code change is a bad idea. Also, is there a better solution 
that allows a non-root slurm administrator to submit jobs as another user?

slurm/src/sbatch/opt.c


case LONG_OPT_UID:
    if (!optarg)
        break;  /* Fix for Coverity false positive */
    // remove the root-only constraint for --uid
    /*
    if (getuid() != 0) {
        error("--uid only permitted by root user");
        exit(error_exit);
    }
    */
    if (opt.euid != (uid_t) -1) {
        error("duplicate --uid option");
        exit(error_exit);
    }
    if (uid_from_string(optarg, &opt.euid) < 0) {
        error("--uid=\"%s\" invalid", optarg);
        exit(error_exit);
    }
    break;

case LONG_OPT_GID:
    if (!optarg)
        break;  /* Fix for Coverity false positive */
    // remove the root-only constraint for --gid
    /*
    if (getuid() != 0) {
        error("--gid only permitted by root user");
        exit(error_exit);
    }
    */
    if (opt.egid != (gid_t) -1) {
        error("duplicate --gid option");
        exit(error_exit);
    }
    if (gid_from_string(optarg, &opt.egid) < 0) {
        error("--gid=\"%s\" invalid", optarg);
        exit(error_exit);
    }
    break;
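
For context, the kind of invocation we need working again looks roughly like 
this, run from a non-root account that is in the slurm administrators list 
(the user/group names below are placeholders):

# Submit on behalf of user "jdoe" from a non-root slurm admin account
sbatch --uid=jdoe --gid=jdoe --wrap="hostname"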


Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



[slurm-users] Runaway jobs issue, slurm 17.11.3

2018-05-09 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch of runaway jobs, but we 
cannot clear them:

sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

58588

Has anyone run into this? We've tried restarting slurmdbd, slurmctld, mysql, 
etc., but it does not help.

This all started last week when slurm crashed after being seriously hammered 
by a user submitting 500K 2-minute jobs. Slurmdbd appeared unable to handle all 
the transactions that slurmctld was sending it:

...
[2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
[2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
[2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
[2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
...

...
[2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), 
RESTART SLURMDBD NOW
...

...
error: slurmdbd: Sending fini msg: No error
...

So at this point lots and lots of our nodes are idle, but slurm is not 
starting jobs. I'm thinking slurmctld may be confused and still think that all 
of those runaway jobs are running.

I see that there is a fix for runaway jobs in version 17.11.5:

-- sacctmgr - fix runaway jobs identification.

We're thinking about upgrading to see if this will fix our issue.

Hope maybe someone has run into this.

Thanks,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
We've gotten around the issue where we could not remove the runaway jobs. We 
had to go the manual route of manipulating the db directly. We actually used a 
great script that Loris Bennett wrote a while back; I hadn't had to use it for 
a long while - thanks again! :)
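
For anyone in the same boat, the statement involved is roughly of the following 
shape. This is a rough, untested sketch only - the real script does more 
checking, the table/column names come from our 17.11 schema, and 3 is 
JOB_COMPLETE - so take a database backup and cross-check the job ids against 
squeue before trying anything like it:

-- Rough sketch only (NOT the actual script): close out DB entries that
-- never got an end time recorded, after confirming the jobs are really
-- gone from the controller.
UPDATE monsoon_job_table
   SET state = 3, time_end = time_start
 WHERE time_end = 0
   AND time_start > 0
   AND time_submit < UNIX_TIMESTAMP('2018-04-20');  -- placeholder cutoff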

An item of interest for the developers: there seems to be a limit that we had 
exceeded, which the "sacctmgr show runawayjobs" command (or an associated 
command) could not handle. After fixing a good portion of the jobs (~32K left), 
the "sacctmgr show runawayjobs" command no longer displayed this error:

sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

and instead gave the normal dialog ("would you like to remove them ... (y/n)", 
etc.).

The limit appears to be here:

common/slurm_persist_conn.c
...
#define MAX_MSG_SIZE (16*1024*1024)
...

I wonder if that should be increased? It's probably not normal to have 56K 
runaway jobs, but it still seems worth addressing.
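
(For reference, 16*1024*1024 = 16,777,216 bytes, while the Invalid msg_size 
errors in the slurmdbd log quoted below report 31,302,621 bytes, so the 
runaway-job reply was roughly double the current limit.)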

Anyhoo, it seems things are back to normal as far as we can tell. We will be 
looking into providing faster storage for the db, but does it seem reasonable 
for slurm to crash under the circumstances I mentioned?


Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 4/24/18, 10:20 AM, "slurm-users on behalf of Christopher Benjamin Coffey" 
<slurm-users-boun...@lists.schedmd.com on behalf of chris.cof...@nau.edu> wrote:

Hi, we have an issue currently where we have a bunch (56K) of runaway jobs, 
but we cannot clear them:

sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

58588

Has anyone run into this? We've tried, restarting slurmdbd, slurmctl, 
mysql, etc. but does not help.

Slurmdbd log shows the following when the "sacctmgr show runawayjobs" 
command is run.

[2018-04-24T07:56:03.869] error: Invalid msg_size (31302621) from 
connection 12(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.872] error: Invalid msg_size (31302621) from 
connection 7(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.874] error: Invalid msg_size (31302621) from 
connection 12(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.875] error: Invalid msg_size (31302621) from 
connection 7(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.877] error: Invalid msg_size (31302621) from 
connection 12(172.16.2.1) uid(3510)

Seems to indicate that possibly there are too many runaway jobs needing to 
be cleared? I wonder if there is a way to select a fewer number for removal. 
Don't see that option however.

This all started last week when slurm crashed due to being seriously 
hammered by a user submitting 500K 2min jobs. Slurmdbd appeared to not be able 
to handle all the transactions that slurmctl was sending it:

...
[2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
[2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
[2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
[2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
...

...
[2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), 
RESTART SLURMDBD NOW
...

...
error: slurmdbd: Sending fini msg: No error
...

So now at this point, lots and lots of our nodes are idle, but slurm is not 
starting jobs.

[cbc@siris ~ ]$ sreport cluster utilization


Cluster Utilization 2018-04-23T00:00:00 - 2018-04-23T23:59:59
Usage reported in CPU Minutes


  Cluster Allocated Down PLND Down Idle Reserved Reported
--------- --------- ---- --------- ---- -------- --------
  monsoon   4216320    0         0    0        0  4216320


sreport shows the entire cluster fully utilized, yet this is not the case.

I see that there is a fix for runaway jobs in version 17.11.5:

-- sacctmgr - fix runaway jobs identification.

We upgraded to 17.11.5 this morning but still we cannot clear the runaway 
jobs. I wonder if we'll need to manually remove them with some mysql foo. We 
are investigating this now.

Hope maybe someone has run into this.

Thanks,
Chris
—

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 





[slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch (56K) of runaway jobs, but 
we cannot clear them:

sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

58588

Has anyone run into this? We've tried restarting slurmdbd, slurmctld, mysql, 
etc., but it does not help.

Slurmdbd log shows the following when the "sacctmgr show runawayjobs" command 
is run.

[2018-04-24T07:56:03.869] error: Invalid msg_size (31302621) from connection 
12(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.872] error: Invalid msg_size (31302621) from connection 
7(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.874] error: Invalid msg_size (31302621) from connection 
12(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.875] error: Invalid msg_size (31302621) from connection 
7(172.16.2.1) uid(3510)
[2018-04-24T07:56:03.877] error: Invalid msg_size (31302621) from connection 
12(172.16.2.1) uid(3510)

This seems to indicate that there may be too many runaway jobs needing to be 
cleared. I wonder if there is a way to select a smaller number for removal; I 
don't see that option, however.

This all started last week when slurm crashed after being seriously hammered 
by a user submitting 500K 2-minute jobs. Slurmdbd appeared unable to handle all 
the transactions that slurmctld was sending it:

...
[2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
[2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
[2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
[2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
...

...
[2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), 
RESTART SLURMDBD NOW
...

...
error: slurmdbd: Sending fini msg: No error
...

So now at this point, lots and lots of our nodes are idle, but slurm is not 
starting jobs.

[cbc@siris ~ ]$ sreport cluster utilization

Cluster Utilization 2018-04-23T00:00:00 - 2018-04-23T23:59:59
Usage reported in CPU Minutes

  Cluster Allocated Down PLND Down Idle Reserved Reported
--------- --------- ---- --------- ---- -------- --------
  monsoon   4216320    0         0    0        0  4216320


sreport shows the entire cluster fully utilized, yet this is not the case.
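
(For reference, 4,216,320 allocated CPU-minutes over a 1,440-minute day works 
out to 2,928 cores' worth, i.e. every core reported as busy for the entire day, 
which is consistent with the runaway jobs never being marked as finished.)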

I see that there is a fix for runaway jobs in version 17.11.5:

-- sacctmgr - fix runaway jobs identification.

We upgraded to 17.11.5 this morning, but we still cannot clear the runaway 
jobs. I wonder if we'll need to remove them manually with some mysql foo. We 
are investigating this now.

Hope maybe someone has run into this.

Thanks,
Chris
—

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 



Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-26 Thread Christopher Benjamin Coffey
Good thought, Chris. Yet in our case the system does not have the 
Spectre/Meltdown kernel fixes.

Just to update everyone: we performed the upgrade successfully after first 
purging more data (jobs/steps). Per Hendryk's recommendation, we did the 
following to ensure the purge happened right away:

ArchiveDir=/common/adm/slurmdb_archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveSteps=no
ArchiveResvs=no
ArchiveSuspend=no
PurgeEventAfter=1month
PurgeJobAfter=2880hours     # <- changing from 18 months
PurgeResvAfter=2month
PurgeStepAfter=2880hours    # <- changing from 18 months
PurgeSuspendAfter=2month

We specified hours so that the purge would kick in quickly.

With only 4 months of jobs/steps left, the upgrade took ~1.5 hrs. This was for 
a 334 MB db with 1.1M jobs.

We've also made this change to stop tracking energy and other things we don't 
care about right now:

AccountingStorageTRES=cpu,mem

Hope that will help for the future.

Thanks Hendryk, and Paul :)

Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
On 2/23/18, 3:13 PM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On Friday, 23 February 2018 8:04:50 PM AEDT Miguel Gila wrote:

> Interestingly enough, a poor vmare VM (2CPUs, 3GB/RAM) with MariaDB 5.5.56
> outperformed our central MySQL 5.5.59 (128GB, 14core, SAN) by a factor of
> at least 3 on every table conversion.

Wild idea completely out of left field..

Does the production system have the updates for Meltdown and Spectre 
applied, 
whereas the VM setup does not?

There are meant to be large impacts from those fixes for syscall heavy 
applications and databases are one of those nightmare cases...

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC






Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Christopher Benjamin Coffey
Hi Loris,

"But that's only the case if the program is started with srun or some form 
of mpirun.  Otherwise the program just gets started once on one core and 
the other cores just idle."

Yes, maybe that's true when not using srun. I'm not sure, as we tell everyone 
to use srun to launch every type of task.
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 2/22/18, 8:25 AM, "slurm-users on behalf of Loris Bennett" 
<slurm-users-boun...@lists.schedmd.com on behalf of loris.benn...@fu-berlin.de> 
wrote:

Hi, Other Chris,
    
Christopher Benjamin Coffey <chris.cof...@nau.edu> writes:

> Loris,
>
> It’s simple, tell folks only to use -n for mpi jobs, and -c otherwise 
(default). 
>
> It’s a big deal if folks use -n when it’s not an mpi program. This is
> because the non mpi program is launched n times (instead of once with
> internal threads) and will stomp over logs and output files
> (uncoordinated) leading to poor performance and incorrect results.

But that's only the case if the program is started with srun or some
form of mpirun.  Otherwise the program just gets started once on one
core and the other cores just idle.  However, I could argue that this is
worse than starting multiple instances, because the user might think
everything is OK and go on wasting resources.

So maybe it is a good idea to tell users that if they don't know what
MPI is, then they should forget about multiple tasks and just set
--cpus-per-task, keeping the default of 1 task.  That way I wouldn't get users
running a single-threaded application on 100 cores (the 1000-core job
got stuck in the queue :-/ ).

I think I'm convinced.

Cheers,

Loris

[snip (53 lines)]

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de





Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-22 Thread Christopher Benjamin Coffey
Thanks Paul. I didn't realize we were tracking energy. It looks like the best 
way to stop tracking it is to specify exactly what you want to track with 
AccountingStorageTRES? I'll give that a try.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 2/22/18, 8:18 AM, "slurm-users on behalf of Paul Edmon" 
<slurm-users-boun...@lists.schedmd.com on behalf of ped...@cfa.harvard.edu> 
wrote:

Typically the long db upgrades are only for major version upgrades.  
Most of the time minor versions don't take nearly as long.

At least with our upgrade from 17.02.9 to 17.11.3 the upgrade only took 
1.5 hours with 6 months worth of jobs (about 10 million jobs).  We don't 
track energy usage though so perhaps we avoided that particular query 
due to that.

 From past experience these major upgrades can take quite a bit of time 
as they typically change a lot about the DB structure in between major 
versions.

-Paul Edmon-

On 02/22/2018 06:17 AM, Malte Thoma wrote:
> FYI:
> * We broke our upgrade from 17.02.1-2 to 17.11.2 after about 18 h.
> * Dropped the job table ("truncate xyz_job_table;")
> * Executed the everlasting sql command by hand on a back-up database
> * Meanwhile we did the slurm upgrade (fast)
> * Reset the First-Job-ID to a high number
> * Inserted the converted datatable in the real database again.
>
> It took two experts for this task and we would appreciate a better 
> upgrade-concept very much!
> I fact, we hesitate to upgrade from 17.11.2  to 17.11.3, because we 
> are afraid of similar problems. Does anyone has experience with this?
>
> It would be good to know if there is ANY chance if future upgrades 
> will cause the same problems or if this will become better?
>
> Regards,
> Malte
>
>
    >
>
>
>
> Am 22.02.2018 um 01:30 schrieb Christopher Benjamin Coffey:
>> This is great to know Kurt. We can't be the only folks running into 
>> this.. I wonder if the mysql update code gets into a deadlock or 
>> something. I'm hoping a slurm dev will chime in ...
>>
>> Kurt, out of band if need be, I'd be interested in the details of 
>> what you ended up doing.
>>
>> Best,
>> Chris
>>
>> —
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167
>>
>> On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" 
>> <slurm-users-boun...@lists.schedmd.com on behalf of k...@sciops.net> 
>> wrote:
>>
>>  On Wed, Feb 21, 2018 at 11:56:38PM +, Christopher Benjamin 
>> Coffey wrote:
>>  > Hello,
>>  >
>>  > We have been trying to upgrade slurm on our cluster from 
>> 16.05.6 to 17.11.3. I'm thinking this should be doable? Past upgrades 
>> have been a breeze, and I believe during the last one, the db upgrade 
>> took like 25 minutes. Well now, the db upgrade process is taking far 
>> too long. We previously attempted the upgrade during a maintenance 
>> window and the upgrade process did not complete after 24 hrs. I gave 
>> up on the upgrade and reverted the slurm version back by restoring a 
>> backup db.
>>   We hit this on our try as well: upgrading from 17.02.9 to 
>> 17.11.3.  We
>>  truncated our job history for the upgrade, and then did the rest 
>> of the
>>  conversion out-of-band and re-imported it after the fact. It 
>> took us
>>  almost sixteen hours to convert a 1.5 million-job store.
>>   We got hung up on precisely the same query you did, on a 
>> similarly hefty
>>  machine.  It caused us to roll back an upgrade and try again 
>> during our
>>  subsequent maintenance window with the above approach.
>>   khm
>>
>






Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Christopher Benjamin Coffey
Loris,

It’s simple: tell folks to use -n only for MPI jobs, and -c otherwise (the 
default).

It’s a big deal if folks use -n when it’s not an MPI program. This is because 
the non-MPI program is launched n times (instead of once with internal 
threads) and will stomp over logs and output files (uncoordinated), leading to 
poor performance and incorrect results.
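
To make that concrete, here is a minimal sketch of what we hand out for a 
threaded (non-MPI) program; the binary name is a placeholder and this is not an 
official template:

#!/bin/bash
# One task, several CPUs for that task's threads
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # if it happens to be OpenMP
srun ./my_threaded_app    # placeholder binary; launched once, threads inside

# For a real MPI code you would instead ask for --ntasks=N and let srun
# start the N ranks.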

Best,
Chris
> On Feb 22, 2018, at 1:52 AM, Loris Bennett  wrote:
> 
> Hi Chris,
> 
> Christopher Samuel  writes:
> 
>>> On 22/02/18 18:49, Miguel Gutiérrez Páez wrote:
>>> 
>>> What's the real meaning of ntasks? Has cpus-per-task and ntasks the
>>> same meaning in sbatch and srun?
>> 
>> --ntasks is for parallel distributed jobs, where you can run lots of
>> independent processes that collaborate using some form of communication
>> between the processes (usually MPI for HPC).
>> 
>> So inside your batch script you would use "srun" to start up the tasks.
>> 
>> However, unless you code is written to make use of that interface then
>> it's not really going to help you, and so for any multithreaded
>> application you need to use --cpus-per-task instead.
> 
> [snip (11 lines)]
> 
> But does it make any difference for a multithreaded program if I have
> 
>  #SBATCH --ntasks=4
>  #SBATCH --nodes=1-1
> 
> rather than
> 
>  #SBATCH --ntasks=1
>  #SBATCH --cpus-per-task=4
> 
> Up to now I have only thought of --cpus-per-task in connection with
> hybrid MPI/OpenMP jobs, which we don't actually have.  Thus I tend to
> tell users to think always in terms of tasks, regardless of whether
> these are MPI processes or just threads.
> 
> One downside of my approach is that if the user forgets to specify
> --nodes and --ntasks is greater than 1, non-MPI jobs can be assigned to
> multiple nodes.
> 
> Cheers,
> 
> Loris
> 
> -- 
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
> 


[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-21 Thread Christopher Benjamin Coffey
Hello,

We have been trying to upgrade slurm on our cluster from 16.05.6 to 17.11.3. 
I'm thinking this should be doable? Past upgrades have been a breeze, and I 
believe during the last one, the db upgrade took like 25 minutes. Well now, the 
db upgrade process is taking far too long. We previously attempted the upgrade 
during a maintenance window and the upgrade process did not complete after 24 
hrs. I gave up on the upgrade and reverted the slurm version back by restoring 
a backup db.

Since the failed upgrade attempt, I've archived a bunch of jobs, as we had 4 
years of jobs in the database; we are now keeping only the last 1.5 years' 
worth. This reduced our db size from 3.7GB to 1.1GB. We are now archiving jobs 
regularly through slurm.

I've finally had time to look at this a bit more and we've restored the reduced 
database onto another system to test the upgrade process in a dev environment, 
hoping to prove that the slimmed down db will upgrade within a reasonable 
amount of time. Yet, the current upgrade on this dev system has already taken 
20 hrs. The database has 1.8M jobs. That doesn't seem like that many jobs!

The conversion is stuck on this command:

update "monsoon_job_table" as job left outer join ( select job_db_inx, 
SUM(consumed_energy) 'sum_energy' from "monsoon_step_table" where id_step >= 0 
and consumed_energy != 18446744073709551614 group by job_db_inx ) step on 
job.job_db_inx=step.job_db_inx set job.tres_alloc=concat(job.tres_alloc, 
concat(',3=', case when step.sum_energy then step.sum_energy else 
18446744073709551614 END)) where job.tres_alloc != '' && job.tres_alloc not 
like '%,3=%':

The system is no slouch:

28 core, E5-2680 v4 2.4GHz
SSD
128GB memory

Does anyone have this issue, or a suggestion? This seems like a ridiculous 
amount of time to perform the upgrade! The database is healthy as far as I can 
see; there are no errors in the slurmdbd log, etc.

Let me know if you need more info!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167