Re: [slurm-users] User id inconsistency

2021-04-19 Thread Dustin Lang
Sorry for my confusion, I shouldn't try to write emails before coffee!


On Mon, Apr 19, 2021 at 7:43 AM Bruno Gomes Pessanha <
bruno.pessa...@gmail.com> wrote:

> That is showing that I'm in different groups depending on how I run the
> command id.
>
> PS: I'm running the controller and workers in docker containers using
> privileged mode.
>
> Bruno
>
> On Mon, 19 Apr 2021 at 13:24, Dustin Lang  wrote:
>
>> This is telling you you're root in the docker container, right?
>>
>>
>>
>> On Mon, Apr 19, 2021 at 4:51 AM Bruno Gomes Pessanha <
>> bruno.pessa...@gmail.com> wrote:
>>
>>> Could somebody help me with this?
>>> Pretty strange behaviour. Running "id" shows different groups than
>>> running "id myuser":
>>>
>>> [root@ctrl-slurm /]# srun --pty -p local --uid myuser bash
>>>
>>> [myuser@node-slurm /]$ id
>>> uid=868295925(myuser) gid=0(root) groups=0(root),979(cgred)
>>>
>>> [myuser@node-slurm /]$ id myuser
>>> uid=868295925(myuser) gid=1001(myuser) groups=1001(myuser),978(docker)
>>>
>>> --
>>> Bruno
>>>
>>
>
> --
> Bruno Gomes Pessanha
>


Re: [slurm-users] User id inconsistency

2021-04-19 Thread Dustin Lang
This is telling you you're root in the docker container, right?



On Mon, Apr 19, 2021 at 4:51 AM Bruno Gomes Pessanha <
bruno.pessa...@gmail.com> wrote:

> Could somebody help me with this?
> Pretty strange behaviour. Running "id" shows different groups than running
> "id myuser":
>
> [root@ctrl-slurm /]# srun --pty -p local --uid myuser bash
>
> [myuser@node-slurm /]$ id
> uid=868295925(myuser) gid=0(root) groups=0(root),979(cgred)
>
> [myuser@node-slurm /]$ id myuser
> uid=868295925(myuser) gid=1001(myuser) groups=1001(myuser),978(docker)
>
> --
> Bruno
>
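
For context on the two outputs quoted above: plain "id" reports the credentials attached to the calling process, while "id myuser" looks the user up in the passwd and group databases. The Python sketch below is not from the thread, and its function names are illustrative assumptions. It compares the same two views and only illustrates the distinction; it does not explain why the shell started by srun kept gid 0.

```python
import os
import pwd


def process_identity():
    """Credentials attached to the running process (what plain `id` reports)."""
    return {
        "uid": os.getuid(),
        "gid": os.getgid(),
        "groups": sorted(os.getgroups()),
    }


def database_identity(username):
    """Credentials recorded in the user databases (what `id USER` reports)."""
    entry = pwd.getpwnam(username)
    groups = os.getgrouplist(entry.pw_name, entry.pw_gid)
    return {
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "groups": sorted(set(groups)),
    }


def diff_identity(proc, db):
    """Return the fields on which the two views disagree, as (process, database)."""
    return {key: (proc[key], db[key]) for key in proc if proc[key] != db[key]}


if __name__ == "__main__":
    try:
        name = pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        name = None  # the uid has no passwd entry, which can happen in containers
    if name is not None:
        print(diff_identity(process_identity(), database_identity(name)))
```

Run inside the job shell, an empty result would mean the process credentials match the database; with the values quoted above, the gid and the supplementary groups would both be reported as differing.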


Re: [slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-07 Thread Dustin Lang
According to a very quick web search, migrating from MySQL to MariaDB is
(very) easy.  Does anyone have any counter-experience with Slurm databases?

Thanks,
--dustin


On Thu, May 7, 2020 at 1:34 PM Christopher Samuel  wrote:

> On 5/7/20 6:08 AM, Riebs, Andy wrote:
>
> > Alternatively, you could switch to MariaDB; I've been using that for
> > years.
>
> Debian switched to only having MariaDB in 2017 with the release of
> Debian 9 (Stretch). Since Ubuntu is a derivative distro, I'm surprised that
> it still packages MySQL.
>
> I'd second Andy's suggestion.
>
> All the best,
> Chris
> --
>Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


[slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-06 Thread Dustin Lang
Hi,

Ubuntu has made MySQL 5.7.30 the default version.  At least with Ubuntu
16.04, this causes severe problems with slurmdbd (v 17.x, 18.x, and 19.x;
not sure about 20).  Reverting to MySQL 5.7.29 seems to make everything
work okay again.

cheers,
--dustin
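
A minimal sketch for checking whether a host is on the release named in the report above. It is not from the thread: the function names are hypothetical, and it assumes the banner printed by "mysql --version" contains a dotted three-part version such as "Distrib 5.7.30". Only 5.7.30 is flagged, because that is the only release the report identifies as problematic.

```python
import re

# The only release the report above names as causing problems.
REPORTED_BAD = (5, 7, 30)


def parse_mysql_version(banner):
    """Extract the first dotted three-part version from a version banner.

    Returns a tuple of ints such as (5, 7, 30), or None if nothing matches.
    """
    match = re.search(r"\b(\d+)\.(\d+)\.(\d+)\b", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_reported_bad(version):
    """True only for the exact release named in the report."""
    return version == REPORTED_BAD


if __name__ == "__main__":
    # Assumed example banner; paste the real output of `mysql --version` here.
    sample = "mysql  Ver 14.14 Distrib 5.7.30, for Linux (x86_64)"
    version = parse_mysql_version(sample)
    print(version, is_reported_bad(version))
```

The check is deliberately an exact match rather than a range comparison, since the thread says nothing about releases after 5.7.30.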


[slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-05 Thread Dustin Lang
Hi,

I've just upgraded to slurm 19.05.5.

With either my old database or an entirely new database, I am unable to
create a new 'cluster' entry in the database -- slurmdbd is segfaulting!

# sacctmgr add cluster test3
 Adding Cluster(s)
  Name   = test3
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to mn001:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ADD_CLUSTERS
 Problem adding clusters: Unspecified error
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused

Meanwhile, running "slurmdbd -D -v -v -v -v -v", I see

[2020-05-05T18:17:19.503] debug4: 10(as_mysql_cluster.c:405) query
insert into txn_table (timestamp, action, name, actor, info) values
(1588717037, 1405, 'test3', 'root', 'mod_time=1588717037, shares=1,
grp_jobs=NULL, grp_jobs_accrue=NULL, grp_submit_jobs=NULL, grp_wall=NULL,
max_jobs=NULL, max_jobs_accrue=NULL, min_prio_thresh=NULL,
max_submit_jobs=NULL, max_wall_pj=NULL, priority=NULL, def_qos_id=NULL,
qos=\',1,\', federation=\'\', fed_id=0, fed_state=0, features=\'\'');
slurmdbd: debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
slurmdbd: debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
Segmentation fault (core dumped)


Since this happens on a fresh new database, I just don't understand how I
can get back to a basic functional state.  This is exceedingly frustrating.

Thanks for any hints.

--dustin


Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation
Fault!




Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
Hi,

Apparently my colleague upgraded the mysql client and server, but, as far
as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql
release notes I don't see anything that looks suspicious there...

cheers,
--dustin



[slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
Hi,

We're running Slurm 17.11.12.  Everything had been working fine, and then
suddenly both slurmctld and slurmdbd started crashing.

We use fair-share as part of the queuing policy, and previously set up
accounts with sacctmgr; that has been working fine for months.

If I run slurmdbd in debug mode,

 slurmdbd -D -v -v -v -v -v

it eventually (after being contacted by slurmctld) segfaults with:

...
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_TRES: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_QOS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_USERS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_ASSOCS: called
slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select
@par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos,
@delta_qos;
Segmentation fault (core dumped)


It looks (running slurmdbd in gdb) like that segfault is coming from

https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073

and if I connect to the mysql database directly and call that stored
procedure, I get

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
| column | value |
| @par_id := id_assoc | 1 |
| @mj := max_jobs | NULL |
| @msj := max_submit_jobs | NULL |
| @mwpj := max_wall_pj | NULL |
| @def_qos_id := def_qos_id | NULL |
| @qos := qos | ,1, |
| @delta_qos := REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | NULL |
| @mtpj := CONCAT(@mtpj, if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | NULL |
| @mtpn := CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn) | NULL |
| @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', ''), max_tres_mins_pj) | NULL |
| @mtrm := CONCAT(@mtrm, if (@mtrm != '' && max_tres_run_mins != '', ',', ''), max_tres_run_mins) | NULL |
| @my_acct_new := parent_acct | |

and if I run

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos;

I get

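+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj  | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|       1 | NULL | NULL |  NULL | NULL  | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |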