Hi all,
Yesterday we had to restart the slurmdbd service on one of our clusters
running Slurm 17.11.7, because users were seeing errors with job scheduling
and when running 'sacct':
-----
$ sacct -X -p -o jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,AllocTRES%-50 -s R --allusers
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to captain1:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
-----
Looking in the logs post-restart, I see a large number of messages such as
these:
-----
[2019-05-07T07:35:17.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
-----
I read today's list message entitled "Slurm database failure messages".
Although that issue is different, it linked to a bug report about problems
with reservations. The report suggested gathering data with three commands;
their output from our cluster is shown here:
-----
root@captain1:/var/log# scontrol show reservations
ReservationName=res17-pc2 StartTime=2019-02-25T14:58:40 EndTime=2029-02-22T14:58:40 Duration=3650-00:00:00
   Nodes=res17-pc2 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=desktops Flags=SPEC_NODES
   TRES=cpu=12
   Users=samuel Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=res18-pc5 StartTime=2019-04-25T11:47:05 EndTime=2020-04-24T11:47:05 Duration=365-00:00:00
   Nodes=res18-pc5 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=12
   Users=grv Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

root@captain1:/var/log# sacctmgr show reservations
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
rescluster       res17-pc2                         cpu=12 2019-04-24T13:29:52 2029-02-22T14:58:40   0.000000

mysql> select * from rescluster_resv_table\G
*************************** 1. row ***************************
id_resv: 1
deleted: 1
assoclist: 12
flags: 65535
nodelist: res17-pc2,captain2,server13k,server15k,server25k
node_inx: 0-4
resv_name: res17-pc2
time_start: 1551135476
time_end: 1551135512
tres: 1=140
unused_wall: 36
*************************** 2. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551135520
time_end: 1551141705
tres: 1=12
unused_wall: 6176.5
*************************** 3. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551141705
time_end: 1551734095
tres: 1=12
unused_wall: 581590
*************************** 4. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551734095
time_end: 1551847812
tres: 1=12
unused_wall: 117173.666667
*************************** 5. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551847812
time_end: 1552353438
tres: 1=12
unused_wall: 480521
*************************** 6. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1552353438
time_end: 1554771615
tres: 1=12
unused_wall: 2367043
*************************** 7. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1554771615
time_end: 1556137792
tres: 1=12
unused_wall: 2006236
*************************** 8. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1556137792
time_end: 1866495520
tres: 1=12
unused_wall: 0
8 rows in set (0.00 sec)
-----
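As a quick cross-check of the output above, here is a minimal Python sketch that compares the (id, time_start) pair from the slurmdbd error against the rows in rescluster_resv_table. The variable and function names are my own and purely illustrative; the numbers are copied by hand from the log and the mysql output, and timestamps are rendered in UTC, so they will differ from the local times that scontrol and sacctmgr print.

```python
from datetime import datetime, timezone

# (id_resv, time_start) pairs copied from the rescluster_resv_table rows above.
resv_rows = [
    (1, 1551135476),
    (2, 1551135520),
    (2, 1551141705),
    (2, 1551734095),
    (2, 1551847812),
    (2, 1552353438),
    (2, 1554771615),
    (2, 1556137792),
]

# Values taken from the slurmdbd error message in the log excerpt.
missing_id, missing_start = 4, 1555628209


def to_utc(epoch):
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()


# Rows that match the error's (id, time_start) pair exactly.
matches = [row for row in resv_rows if row == (missing_id, missing_start)]
# Distinct reservation ids that actually exist in the table.
known_ids = sorted({resv_id for resv_id, _ in resv_rows})

print("error refers to id", missing_id, "starting", to_utc(missing_start))
print("ids present in table:", known_ids)
print("matching rows:", matches)
```

With the data above, the table only contains id_resv 1 and 2 and has no row for id 4, which is consistent with the error text. Similarly, res18-pc5 shows up in scontrol but not in sacctmgr; whether that reservation is the missing id 4 is a guess I cannot confirm from this output.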
So it seems to me that the reservation records in the database are out of
sync with what slurmctld reports; how should I go about fixing this?
Thanks in advance for any help provided...