So it looks like sshare was updating information for about a day and
then stopped. My trick of restarting slurm and slurmdbd doesn't seem to
be working now.
Again, the problem I have is that slurmdbd seems to be working fine
and I can query the stats on old jobs but there is no change in
sshare. For example, before I run a job I have this:
[tim@olympus ~]$ sshare -u tim | egrep "Shares|ops"
Account  User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
ops                     1     0.008475  9667831189       0.037213   0.047656
ops      tim            1     0.000530  5186340442       0.021041   0.000000
[tim@olympus ~]$ sbatch sbatch.test
Submitted batch job 1732513
sbatch.test is a script that gets 100 nodes (3200 cores) and sleeps
for 60 seconds.
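For reference, a script of that shape could look roughly like the sketch
below. This is a reconstruction, not the actual file: the 32 tasks per
node is inferred from 3200 cores over 100 nodes, the account and
partition names are taken from the command output in this message, the
time limit is an assumption, and the file name sbatch.test.sketch is
hypothetical.

```shell
# Write a stand-in for sbatch.test (hypothetical contents, see the
# assumptions listed above).
cat > sbatch.test.sketch <<'EOF'
#!/bin/bash
# 100 nodes, as described in the message.
#SBATCH --nodes=100
# Assumed: 3200 cores / 100 nodes = 32 tasks per node.
#SBATCH --ntasks-per-node=32
# Assumed: the account and partition seen in the sshare/squeue output.
#SBATCH --account=ops
#SBATCH --partition=pal
# Assumed time limit, comfortably above the 60 second sleep.
#SBATCH --time=00:05:00

sleep 60
EOF
```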
[tim@olympus ~]$ squeue -u tim
JOBID    PARTITION  NAME      USER  ST  TIME  NODES  NODELIST(REASON)
1732513  pal        sbatch.t  tim   R   0:45  100    node[0162-0179,0407-0488]
After it runs, sacct is well aware of the stats for this job:
[tim@olympus ~]$ sacct -j 1732513 -A ops --format="nnodes,elapsed"
NNodes Elapsed
-------- ----------
100 00:01:00
1 00:01:00
Yet sshare has not updated any information:
[tim@olympus ~]$ sshare -u tim | egrep "Shares|ops"
Account  User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
ops                     1     0.008475  9667831189       0.037213   0.047656
ops      tim            1     0.000530  5186340442       0.021041   0.000000
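Rather than eyeballing the columns, the before/after comparison can be
scripted by pulling the Raw Usage value out of the sshare output. A
minimal sketch, assuming the default seven-column layout shown above
(Account, User, Raw Shares, Norm Shares, Raw Usage, Effectv Usage,
FairShare); the helper names raw_usage and usage_changed are made up
for illustration.

```shell
#!/bin/bash

# Print the Raw Usage field for one account/user row of default
# `sshare` output read from stdin. Assumes seven whitespace-separated
# fields per user row, so Raw Usage is field 5.
raw_usage() {
    awk -v acct="$1" -v user="$2" \
        '$1 == acct && $2 == user { print $5 }'
}

# Report whether two captured Raw Usage values differ.
usage_changed() {
    if [ "$1" = "$2" ]; then
        echo "unchanged"
    else
        echo "changed"
    fi
}

# Intended use (not run here, needs a live cluster):
#   before=$(sshare -u tim | raw_usage ops tim)
#   ... submit the job and wait for it to finish ...
#   after=$(sshare -u tim | raw_usage ops tim)
#   usage_changed "$before" "$after"
```

With the two outputs in this message, both captures would be
5186340442, so the helper would report "unchanged".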
I'm stumped... Should have brought this up in the BoF ;-) Any hints
would be appreciated.
Thanks
Tim
On Fri, Nov 9, 2012 at 10:28 AM, Tim Carlson <[email protected]> wrote:
> One final (I think) update. I turned up debugging to debug4 on
> slurmdbd and restarted things.
>
> service slurm stop
> service slurmdbd stop
> service slurmdbd start
> service slurm start
>
> Now with debug4 on (and no other changes at all) I get some other
> interesting error messages, but slurm is able to register the cluster
> correctly and instead of getting a malformed mysql query, I get the
> expected result.
>
> [2012-11-09T09:29:06] DBD_JOB_COMPLETE: cluster not registered
> [2012-11-09T09:29:06] debug3: 8(accounting_storage_mysql.c:2645) query
> update cluster_table set control_host='172.16.0.1', control_port=6817
> where name='olympus';
> [2012-11-09T09:29:06] debug2: DBD_JOB_COMPLETE: ID:1691563
> [2012-11-09T09:29:06] debug2: as_mysql_slurmdb_job_complete() called
> [2012-11-09T09:29:06] debug3: 8(as_mysql_job.c:761) query
> update "olympus_job_table" set time_end=1352482081, state=3,
> nodelist='node[0001-0008]', derived_ec=0, exit_code=0, kill_requid=-1
> where job_db_inx=1695179;
> [2012-11-09T09:29:06] debug2: DBD_REGISTER_CTLD: called for olympus(6817)
> [2012-11-09T09:29:06] debug2: slurmctld at ip:172.16.0.1, port:6817
> [2012-11-09T09:29:06] debug3: 8(as_mysql_cluster.c:407) query
> select name, control_port from cluster_table where deleted=0 &&
> (name='olympus');
>
> sshare is now being updated and everything seems happy again. I still
> lost about 3 weeks worth of accounting information, but I can live
> with that.
>
> I'm a bit concerned that all I did was change the debug level.
>
> Tim
>
> On Fri, Nov 9, 2012 at 7:33 AM, Tim Carlson <[email protected]> wrote:
>> I did a bit more digging to see if I could figure this out. My
>> assumption is that I am missing a configuration parameter somewhere.
>> So I was trying to find where the mysql query is being incorrectly
>> formed and figured it must be in
>>
>> src/plugins/accounting_storage/mysql/accounting_storage_mysql.c
>>
>> But I can't figure out where this query would be getting put together.
>> There are bits and pieces of the query being formed, but I can't pin
>> down the exact line with the error.
>>
>> Like I said, it is really strange because the mysql tables are being
>> updated as jobs run and I can query all the past jobs, but the sshare
>> information is not changing.
>>
>> Tim
>>
>> On Thu, Nov 8, 2012 at 2:57 PM, Tim Carlson <[email protected]> wrote:
>>> Just upgraded SLURM on our cluster from 2.2.7 to 2.4.3 and now realized
>>> (a couple of weeks later) that my accounting for jobs is broken.
>>>
>>> I use
>>>
>>> AccountingStorageType=accounting_storage/slurmdbd
>>>
>>> And I see this in the logs when starting up slurmdbd
>>>
>>> [2012-11-08T14:43:09] DBD_JOB_COMPLETE: cluster not registered
>>> [2012-11-08T14:43:09] error: accounting_storage_mysql.c:2612 no cluster name
>>> [2012-11-08T14:43:10] error: mysql_query failed: 1064 You have an
>>> error in your SQL syntax; check the manual that corresponds to your
>>> MySQL server version for the right syntax to use near ')' at line 1
>>> select name, control_port from cluster_table where deleted=0 && ();
>>> [2012-11-08T14:43:10] error: no result given for where deleted=0 && ()
>>> [2012-11-08T14:43:10] error: Processing last message from connection
>>> 10(172.16.0.1) uid(500)
>>> [2012-11-08T14:43:10] error: We should have gotten a new id: Table
>>> 'slurm_acct_db.(null)_job_table' doesn't exist
>>> [2012-11-08T14:43:10] error: It looks like the storage has gone away
>>> trying to reconnect
>>> [2012-11-08T14:43:10] error: We should have gotten a new id: Table
>>> 'slurm_acct_db.(null)_job_table' doesn't exist
>>> [2012-11-08T14:43:10] DBD_JOB_START: cluster not registered
>>>
>>> Mysql is the backend database type. It seems like I missed a step when
>>> upgrading from 2.2.7 to 2.4.3 but I can't figure out what it would be.
>>>
>>> sacctmgr seems to think the cluster is registered
>>>
>>> # sacctmgr list cluster
>>> Cluster ControlHost ControlPort RPC Share GrpJobs GrpNodes
>>> GrpSubmit MaxJobs MaxNodes MaxSubmit MaxWall QOS
>>> Def QOS
>>> ---------- --------------- ------------ --- --------- ------- --------
>>> --------- ------- -------- --------- ----------- --------------------
>>> ---------
>>> olympus 172.16.0.1 6817 10 1
>>> normal
>>>
>>> The queue runs just fine and sacct shows me all the jobs that have run but
>>> I'm not getting any updates to sshare which I use for accounting
>>> purposes with sbank. Any ideas?
>>>
>>> Thanks
>>>
>>> Tim