One final (I think) update. I turned up debugging to debug4 on slurmdbd and restarted things.
service slurm stop
service slurmdbd stop
service slurmdbd start
service slurm start

Now with debug4 on (and no other changes at all) I get some other interesting error messages, but slurm is able to register the cluster correctly and instead of getting a malformed mysql query, I get the expected result.

[2012-11-09T09:29:06] DBD_JOB_COMPLETE: cluster not registered
[2012-11-09T09:29:06] debug3: 8(accounting_storage_mysql.c:2645) query update cluster_table set control_host='172.16.0.1', control_port=6817 where name='olympus';
[2012-11-09T09:29:06] debug2: DBD_JOB_COMPLETE: ID:1691563
[2012-11-09T09:29:06] debug2: as_mysql_slurmdb_job_complete() called
[2012-11-09T09:29:06] debug3: 8(as_mysql_job.c:761) query update "olympus_job_table" set time_end=1352482081, state=3, nodelist='node[0001-0008]', derived_ec=0, exit_code=0, kill_requid=-1 where job_db_inx=1695179;
[2012-11-09T09:29:06] debug2: DBD_REGISTER_CTLD: called for olympus(6817)
[2012-11-09T09:29:06] debug2: slurmctld at ip:172.16.0.1, port:6817
[2012-11-09T09:29:06] debug3: 8(as_mysql_cluster.c:407) query select name, control_port from cluster_table where deleted=0 && (name='olympus');

sshare is now being updated and everything seems happy again. I still lost about 3 weeks' worth of accounting information, but I can live with that. I'm a bit concerned that all I did was change the debug level.

Tim

On Fri, Nov 9, 2012 at 7:33 AM, Tim Carlson <[email protected]> wrote:
> I did a bit more digging to see if I could figure this out. My
> assumption is that I am missing a configuration parameter somewhere.
> So I was trying to find where the mysql query is being incorrectly
> formed and figured it must be in
>
> src/plugins/accounting_storage/mysql/accounting_storage_mysql.c
>
> But I can't figure out where this query would be getting put together.
> There are bits and pieces of the query being formed, but I can't pin
> down the exact line with the error.
>
> Like I said, it is really strange because the mysql tables are being
> updated as jobs run and I can query all the past jobs but the sshare
> information is not changing.
>
> Tim
>
> On Thu, Nov 8, 2012 at 2:57 PM, Tim Carlson <[email protected]> wrote:
>> Just upgraded SLURM on our cluster from 2.2.7 to 2.4.3 and now realized
>> (a couple of weeks later) that my accounting for jobs is broken.
>>
>> I use
>>
>> AccountingStorageType=accounting_storage/slurmdbd
>>
>> And I see this in the logs when starting up slurmdbd
>>
>> [2012-11-08T14:43:09] DBD_JOB_COMPLETE: cluster not registered
>> [2012-11-08T14:43:09] error: accounting_storage_mysql.c:2612 no cluster name
>> [2012-11-08T14:43:10] error: mysql_query failed: 1064 You have an
>> error in your SQL syntax; check the manual that corresponds to your
>> MySQL server version for the right syntax to use near ')' at line 1
>> select name, control_port from cluster_table where deleted=0 && ();
>> [2012-11-08T14:43:10] error: no result given for where deleted=0 && ()
>> [2012-11-08T14:43:10] error: Processing last message from connection
>> 10(172.16.0.1) uid(500)
>> [2012-11-08T14:43:10] error: We should have gotten a new id: Table
>> 'slurm_acct_db.(null)_job_table' doesn't exist
>> [2012-11-08T14:43:10] error: It looks like the storage has gone away
>> trying to reconnect
>> [2012-11-08T14:43:10] error: We should have gotten a new id: Table
>> 'slurm_acct_db.(null)_job_table' doesn't exist
>> [2012-11-08T14:43:10] DBD_JOB_START: cluster not registered
>>
>> Mysql is the backend database type. It seems like I missed a step when
>> upgrading from 2.2.7 to 2.4.3 but I can't figure out what it would be.
>>
>> sacctmgr seems to think the cluster is registered
>>
>> # sacctmgr list cluster
>>    Cluster     ControlHost  ControlPort   RPC     Share GrpJobs GrpNodes
>> GrpSubmit MaxJobs MaxNodes MaxSubmit     MaxWall                  QOS
>>   Def QOS
>> ---------- --------------- ------------ ----- --------- ------- --------
>> --------- ------- -------- --------- ----------- --------------------
>> ---------
>>    olympus      172.16.0.1         6817    10         1
>>                                                                normal
>>
>>
>> The queue runs just fine and sacct shows me all the jobs that have run but
>> I'm not getting any updates to sshare which I use for accounting
>> purposes with sbank. Any ideas?
>>
>> Thanks
>>
>> Tim
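
For illustration only: the two logged queries differ just in what sits inside the parentheses, `()` in the failing case and `(name='olympus')` in the working one. That is consistent with a where clause assembled from a list of cluster names that happened to be empty when "no cluster name" was logged. The sketch below is Python, not Slurm's actual C code, and the helper names `build_cluster_query` and `build_cluster_query_checked` are hypothetical.

```python
def build_cluster_query(cluster_names):
    """Hypothetical helper: build a query shaped like the one in the slurmdbd log."""
    # One name='...' condition per cluster, joined with "||".
    # An empty list yields an empty string, which leaves "()" in the query.
    conditions = " || ".join("name='{}'".format(n) for n in cluster_names)
    return ("select name, control_port from cluster_table "
            "where deleted=0 && ({});".format(conditions))


def build_cluster_query_checked(cluster_names):
    """Same query, but refuse to build it when no cluster name is known."""
    if not cluster_names:
        raise ValueError("no cluster name")
    return build_cluster_query(cluster_names)


if __name__ == "__main__":
    print(build_cluster_query(["olympus"]))  # shape of the working query
    print(build_cluster_query([]))           # shape of the malformed query
```

With an empty list the first helper produces the `where deleted=0 && ()` text that MySQL rejects with error 1064, while the checked variant fails early instead. Whether slurmdbd builds its query this way is an assumption, not something the logs confirm.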
