Hi Danny,
Thanks for the good ideas earlier this week! A higher debug level on slurmdbd (set to 7 and logged to a file) showed the query it was trying to perform but failing on:

    select name, control_port from cluster_table where delete=0 && (name='xxx')

Somehow the mysql database didn't contain any entry in cluster_table. I put in an entry, and after the queue flushed we are now able to see start/end times for jobs. There is quite a bit of missing data, but I can get it from the joblog or the sqlog (slurm) database. I believe we are good now, and I appreciate all of the help!

-Aron

From: [email protected] [mailto:[email protected]] On Behalf Of Auble, Danny
Sent: Monday, March 07, 2011 10:36 AM
To: [email protected]
Subject: [slurm-dev] RE: mysql slurm_acct_db not recording start and end times

Aron,

Do you happen to know what was going on with your slurmctld when you saw these messages? Usually the slurmdbd is down for some reason, warranting the restart. If the slurmdbd was up, that is the interesting part. Do you know if there were a large number of job submissions during that period of time? How long had the message been happening?

For future reference, you shouldn't have to do anything to the slurmctld when witnessing this problem. If the slurmdbd was up and running, nothing really needs to happen there either; you can probably wait it out, just monitor that the count is lowering in the slurmctld.log. As for the missing times, I don't know what can be done to get those back. The slurmctld starts notifying at around 5k stored messages and doesn't accept any more after 10k, to avoid running out of memory. If your system is a very heavily loaded cluster, you might consider upgrading to 2.2, where the interface between the slurmctld and slurmdbd for such clusters has improved dramatically from 2.1.

Also, you say the slurmdbd was logging job requests.
I would consider lowering the debug level of your slurmdbd as well, since I am guessing quite a bit of other logging was happening, seriously impacting your performance, especially if you are heavily loaded. If you notice something going on and want more debug, you can just update the debug level in slurmdbd.conf and send a SIGHUP to the slurmdbd process, forcing a reconfig, to temporarily get a higher debug level without restarting. If installed from the rpms, you can also do a /etc/init.d/slurmdbd reconfig.

On your restart question, it is usually a better idea to restart the slurmdbd before the slurmctld, but it shouldn't really matter which one gets restarted first.

Your error message is caused by a job step reporting a node that appears to be out of range of the actual job step. Did the error happen when you restarted the slurmctld?

Danny

From: [email protected] [mailto:[email protected]] On Behalf Of Warren, Aron
Sent: Friday, March 04, 2011 3:17 PM
To: '[email protected]'
Subject: [slurm-dev] mysql slurm_acct_db not recording start and end times

Hi,

I've got slurm-2.1.15 running and logging to mysql. We had this error:

    error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
    error: slurmdbd: agent queue is full, discarding request

slurmdbd was running and logging job requests. I restarted slurmdbd, then restarted slurmctld. The connection appears to have been re-established: there are no errors in the logs, and the mysql database is being populated. Pretty much every field is being populated (JobId, Submit Time, timelimit), but the start and end times are all zeros. I did check the database for a job that started and finished with a COMPLETED code after I went through the above steps.

What could be the reason for not recording those times? Is it advised that slurmdbd be started before slurmctld? Also, what does this error mean in the slurmctld.log:

    error: step_partial_comp: JobID=XXXX last=1, nodes=1

thanks!
-Aron
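[Editor's note] For readers hitting the same problem, the two database checks discussed in this thread can be sketched as follows. This is only a sketch: the database name, user, and the table/column names beyond those quoted in the logged query are assumptions based on the stock slurm_acct_db schema, and `mycluster`/`NNNN` are placeholders.

```shell
# Check that the cluster is registered (Aron's missing-row problem).
# 'deleted' is assumed to be the column name in the stock schema;
# 'mycluster' is a placeholder for your ClusterName.
mysql -u slurm -p slurm_acct_db \
  -e "SELECT name, control_port FROM cluster_table WHERE deleted=0 AND name='mycluster';"

# Registering a missing cluster through sacctmgr is safer than a raw INSERT,
# since it fills in the bookkeeping columns for you:
sacctmgr add cluster mycluster

# Spot-check start/end times for one job (a single pre-2.2 job_table is an
# assumption; replace NNNN with a real job id). All-zero times here reproduce
# the symptom Aron reports.
mysql -u slurm -p slurm_acct_db \
  -e "SELECT jobid, submit, start, end FROM job_table WHERE jobid=NNNN;"
```

These commands need a live accounting database and the slurm MySQL credentials, so treat them as a starting point rather than a recipe.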
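[Editor's note] Danny's advice to "monitor that the count is lowering in the slurmctld.log" amounts to watching for the agent-queue messages. A minimal sketch, using a self-contained sample log in place of the real slurmctld.log (the message text is copied from the errors quoted in this thread; adjust the pattern to whatever your version actually logs):

```shell
# Build a sample log standing in for /var/log/slurmctld.log, then count
# the agent-queue warnings in it.
cat > /tmp/slurmctld.sample.log <<'EOF'
error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
error: slurmdbd: agent queue is full, discarding request
error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
EOF
grep -c 'agent queue' /tmp/slurmctld.sample.log
# -> 3
```

On a live system you would point the same pattern at the real log, e.g. `tail -f /var/log/slurmctld.log | grep 'agent queue'`, and watch for the warnings to stop as the queue drains.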
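[Editor's note] The temporary-debug trick Danny describes (edit slurmdbd.conf, then SIGHUP) looks roughly like this; the log path and the DebugLevel value are examples, not prescriptions:

```shell
# 1. Raise DebugLevel in slurmdbd.conf (7 is the level Aron used), e.g.:
#      DebugLevel=7
#      LogFile=/var/log/slurmdbd.log
# 2. Tell the running daemon to re-read its config without a restart:
kill -HUP "$(pidof slurmdbd)"
# RPM installs can use the init script instead:
#      /etc/init.d/slurmdbd reconfig
# When finished debugging, lower DebugLevel again and send another SIGHUP.
```

If you do end up restarting instead, Danny's suggested order is slurmdbd first, then slurmctld.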
