Aron,

Do you happen to know what was going on with your slurmctld when you saw these 
messages?  Usually the slurmdbd is down for some reason warranting the restart. 
 If the slurmdbd was up, that is the interesting part.

Do you know if there were a large number of job submissions during that period 
of time?  How long had the message been happening?  For future reference, you 
shouldn’t have to do anything to the slurmctld when witnessing this problem.  
If the slurmdbd was up and running, nothing really needs to happen there 
either; you can probably wait it out, just monitoring that the count is 
lowering in the slurmctld.log.  As for the missing times, I don’t know what 
can be done to get those back.  The slurmctld starts notifying at around 5k 
stored messages and stops accepting more after 10k to avoid running out of 
memory.
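For that monitoring step, a small sketch like the following can pull the most 
recent agent-queue message out of the log so you can watch the reported count. 
The matched message text is an assumption based on the errors you pasted, and 
the real log path would be whatever SlurmctldLogFile in your slurm.conf points 
at.

```shell
# Print the most recent agent-queue message from a slurmctld log.
# The 'agent queue' match is an assumption based on the errors shown
# above; adjust the pattern if your log wording differs.
last_queue_msg() {
    grep 'agent queue' "$1" | tail -n 1
}
```

Run it periodically against your slurmctld.log and check that the situation 
is improving rather than the queue still filling.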

If yours is a very heavily loaded cluster, you might consider upgrading to 
2.2; the interface between the slurmctld and slurmdbd on such clusters has 
improved dramatically from 2.1 to 2.2.  Also, you say the slurmdbd was logging 
job requests.  I would consider lowering the debug level of your slurmdbd as 
well, since I am guessing quite a bit of other logging was happening, 
seriously impacting your performance, especially since you are heavily loaded. 
 If you notice something going on and want more debug, you can just update the 
debug level in the slurmdbd.conf and send a SIGHUP to the slurmdbd process, 
forcing a reconfig to temporarily get a higher debug level without restarting. 
 If installed from the rpms, you can also do a /etc/init.d/slurmdbd reconfig.
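As a concrete sketch of that step (the conf path and the chosen level here 
are assumptions; DebugLevel is the slurmdbd.conf parameter in question):

```shell
# Bump the slurmdbd debug level in place, then tell the daemon to
# re-read its config.  The conf path is taken as an argument so this is
# easy to try on a copy; on a real system it would typically be
# /etc/slurm/slurmdbd.conf.
set_dbd_debug() {
    conf=$1
    level=$2
    sed -i "s/^DebugLevel=.*/DebugLevel=${level}/" "$conf"
    # Then force the running slurmdbd to reconfigure:
    #   kill -HUP "$(pidof slurmdbd)"
    # or, if installed from the rpms:
    #   /etc/init.d/slurmdbd reconfig
}
```

Lower it back the same way once you have the debug output you need.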

On your restart question, it is usually a better idea to restart the slurmdbd 
before the slurmctld, but it shouldn’t really matter which one gets restarted 
first.

Your error message is caused by a job step reporting a node that appears to be 
out of range for that job step.  Did the error happen when you restarted the 
slurmctld?

Danny

From: [email protected] [mailto:[email protected]] On 
Behalf Of Warren, Aron
Sent: Friday, March 04, 2011 3:17 PM
To: '[email protected]'
Subject: [slurm-dev] mysql slurm_acct_db not recording start and end times

Hi,

   I've got slurm-2.1.15 running and logging to mysql.  We had this error:

    error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
    error: slurmdbd: agent queue is full, discarding request

   slurmdbd was running and logging job requests.  I restarted slurmdbd, then 
restarted slurmctld.  The connection appears to have been re-established: 
there are no errors in the logs and the mysql database is being populated.  
Pretty much every field is populated (JobId, Submit Time, timelimit), but the 
start and end times are all zeros.  I did check the database for a job that 
started and finished with a COMPLETED code after I went through the above 
steps.  What could be the reason for not recording those times?

   Is it advised that slurmdbd be started before slurmctld?

   Also what does this error mean in the slurmctld.log:
        error: step_partial_comp: JobID=XXXX last=1, nodes=1


thanks!
-Aron