Hi Danny,


   Thanks for the good ideas earlier this week!  Setting the slurmdbd debug 
level to 7 and logging to a file showed the query it was attempting but 
failing on...



select name, control_port from cluster_table where delete=0 && (name='xxx')



   Somehow the mysql database didn't contain any entry in cluster_table.  I put 
in an entry, and after the queue flushed we are now able to see start/end times 
for jobs.  There is quite a bit of missing data, but I can get it from the 
joblog or the sqlog (SLURM) database.  I believe we are good now and appreciate 
all of the help!
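(For anyone who hits the same missing-row problem: a sketch of the supported 
route, which is to register the cluster through sacctmgr rather than a manual 
INSERT, so slurmdbd fills in control_port and the other fields itself.  The 
cluster name below is a placeholder, not the actual ClusterName from this 
thread.)

```shell
# Sketch only: register a cluster via sacctmgr so slurmdbd creates the
# cluster_table row itself ("-i" skips the confirmation prompt).
# "mycluster" is a placeholder; use the ClusterName from slurm.conf.
cmd="sacctmgr -i add cluster mycluster"
echo "$cmd"   # printed rather than run, since no slurmdbd is assumed here
```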



-Aron



From: [email protected] [mailto:[email protected]] On 
Behalf Of Auble, Danny
Sent: Monday, March 07, 2011 10:36 AM
To: [email protected]
Subject: [slurm-dev] RE: mysql slurm_acct_db not recording start and end times



Aron,



Do you happen to know what was going on with your slurmctld when you saw these 
messages?  Usually the slurmdbd is down for some reason warranting the restart. 
 If the slurmdbd was up, that is the interesting part.



Do you know if there were a large number of job submissions during that period 
of time?  How long had the message been happening?  For future reference, you 
shouldn't have to do anything to the slurmctld when you see this problem.  
If the slurmdbd was up and running, nothing really needs to happen there 
either; you can probably wait it out, just monitoring that the count is 
dropping in slurmctld.log.  As for the missing times, I don't know what can be 
done to get those back.  The slurmctld starts notifying at around 5k stored 
messages and stops accepting new ones after 10k to avoid running out of memory.
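(The "monitor that the count is lowering" step can be done with a simple grep 
on slurmctld.log.  A sketch against a synthetic log file, since the exact 
wording of the counter line varies by version; only the "agent queue" phrasing 
quoted in this thread is assumed.)

```shell
# Sketch: count slurmdbd agent-queue messages in a (synthetic) slurmctld.log.
log=$(mktemp)
cat > "$log" <<'EOF'
error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
error: slurmdbd: agent queue is full, discarding request
EOF
grep -c 'agent queue' "$log"   # counts 2 here; rerun periodically on the real log
rm -f "$log"
```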



If your system is a very heavily loaded cluster, you might consider upgrading 
to 2.2, where the interface between the slurmctld and slurmdbd has improved 
dramatically over 2.1.  Also, you say the slurmdbd was logging job requests.  
I would consider lowering the debug level of your slurmdbd as well, since I am 
guessing quite a bit of other logging was happening too, seriously impacting 
your performance.  Especially if you are heavily loaded: if you notice 
something going on and want more debug output, you can just raise the debug 
level in slurmdbd.conf and send a SIGHUP to the slurmdbd process, forcing a 
reconfig that temporarily gives you a higher debug level without a restart.  
If installed from the RPMs, you can also run /etc/init.d/slurmdbd reconfig.
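(The reconfig dance might look like the sketch below.  It edits a throwaway 
copy of slurmdbd.conf so it can run anywhere; the real config path and the 
SIGHUP target are site-specific assumptions.)

```shell
# Sketch: raise slurmdbd's DebugLevel without restarting the daemon.
# Demonstrated on a throwaway config; on a real system you would edit
# slurmdbd.conf itself (often /etc/slurm/slurmdbd.conf, path may differ).
conf=$(mktemp)
printf 'DbdHost=localhost\nDebugLevel=3\n' > "$conf"

sed -i 's/^DebugLevel=.*/DebugLevel=7/' "$conf"   # bump the level in place
grep '^DebugLevel=' "$conf"                       # now shows DebugLevel=7

# Then tell the daemon to re-read its config (commented out; no daemon here):
# kill -HUP "$(pidof slurmdbd)"
rm -f "$conf"
```

Remember to lower the level and SIGHUP again once you have the output you need.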



On your restart question: it is usually better to restart the slurmdbd before 
the slurmctld, but it shouldn't really matter which one gets restarted first.



Your error message is caused by a job step reporting a node that appears to be 
out of range for that job step.  Did the error happen when you restarted the 
slurmctld?



Danny



From: [email protected] [mailto:[email protected]] On 
Behalf Of Warren, Aron
Sent: Friday, March 04, 2011 3:17 PM
To: '[email protected]'
Subject: [slurm-dev] mysql slurm_acct_db not recording start and end times



Hi,



   I've got slurm-2.1.15 running and logging to mysql.  We had these errors:



    error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW

    error: slurmdbd: agent queue is full, discarding request



   slurmdbd was running and logging job requests.  I restarted slurmdbd, then 
restarted slurmctld.  The connection appears to have been re-established: there 
are no errors in the logs, and the mysql database is being populated.  Pretty 
much every field is populated (JobId, submit time, time limit), but the start 
and end times are all zeros.  I did check the database for a job that started 
and finished with a COMPLETED code after I went through the above steps.  What 
could be the reason those times are not recorded?
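(One way to sanity-check whether the start/end times made it into the 
accounting database, without querying mysql directly, is sacct.  A sketch; the 
job id is a placeholder.)

```shell
# Sketch: read a job's recorded times back through sacct instead of raw SQL.
# 12345 is a placeholder job id.
jobid=12345
cmd="sacct -j $jobid --format=JobID,Start,End,State"
echo "$cmd"   # printed rather than run; requires a live slurmdbd to execute
```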



   Is it advised that slurmdbd be started before slurmctld?



   Also, what does this error in slurmctld.log mean:

        error: step_partial_comp: JobID=XXXX last=1, nodes=1





thanks!

-Aron




