That's all the SQL I have available, but I've now turned the debug level on slurmdbd up to the point where it's logging full SQL statements for everything, so let's see if we can catch it.
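For reference, that change was just in slurmdbd.conf, roughly the following. Treat it as a sketch: the exact level at which slurmdbd starts printing full statements is a guess on my part, and the log path is just wherever your slurmdbd.log already lives.

    # slurmdbd.conf -- crank logging up so slurmdbd prints the full
    # statements it sends to MySQL. "debug4" is an assumption about the
    # level needed; older releases may want a plain number here instead.
    DebugLevel=debug4
    LogFile=/var/log/slurm/slurmdbd.log

slurmdbd will need to be restarted to pick the change up.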
Having a closer look at that statement, we don't use GRES at all, so I'd
expect gres_req and gres_alloc to be empty.

On Thu, Mar 14, 2013 at 4:46 PM, Danny Auble <[email protected]> wrote:

> This is a slightly more serious issue than you think. If you can send me
> the exact sql being sent (Do you know where the wacky '��8' came from?)
> perhaps it will be clearer where the problem lies. With this situation
> your DBD will continue to back up information and new info will not be
> added to the database so it will appear out of sync.
>
>
> On 03/14/13 09:42, Chris Read wrote:
>
> We've just had a whole load of nodes fail and a lot of jobs get
> rescheduled, which recreated the problem of sacct not agreeing with
> squeue.
>
> Found a load of these in the slurmdbd.log:
>
> [2013-03-14T09:33:39-05:00] error: mysql_query failed: 1064 You have an
> error in your SQL syntax; check the manual that corresponds to your MySQL
> server version for the right syntax to use near '��8 ', 'drw', 'short',
> '**', '49') on duplicate key update job_db_inx=LAST_INSER' at line 1
> insert into "grid_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user,
> id_group, nodelist, id_resv, timelimit, time_eligible, time_submit,
> time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc,
> nodes_alloc, gres_req, gres_alloc, account, partition, wckey, node_inx)
> values (39641248, 139, 1, 0, 7045, 7045, 'n23', 0, 10, 1363271250,
> 1363271250, 1363271251, '<JOB_NAME>', 0, 3, 2549, 1, 1, 1, '��i�8 ',
> '�'��8 ', '<acct>', 'short', '**', '49') on duplicate key update
> job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=7045,
> id_group=7045, nodelist='n23', id_resv=0, timelimit=10,
> time_submit=1363271250, time_start=1363271251, job_name='<JOB_NAME>',
> track_steps=0, id_qos=1, state=greatest(state, 3), priority=2549,
> cpus_req=1, cpus_alloc=1, nodes_alloc=1, gres_req='��i�8 ',
> gres_alloc='�'��8 ', account='<acct>', partition='short', wckey='**',
> node_inx='49'
>
> So I'm guessing that's how the confusing data got in there in the first
> place.
>
>
> On Wed, Mar 13, 2013 at 8:43 PM, Danny Auble <[email protected]> wrote:
>
>> I would make sure these jobs weren't requeued. Knowing what the times
>> were of the entries in the database would be interesting as well. Any
>> information about the jobs in the slurmctld log would probably shed
>> light on the matter. Outside of being requeued I wouldn't ever
>> expect duplicates.
>>
>> Danny
>>
>> On 03/12/13 10:57, Chris Read wrote:
>>
>> Update:
>>
>> I've just cleaned things up by deleting the duplicates where state = 0
>> (PENDING). The correct state for the job is actually 7 (NODE_FAIL), not
>> CANCELLED as I stated above.
>>
>> No need to restart slurmdbd either...
>>
>> Chris
>>
>>
>> On Tue, Mar 12, 2013 at 4:28 PM, Chris Read <[email protected]> wrote:
>>
>>> Forgot to add another question:
>>>
>>> What's the correct way to clean this up? Just delete the record
>>> showing PENDING and restart slurmdbd?
>>>
>>>
>>> On Tue, Mar 12, 2013 at 4:26 PM, Chris Read <[email protected]> wrote:
>>>
>>>> Greetings...
>>>>
>>>> Just stumbled across some strange behaviour on the accounting side of
>>>> things: we have a collection of jobs that have duplicate records in the
>>>> grid_job_table. The visible symptoms of this are that sacct shows the
>>>> jobs as still pending when they are not.
>>>>
>>>> In all of the cases I can find:
>>>>
>>>> - there is no information available in the slurmdbd.log
>>>> - the slurmctld.log shows the jobs have been canceled
>>>> - the job_db_inx for the entry in PENDING state is > the job_db_inx for
>>>>   the entry in CANCELLED
>>>>
>>>> We're currently on 2.5.1.
>>>>
>>>> Anyone have any idea how these got there?
>>>>
>>>> Chris
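P.S. for anyone who finds this thread later: the cleanup I mentioned above ("deleting the duplicates where state = 0") boiled down to SQL along these lines, run directly against the accounting database. This is a sketch of what we did here, not anything official: the table name (grid_job_table) is our cluster's, the join on time_submit is just a guard against reused job ids, and you should obviously take a backup before deleting anything.

    -- List jobs that have more than one row in the job table, together
    -- with the row ids and states of the copies.
    SELECT id_job,
           COUNT(*) AS copies,
           GROUP_CONCAT(job_db_inx ORDER BY job_db_inx) AS row_ids,
           GROUP_CONCAT(state ORDER BY job_db_inx) AS states
    FROM grid_job_table
    GROUP BY id_job, time_submit
    HAVING copies > 1;

    -- Drop the later duplicate that is still marked PENDING (state = 0)
    -- whenever an earlier row (lower job_db_inx) already holds the real
    -- final state for the same job.
    DELETE p
    FROM grid_job_table AS p
    JOIN grid_job_table AS f
      ON f.id_job = p.id_job
     AND f.time_submit = p.time_submit
     AND f.job_db_inx < p.job_db_inx
    WHERE p.state = 0
      AND f.state != 0;

As noted above, no slurmdbd restart was needed afterwards.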
