Hello Danny

I have now found a small setup to reproduce this problem. We have a test
installation with one head node and three computing nodes. The computing
nodes are configured with the option shared=YES to increase the
throughput. I am using a simple "hello world" program for testing. This
program is started via an sbatch script (with option --share), and the
sbatch script itself is submitted several thousand times by a shell
script. After a few hundred jobs the problem occurs.
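
A minimal sketch of such a submit loop follows. It is an illustration and not the exact script used here: the function name submit_many, the SBATCH override variable, and the file name hello.sh are assumptions, while --share, the partition "test", and the job name "hello_world" come from the setup described in this mail.

```shell
#!/bin/bash
# Hypothetical reproduction driver (sketch only; names are assumptions).
# SBATCH can be overridden, e.g. SBATCH=echo, to do a dry run without Slurm.
SBATCH="${SBATCH:-sbatch}"

# Submit the same batch script <count> times with --share.
# Stops at the first failed submission and returns non-zero.
submit_many() {
    local count="$1" script="$2" i
    for ((i = 1; i <= count; i++)); do
        "$SBATCH" --share --partition=test --job-name=hello_world "$script" || return 1
    done
}
```

Called as "submit_many 5000 hello.sh" this queues the same job 5000 times. With SBATCH=echo it only prints the command arguments, which is handy for checking the loop before pointing it at the test cluster.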

Error message from slurmdbd.log:
insert into "loewe_test_job_table" (id_job, id_assoc, id_qos, id_wckey,
id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
time_submit, time_start, job_name, track_steps, state, priority,
cpus_req, cpus_alloc, nodes_alloc, account, partition, node_inx) values
(9559, 4, 1, 0, 553, 516, '0?^U\<AC>*', 0, 15, 1329831643, 1329831643,
1329831643, 'hello_wo<90>^B', 0, 3, 1009236, 1, 1, 1, 'staff', '<E0>'',
'1') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx),
id_wckey=0, id_user=553, id_group=516, nodelist='0?^U\<AC>*', id_resv=0,
timelimit=15, time_submit=1329831643, time_start=1329831643,
job_name='hello_wo<90>^B', track_steps=0, id_qos=1,
state=greatest(state, 3), priority=1009236, cpus_req=1, cpus_alloc=1,
nodes_alloc=1, account='staff', partition='<E0>'', node_inx='1'

All jobs have the name "hello_world" and were submitted to partition "test".

Tibor

On 02.02.2012 19:12, Danny Auble wrote:
> I don't know what is going on then.  If you look at the SQL it would
> appear all sorts of interesting characters are in there.  I thought
> your nodelist was very interesting as well.  Job name was also very
> interesting.
>
> Perhaps you should look at why those names are coming through.  A '
> in a job name is already handled.  I don't know how the partition name
> could be messed with though.
>
> Danny
>
> On 02/02/12 10:05, Tibor Pausz wrote:
>> Hello Danny,
>> we don't have any ' in the cluster name, in any partition name, or
>> anywhere else!
>> I don't know whether some users have submitted jobs where ' is part of
>> the job name, but we have not used any special character in our
>> config.
>>
>> Best regards,
>> Tibor
>>
>> On Thu, Feb 02, 2012 at 08:51:20AM -0800, Danny Auble wrote:
>>> Tibor, the problem comes from the ' in your partition name.  Up to
>>> this time I don't think anyone has ever done that.  I am not sure
>>> what other problems might arise from that name either.  But if this
>>> is the first time you have seen it this might be one of the only
>>> problems it causes.
>>>
>>> To get around it in the code you can call
>>> slurm_add_slash_to_quotes() as done elsewhere in the code on the
>>> partition name in src/plugins/mysql/as_mysql_job.c, pretty much
>>> wherever you see the partition name.  It might need to be done
>>> elsewhere as well in the mysql plugin.  Let us know how it goes.
>>> Don't forget to xfree the new partition name after its use.
>>>
>>> Danny
>>>
>>> On 02/02/12 01:06, Tibor Pausz wrote:
>>>> Hello!
>>>>
>>>> we have trouble with slurmdbd (version 2.3.0-2) in combination
>>>> with MySQL accounting. We get several entries per hour in
>>>> slurmdbd.log with the same kind of error (see below). After some
>>>> time slurmdbd gets stuck.
>>>>
>>>> error: mysql_query failed: 1064 You have an error in your SQL syntax;
>>>> check the manual that corresponds to your MySQL server version for the
>>>> right syntax to use near '<A4><AC>*', '744') on duplicate key update
>>>> job_db_inx=LAST_INSERT_ID(job_db_inx), id_w' at line 1
>>>> insert into "loewe_job_table" (id_job, id_assoc, id_qos, id_wckey,
>>>> id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
>>>> time_submit, time_start, job_name, track_steps, state, priority,
>>>> cpus_req, cpus_alloc, nodes_alloc, account, partition, node_inx)
>>>> values
>>>> (2290165, 463, 3, 0, 536, 525, '<A0>+^F<A4><AC>*', 0, 1440,
>>>> 1327054146,
>>>> 1327054146, 1327054146, '0^N<A4><AC>*', 0, 5, 100006, 1, 1, 1,
>>>> 'tomograpp','<D0>'<A4><AC>*', '744') on duplicate key update
>>>> job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=536,
>>>> id_group=525, nodelist='<A0>+^F<A4><AC>*', id_resv=0, timelimit=1440,
>>>> time_submit=1327054146, time_start=1327054146,
>>>> job_name='0^N<A4><AC>*',
>>>> track_steps=0, id_qos=3, state=greatest(state, 5), priority=100006,
>>>> cpus_req=1, cpus_alloc=1, nodes_alloc=1, account='tomograpp',
>>>> partition='
>>>> <D0>'<A4><AC>*', node_inx='744'
>>>>
>>>>
>>>> The slurmctld.log contains
>>>> error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
>>>> …
>>>>
