Hello Danny,

I have now found a small setup that reproduces this problem. We have a test installation with one head node and three compute nodes. The compute nodes are configured with the option shared=YES to increase throughput. For testing I use a simple "hello world" program. The program is started by an sbatch script (with the option --share), and a shell script submits that sbatch script several thousand times. After a few hundred jobs the problem occurs.

Error message from slurmdbd.log:

insert into "loewe_test_job_table" (id_job, id_assoc, id_qos, id_wckey,
id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
time_submit, time_start, job_name, track_steps, state, priority,
cpus_req, cpus_alloc, nodes_alloc, account, partition, node_inx) values
(9559, 4, 1, 0, 553, 516, '0?^U\<AC>*', 0, 15, 1329831643, 1329831643,
1329831643, 'hello_wo<90>^B', 0, 3, 1009236, 1, 1, 1, 'staff', '<E0>'',
'1') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx),
id_wckey=0, id_user=553, id_group=516, nodelist='0?^U\<AC>*', id_resv=0,
timelimit=15, time_submit=1329831643, time_start=1329831643,
job_name='hello_wo<90>^B', track_steps=0, id_qos=1,
state=greatest(state, 3), priority=1009236, cpus_req=1, cpus_alloc=1,
nodes_alloc=1, account='staff', partition='<E0>'', node_inx='1'

All jobs have the name "hello_world" and were submitted to partition "test".

Tibor

On 02.02.2012 19:12, Danny Auble wrote:
> I don't know what is going on then. If you look at the SQL it would
> appear all sorts of interesting characters are in there. I thought
> your nodelist was very interesting as well. Job name was also very
> interesting.
>
> Perhaps you should look into why those names are coming through. A '
> in a job name is already handled. I don't know how the partition name
> could be messed with though.
>
> Danny
>
> On 02/02/12 10:05, Tibor Pausz wrote:
>> Hello Danny,
>>
>> we don't have any ' inside the cluster name, any partition or
>> somewhere else!
>> I don't know if some users have submitted jobs where ' is part of
>> the job
>> name, but we have not used any special character inside our config.
>>
>> Best regards,
>> Tibor
>>
>> On Thu, Feb 02, 2012 at 08:51:20AM -0800, Danny Auble wrote:
>>> Tibor, the problem comes from the ' in your partition name. Up to
>>> this time I don't think anyone has ever done that. I am not sure
>>> what other problems might arise from that name either. But if this
>>> is the first time you have seen it this might be one of the only
>>> problems it causes.
>>>
>>> To get around it in the code you can call
>>> slurm_add_slash_to_quotes() as done elsewhere in the code on the
>>> partition name in src/plugins/mysql/as_mysql_job.c, pretty much
>>> wherever you see the partition name. It might need to be done
>>> elsewhere as well in the mysql plugin. Let us know how it goes.
>>> Don't forget to xfree the new partition name after its use.
>>>
>>> Danny
>>>
>>> On 02/02/12 01:06, Tibor Pausz wrote:
>>>> Hello!
>>>>
>>>> we have trouble with the slurmdbd (version 2.3.0-2) in combination
>>>> with
>>>> MySQL accounting. We have several entries in the slurmdbd.log per hour
>>>> with the same kind of error (see below). After some time the slurmdbd
>>>> gets stuck.
>>>>
>>>> error: mysql_query failed: 1064 You have an error in your SQL syntax;
>>>> check the manual that corresponds to your MySQL server version for the
>>>> right syntax to use near '<A4><AC>*', '744') on duplicate key update
>>>> job_db_inx=LAST_INSERT_ID(job_db_inx), id_w' at line 1
>>>> insert into "loewe_job_table" (id_job, id_assoc, id_qos, id_wckey,
>>>> id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
>>>> time_submit, time_start, job_name, track_steps, state, priority,
>>>> cpus_req, cpus_alloc, nodes_alloc, account, partition, node_inx)
>>>> values
>>>> (2290165, 463, 3, 0, 536, 525, '<A0>+^F<A4><AC>*', 0, 1440,
>>>> 1327054146,
>>>> 1327054146, 1327054146, '0^N<A4><AC>*', 0, 5, 100006, 1, 1, 1,
>>>> 'tomograpp','<D0>'<A4><AC>*', '744') on duplicate key update
>>>> job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=536,
>>>> id_group=525, nodelist='<A0>+^F<A4><AC>*', id_resv=0, timelimit=1440,
>>>> time_submit=1327054146, time_start=1327054146,
>>>> job_name='0^N<A4><AC>*',
>>>> track_steps=0, id_qos=3, state=greatest(state, 5), priority=100006,
>>>> cpus_req=1, cpus_alloc=1, nodes_alloc=1, account='tomograpp',
>>>> partition='
>>>> <D0>'<A4><AC>*', node_inx='744'
>>>>
>>>>
>>>> The slurmctld.log contains
>>>> error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
>>>> …
>>>>
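
For reference, the workaround Danny describes in the quoted mail (escape the partition name before it goes into the SQL string, then free the escaped copy) could look roughly like the sketch below. It is standalone C, not code from the Slurm tree: escape_quotes() is a hypothetical stand-in for slurm_add_slash_to_quotes(), format_partition_clause() is an invented helper, and malloc()/free() take the place of xmalloc()/xfree(). Which characters the real Slurm function escapes is an assumption here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for slurm_add_slash_to_quotes(): returns a newly
 * allocated copy of str with a backslash in front of every single quote,
 * double quote and backslash. The caller must free() the result.
 */
static char *escape_quotes(const char *str)
{
	size_t len;
	char *out, *dst;

	if (!str)
		return NULL;

	len = strlen(str);
	/* Worst case: every character needs an escape, plus the final NUL. */
	out = malloc(2 * len + 1);
	if (!out)
		return NULL;

	dst = out;
	for (; *str; str++) {
		if (*str == '\'' || *str == '"' || *str == '\\')
			*dst++ = '\\';
		*dst++ = *str;
	}
	*dst = '\0';
	return out;
}

/*
 * Hypothetical helper: writes partition='<escaped name>' into buf.
 * Returns the snprintf() result, or -1 if the escaped copy could not
 * be allocated.
 */
static int format_partition_clause(char *buf, size_t size,
				   const char *partition)
{
	char *escaped;
	int n;

	assert(buf != NULL && size > 0);

	escaped = escape_quotes(partition);
	if (!escaped)
		return -1;

	n = snprintf(buf, size, "partition='%s'", escaped);
	free(escaped);	/* counterpart of the xfree() mentioned above */
	return n;
}
```

With this sketch a partition name such as te'st ends up in the statement as partition='te\'st'. Escaping only helps when the name really contains a quote; it does not explain the other unexpected bytes in the values logged above.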
