We are seeing a lot of jobs fail to run on our BGQ. I don't know whether the problem is in Slurm or in the BGQ software. I suspect it is in the communication between the two, so I am asking about this both at IBM and here. When jobs are submitted to Slurm, many of them fail with:

    2012-10-04 08:25:05.531 (FATAL) [0xfffb05d8b70] 11605:ibm.runjob.client.Job: could not start job: job failed to start
    2012-10-04 08:25:05.531 (FATAL) [0xfffb05d8b70] 11605:ibm.runjob.client.Job: I/O node R00-IC-J04 is not connected

The I/O node listed varies, but the rest of the message is the same for all failing jobs. The sizes of the jobs that fail vary over time, but at any one time it seems that all jobs of certain sizes fail while all jobs of other sizes run. For instance, last night all jobs of 256, 128 or 64 nodes failed while all jobs asking for fewer nodes ran. The day before, jobs of 64 and 128 nodes were running and jobs of 32 nodes and below were failing. Last night 54 jobs tried to run: 24 ran without error and 30 failed to start.

We have a single-rack BGQ system. Slurm is allocating two 512-node blocks and running all jobs in these blocks in a shared mode. Our slurm.conf and bluegene.conf files are attached.

I don't know if this started with Slurm 2.4.3 or with the update of the BGQ software to 1.1.2. Both of these updates happened within the past couple of weeks. No errors are being logged in the BGQ log files.

I have run out of things to try. I don't really want to go back to fixed block allocations, since the DYNAMIC setting gives us a lot of flexibility and helps keep system utilization higher.

Thanks for any pointers, or even confirmation if anyone else is seeing this kind of behavior.

--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester

Attachments: slurm.conf, bluegene.conf
