We are seeing a lot of jobs fail to start on our BGQ. I don't know if the 
problem is in Slurm or the BGQ software. I suspect it is in the communication 
between the two, so I am asking both IBM and this list. When jobs are 
submitted to Slurm, many of them fail with:

2012-10-04 08:25:05.531 (FATAL) [0xfffb05d8b70] 11605:ibm.runjob.client.Job: 
could not start job: job failed to start
2012-10-04 08:25:05.531 (FATAL) [0xfffb05d8b70] 11605:ibm.runjob.client.Job: 
I/O node R00-IC-J04 is not connected

The I/O node listed varies, but the rest of the message is the same for all 
jobs. The sizes of the jobs that fail vary over time, but at any one time it 
seems that all jobs of certain sizes fail while all jobs of other sizes run. 
For instance, last night all jobs of 256, 128 or 64 nodes failed while all 
jobs asking for fewer nodes ran. The day before, jobs of 64 and 128 nodes ran 
and jobs of 32 nodes and below failed. Last night 54 jobs tried to run: 24 
ran without error and 30 failed to start.
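
In case it helps anyone compare notes, here is a minimal sketch for tallying 
which I/O nodes are named in these failures. It assumes the runjob output has 
been saved as text and only matches the "is not connected" message quoted 
above. The function name is my own and not part of Slurm or the BGQ tools.

```python
import re
from collections import Counter

# Matches the message text quoted above, e.g.
# "I/O node R00-IC-J04 is not connected". The pattern is an assumption
# based only on that sample; other runjob messages are ignored.
NOT_CONNECTED = re.compile(r"I/O node (\S+) is not connected")


def count_disconnected_io_nodes(lines):
    """Return a Counter mapping I/O node location to number of reports.

    `lines` is any iterable of strings, such as an open log file.
    (Hypothetical helper, written for illustration.)
    """
    counts = Counter()
    for line in lines:
        match = NOT_CONNECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file of saved job output to count_disconnected_io_nodes() and 
printing the result of .most_common() would show whether the failures cluster 
on a few I/O nodes or are spread across all of them.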

We have a single-rack BGQ system. Slurm is allocating two 512-node blocks and 
running all jobs in these blocks in a shared mode. Our slurm.conf and 
bluegene.conf files are attached. I don't know if this started with Slurm 2.4.3 
or with the update of the BGQ software to 1.1.2. Both of these updates happened 
within the past couple of weeks.

There are no errors being logged in the BGQ log files.

I have run out of things to try. I don't really want to go back to fixed block 
allocations since this DYNAMIC setting gives us a lot of flexibility and helps 
keep the system utilization higher.

Thanks for any pointers, or even confirmation if anyone else is seeing this 
kind of behavior.

-- 
Carl Schmidtmann 
Center for Integrated Research Computing 
University of Rochester 

Attachment: slurm.conf
Description: Binary data

Attachment: bluegene.conf
Description: Binary data
