We upgraded the BG/Q to V1R2M0 and Slurm to 2.5.1 in late January. The update
to RHEL 6.3 was no problem, since we had already done that with BG/Q V1R1M2,
but we did recompile Slurm after all the other updates. I am not sure a
recompile is required, but since Slurm interfaces with the runjob_mux it is
probably a good idea. We tested first on RHEL 6.3, using the BG/Q simulator
build on another machine. We have never encountered any problems building
Slurm on any version of the OS. I always start with a fresh unpack of the
source, add in my plugin, and reuse the config line from the previous build;
everything builds and installs cleanly. Even the current queue of jobs is
maintained.

The IBM V1R2M0 update was much harder to get built and installed properly.
The toolchain had to be built manually after the update, and the GPFS part
required undocumented manual configuration, along with some magic
incantations, to get it installed properly in the I/O node OS image.
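
For the Slurm rebuild routine described above (reusing the config line from
the previous build), here is a minimal sketch of one way to recover that line.
It assumes the previous build tree still has its autoconf-generated
config.log, which records the invocation as a line starting with "$". The
function name and the sample log contents (including the paths) are made up
for illustration and are not our actual configuration.

```python
import re


def previous_configure_line(config_log_text):
    """Return the configure command recorded in a config.log, or None."""
    for line in config_log_text.splitlines():
        # autoconf records the invocation as a line like "  $ ./configure ..."
        match = re.match(r"^\s*\$\s+(\S*configure\b.*)$", line)
        if match:
            return match.group(1).strip()
    return None


# Hypothetical excerpt of a config.log; the options shown are illustrative only.
sample_log = """\
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

  $ ./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm

## --------- ##
## Platform. ##
"""

if __name__ == "__main__":
    print(previous_configure_line(sample_log))
```

The recovered line can then be rerun unchanged in the freshly unpacked source
tree once the plugin has been added.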

Unfortunately, on the day of the update we made a lot of changes all at once:
BG/Q update, Slurm update, Slurm config changes, Slurm plugin changes, GPFS
changes, and user account changes. So when a subtle but serious LDAP error
made everything go sideways, we spent days backing out changes and testing to
determine what caused the problem. Once the LDAP problem was found and
corrected, all the changes in the system worked as expected.

We are seeing more "Software Errors" from the Q, and we needed efix 010 from
IBM so that the errors clear properly. We haven't determined the cause of the
errors yet. Our users have modified their behavior when running jobs, and we
changed from using licenses in Slurm to running a dummy preemptable job to
implement a debug queue, so we don't know which change is causing the problem.
We have a single-rack BG/Q and our users mostly run 16- or 32-node jobs.
Apparently the problem is more prevalent on systems running small jobs like
this. Slurm knows how to deal with this automatically, but it requires that
one midplane be cleared of jobs and rebooted to clear the error.

Carl 

----- Original Message -----

> [slurm-dev] Questions upgrading on RHEL 6.3 and driver V1R2M0 on BG/Q

> Hi,

> we are planning to upgrade our BG/Q to V1R2M0 and Slurm to 2.5.3 (from
> 2.4.3).
> A prerequisite of the V1R2M0 driver is the upgrade of the frontend
> and service nodes to RHEL 6.3.

> We have a few questions about how to manage these updates; thanks in
> advance if you can help us:
> - Do we need to recompile Slurm with the new driver?
> - Can we have problems running Slurm compiled on RHEL 6.2 on RHEL
> 6.3?

> 
> If you have any information or suggestions concerning this upgrade,
> your comments will be welcome!

> Regards,

> Benoit

-- 

Carl Schmidtmann 
Center for Integrated Research Computing 
University of Rochester 
