We upgraded the BG/Q to V1R2M0 and Slurm to 2.5.1 in late January. The update to RHEL 6.3 was no problem, since we had already done that with BG/Q V1R1M2, but we did recompile Slurm after all the other updates. I am not sure whether that is required, but since Slurm interfaces with the runjob_mux it is probably a good idea. We tested first on RHEL 6.3 using a BG/Q simulator build on another machine.

We have never encountered any problems building Slurm on any version of the OS. I always start with a fresh unpack of the source, add in my plugin, and reuse the config line from the previous build; everything builds and installs cleanly. Even the current queue of jobs is maintained.

With the IBM V1R2M0 update it was much harder to get everything built and installed properly. The toolchain had to be built manually after the update, and the GPFS part required undocumented manual configuration, along with some magic incantations, to get it installed properly in the I/O node OS image.
Unfortunately, on the day of the update we made lots of changes all at once: BG/Q update, Slurm update, Slurm config changes, Slurm plugin changes, GPFS changes, and user account changes. So when a subtle but serious LDAP error made everything go sideways, we spent days backing out changes and testing to determine what caused the problem. Once the LDAP problem was found and corrected, all the changes in the system worked as expected.

We are seeing more "Software Errors" from the Q, and we needed efix 010 from IBM for the errors to clear properly. We haven't determined the cause of the errors yet. Our users have modified their behavior when running jobs, and we changed from using licenses in Slurm to running a dummy pre-emptable job to implement a debug queue, so we don't know which change is causing the problem. We have a single-rack BG/Q and our users mostly run 16- or 32-node jobs. Apparently the problem is more prevalent on systems running small jobs like this. Slurm knows how to deal with this automatically, but it requires that one midplane be cleared of jobs and rebooted to clear the error.

Carl

----- Original Message -----
> [slurm-dev] Questions upgrading on RHEL 6.3 and driver V1R2M0 on BG/Q
> Hi,
> we are planning to upgrade our BG/Q to V1R2M0 and Slurm to 2.5.3 (from
> 2.4.3).
> A prerequisite of the V1R2M0 driver is the upgrade of the frontend
> and service nodes to RHEL 6.3.
> We have a few questions about how to manage these updates, thanks in
> advance if you can help us:
> - Do we need to recompile Slurm with the new driver?
> - Can we have problems running Slurm compiled on RHEL 6.2 on RHEL
> 6.3?
>
> If you have any information or suggestions concerning this upgrade,
> your comments will be welcome!
> Regards,
> Benoit

--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester
