Hi,

In version: 14.11.1

In slurm.conf, I tried using the following plugin:

AcctGatherProfileType=acct_gather_profile/hdf5
http://slurm.schedmd.com/acct_gather_profile_plugins.html


In the file acct_gather.conf, when the following line is enabled (uncommented):
#ProfileHDF5Default=All


I discovered that several GBs of data were being acquired quickly, over a few days. It could have been much more if the load and cluster were larger.
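To put a number on that growth, something like the following could be used. This is only a minimal sketch: dir_kb and kb_per_day are helper names made up for illustration, not anything that ships with slurm, and the rate is a plain average over whatever number of days is passed in.

```shell
#!/bin/sh
# Sketch: measure how much space a directory uses and a rough daily rate.
# ASSUMPTION: "dir_kb" and "kb_per_day" are hypothetical helper names.

# Print the size of a directory tree in kilobytes.
dir_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Print the average growth in KB per day (integer division).
kb_per_day() {
    total_kb=$1
    days=$2
    echo $(( total_kb / days ))
}
```

For example, kb_per_day "$(dir_kb /g0/opt/slurm/14.11.1/Profile_hdf5_data)" DAYS, where DAYS is the number of days the profiling has been enabled.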


So I wanted to turn this off, or at least the default of All, and change the location of the collected data. I created a new location, edited acct_gather.conf accordingly, and ran scontrol reconfigure.


Next I wanted to save the existing data and move it to the new location. After using lsof to check whether any of the files under that path were open, I tried the following:

rsync -av /g0/opt/slurm/14.11.1/Profile_hdf5_data/* /lustre/scratch/slurm/14.11.1/Profile_hdf5_data/


The system started the rsync, and within ~2 seconds the server became unresponsive and rebooted. I have never before seen a system reboot as a result of an rsync.


I'd like to learn what slurm and hdf are doing such that accessing the data in this directory causes an immediate crash and reboot. This was not a fluke: I repeated the rsync today, because at first I had no idea what had caused the crash/reboot. I had suspected a hardware fault but found no errors in the BMC event logs.

I understand that it may not be a good idea to rsync the data while slurm is running. I would expect an rsync of open files or directories to make a running application fail, or to produce all kinds of errors. But why does the whole system crash immediately? There was no error anywhere, no kernel panic, etc. I had checked with lsof whether any of the files were open, and concluded that it would be safe to rsync the data. This accounting method seems to make the system vulnerable. The rsync works when slurmctld is stopped, so the problem appears to lie within slurm and the ProfileHDF5Dir. Will the system also be vulnerable if the data in this path is copied out, or read in place with hdfviewer?
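Since the copy only succeeded with the controller stopped, that workaround can be sketched roughly as below. This is a sketch under stated assumptions, not a fix for the crash itself: it assumes a systemd host (the stop/start commands differ on an init-script system), run, dir_in_use and copy_profile_data are helper names invented here, and with DRY_RUN left at its default of 1 the commands are only printed. Note also that lsof only sees processes on the host where it runs, so a clean result says nothing about files held open from other nodes on a shared filesystem.

```shell
#!/bin/sh
# Sketch only: copy profile data while slurmctld is stopped.
# ASSUMPTIONS: systemd host (systemctl); helper names are hypothetical.
# DRY_RUN=1 (the default) prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Exit status: 0 = lsof found something open under the directory,
# 1 = nothing found (or lsof reported an error), 2 = lsof not installed.
# lsof +D walks the whole tree, so it can be slow on large directories.
dir_in_use() {
    command -v lsof >/dev/null 2>&1 || return 2
    lsof +D "$1" >/dev/null 2>&1
}

copy_profile_data() {
    src=$1
    dst=$2
    run systemctl stop slurmctld || return 1
    rc=0
    if dir_in_use "$src"; then
        # Something on this host still has a file open: skip the copy.
        echo "files still open under $src, skipping copy" >&2
        rc=1
    else
        run rsync -av "$src/" "$dst/" || rc=$?
    fi
    # Restart the controller whether or not the copy succeeded.
    run systemctl start slurmctld
    return "$rc"
}
```

With the paths from this post, the call would be: copy_profile_data /g0/opt/slurm/14.11.1/Profile_hdf5_data /lustre/scratch/slurm/14.11.1/Profile_hdf5_data (add DRY_RUN=0 in front only once the printed commands look right).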


Can anyone explain what is happening?

Thanks,
Kevin







Partial configs are listed below.




------------------------------------------------
acct_gather.conf
------------------------------------------------
###
# Slurm acct_gather configuration file
###
#    http://slurm.schedmd.com/acct_gather.conf.html
#

# Parameters for AcctGatherEnergy/ipmi plugin
EnergyIPMIFrequency=10
EnergyIPMICalcAdjustment=yes

# Parameters for AcctGatherProfileType/hdf5 plugin
ProfileHDF5Dir=/lustre/scratch/slurm/14.11.1/Profile_hdf5_data
#ProfileHDF5Dir=/g0/opt/slurm/14.11.1/Profile_hdf5_data
#ProfileHDF5Default=All




------------------------------------------------
slurm.conf    (partial)
------------------------------------------------
AcctGatherNodeFreq=60
AcctGatherEnergyType=acct_gather_energy/ipmi
AcctGatherFilesystemType=acct_gather_filesystem/lustre
AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherFrequency=15
JobacctGatherType=jobacct_gather/cgroup


--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: [email protected]
