Hi, I have a cluster with 18 nodes. I tried to add 2 new nodes, and now slurmctld crashes on startup with a segmentation fault.
On the nodes, slurmd starts correctly. The controller output is:

> /usr/local/sbin/slurmctld -Dvvvvvvvvvvvvvvvv
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: error: Configured MailProg is invalid
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_filetxt.so
> slurmctld: debug2: slurmdb_init() called
> slurmctld: Accounting storage FileTxt plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
> slurmctld: debug3: Version in assoc_mgr_state header is 1
> slurmctld: slurmctld version 2.6.2 started on cluster (null)
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/gres_rgpu.so
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
> slurmctld: debug: init: Gres GPU plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/select_cons_rgpu.so
> slurmctld: Consumable Resources RGPU Node Selection plugin loaded with argument 4
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
> slurmctld: preempt/none loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_none.so
> slurmctld: debug3: Success.
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_linux.so
> slurmctld: Job accounting gather LINUX plugin loaded
> slurmctld: debug3: Success.
> slurmctld: WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/ext_sensors_none.so
> slurmctld: ExtSensors NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug: No backup controller to shutdown
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
> slurmctld: switch NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
> slurmctld: topology NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug: No DownNodes
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_filetxt.so
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
> slurmctld: sched: Backfill scheduler plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Version string in node_state header is VER006
> slurmctld: Recovered state of 18 nodes
> slurmctld: debug3: Version string in job_state header is VER014
> slurmctld: debug3: Job id in job_state header is 52973
> Segmentation fault (core dumped)

In the past I have added nodes without problems: I added them to slurm.conf and restarted the slurmctld service (and slurmd on each node). This time I thought the new slurm.conf was wrong, so I removed the new nodes and tried again. Now slurmctld will not start at all, with either the old or the new slurm.conf. I have attached the old slurm.conf, which no longer works.

Thanks in advance!
