Hi,

I have a cluster with 18 nodes. I tried to add 2 new nodes and now slurmctld
crashes.

On the nodes, slurmd starts correctly.


> /usr/local/sbin/slurmctld -Dvvvvvvvvvvvvvvvv
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: error: Configured MailProg is invalid
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/accounting_storage_filetxt.so
> slurmctld: debug2: slurmdb_init() called
> slurmctld: Accounting storage FileTxt plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: not enforcing associations and no list was given so we
> are giving a blank list
> slurmctld: debug3: Version in assoc_mgr_state header is 1
> slurmctld: slurmctld version 2.6.2 started on cluster (null)
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/crypto_munge.so
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/gres_rgpu.so
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/gres_gpu.so
> slurmctld: debug:  init: Gres GPU plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/select_cons_rgpu.so
> slurmctld: Consumable Resources RGPU Node Selection plugin loaded with
> argument 4
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/preempt_none.so
> slurmctld: preempt/none loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/checkpoint_none.so
> slurmctld: debug3: Success.
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/jobacct_gather_linux.so
> slurmctld: Job accounting gather LINUX plugin loaded
> slurmctld: debug3: Success.
> slurmctld: WARNING: We will use a much slower algorithm with
> proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other
> proctrack when using jobacct_gather/linux
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/ext_sensors_none.so
> slurmctld: ExtSensors NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug:  No backup controller to shutdown
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/switch_none.so
> slurmctld: switch NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/topology_none.so
> slurmctld: topology NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug:  No DownNodes
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/jobcomp_filetxt.so
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin
> /usr/local/lib/slurm/sched_backfill.so
> slurmctld: sched: Backfill scheduler plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Version string in node_state header is VER006
> slurmctld: Recovered state of 18 nodes
> slurmctld: debug3: Version string in job_state header is VER014
> slurmctld: debug3: Job id in job_state header is 52973
> Segmentation fault (core dumped)


Previously I added nodes without problems: I added them to slurm.conf and
restarted the slurmctld service (and slurmd on each node).

I thought the new slurm.conf was wrong, so I removed the new nodes and tried
again. Now slurmctld won't start at all, with either the old or the new
slurm.conf.

I have attached the old slurm.conf, which no longer works.

Thanks in advance!
