Hi!

I am trying to install slurmd version 2.6.5 on Red Hat Enterprise Linux
Server release 5.1.

As a first step I am trying to install Slurm on a single node, but I am
running into the problems described below.


ps -el | grep slurmctld
prints nothing.

ps -el | grep slurmd
prints nothing.

/etc/init.d/slurm start
fails with: /etc/init.d/slurm: command not found

sudo service slurm start
fails with: service: command not found
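
The two "command not found" errors may simply mean that no init script was installed under /etc/init.d and that the directory holding service is not on the PATH used by sudo. A minimal Python sketch for checking this follows. The find_command helper and the list of extra directories are assumptions made for illustration and are not part of Slurm.

```python
import os
import shutil

# Assumed extra directories: admin tools often live here and may be
# missing from the PATH that sudo or a non-login shell provides.
EXTRA_DIRS = ["/sbin", "/usr/sbin", "/usr/local/sbin"]


def find_command(name, extra_dirs=EXTRA_DIRS):
    """Return the full path of `name`, searching PATH plus extra_dirs, or None."""
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    search_path = os.pathsep.join(path_dirs + list(extra_dirs))
    return shutil.which(name, path=search_path)


for cmd in ("service", "slurmd", "slurmctld"):
    print(cmd, "->", find_command(cmd) or "not found")

# The init script is only present if it was copied there during installation.
print("/etc/init.d/slurm exists:", os.path.exists("/etc/init.d/slurm"))
```

If a command is found only in one of the extra directories, calling it by its full path should work.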

sudo slurmd -Dvvvvv gives the output below:

slurmd: Node configuration differs from hardware: CPUs=6:6(hw) Boards=1:1(hw) SocketsPerBoard=6:1(hw) CoresPerSocket=1:6(hw) ThreadsPerCore=1:1(hw)
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
slurmd: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: Gathering cpu frequency information for 6 cpus
slurmd: debug:  cpu_freq_init: cpu 0, reset freq: 1600000, reset governor: userspace
slurmd: debug:  cpu_freq_init: cpu 1, reset freq: 1600000, reset governor: userspace
slurmd: debug:  cpu_freq_init: cpu 2, reset freq: 1600000, reset governor: userspace
slurmd: debug:  cpu_freq_init: cpu 3, reset freq: 1600000, reset governor: userspace
slurmd: debug:  cpu_freq_init: cpu 4, reset freq: 1600000, reset governor: userspace
slurmd: debug:  cpu_freq_init: cpu 5, reset freq: 1600000, reset governor: userspace
slurmd: debug3: NodeName    = localhost
slurmd: debug3: TopoAddr    = localhost
slurmd: debug3: TopoPattern = node
slurmd: debug3: CacheGroups = 0
slurmd: debug3: Confile     = `/usr/local/etc/slurm.conf'
slurmd: debug3: Debug       = 3
slurmd: debug3: CPUs        = 6  (CF:  6, HW:  6)
slurmd: debug3: Boards      = 1  (CF:  1, HW:  1)
slurmd: debug3: Sockets     = 6  (CF:  6, HW:  1)
slurmd: debug3: Cores       = 1  (CF:  1, HW:  6)
slurmd: debug3: Threads     = 1  (CF:  1, HW:  1)
slurmd: debug3: UpTime      = 14155466 = 163-20:04:26
slurmd: debug3: Block Map   = 0,1,2,3,4,5
slurmd: debug3: Inverse Map = 0,1,2,3,4,5
slurmd: debug3: RealMemory  = 24098
slurmd: debug3: TmpDisk     = 48418
slurmd: debug3: Epilog      = `(null)'
slurmd: debug3: Logfile     = `(null)'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName    = localhost
slurmd: debug3: NodeAddr    = (null)
slurmd: debug3: Port        = 6818
slurmd: debug3: Prolog      = `(null)'
slurmd: debug3: TmpFS       = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd  = `/usr/local/sbin/slurmstepd'
slurmd: debug3: Spool Dir   = `/tmp/slurmd'
slurmd: debug3: Pid File    = `/var/run/slurmd.pid'
slurmd: debug3: Slurm UID   = 0
slurmd: debug3: TaskProlog  = `(null)'
slurmd: debug3: TaskEpilog  = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: Use PAM     = 0
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/proctrack_pgid.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/task_none.so
slurmd: task NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug3: Success.
slurmd: debug:  spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmd: Munge cryptographic signature plugin loaded
slurmd: debug3: Success.
slurmd: debug3: initializing slurmd spool directory
slurmd: debug3: slurmd initialization successful
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 2.6.5 started
slurmd: debug3: finished daemonize
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmd: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmd: switch NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: successfully opened slurm listen port *:6818
slurmd: slurmd started on Tue, 11 Feb 2014 15:27:09 +0530
slurmd: CPUs=6 Boards=1 Sockets=6 Cores=1 Threads=1 Memory=24098 TmpDisk=48418 Uptime=14155466
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug:  Unable to register with slurm controller, retrying
slurmd: debug3: CPUs=6 Boards=1 Sockets=6 Cores=1 Threads=1 Memory=24098 TmpDisk=48418 Uptime=14155476
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
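
The first line of the slurmd output above lists each value as configured:hardware, which matches the later "CF: ..., HW: ..." lines. A small Python sketch that picks out the fields where slurm.conf and the detected hardware disagree; the helper name config_mismatches is hypothetical and the input line is copied from the log:

```python
import re


def config_mismatches(line):
    """Return {field: (configured, hardware)} for fields whose two values differ."""
    pairs = re.findall(r"(\w+)=(\d+):(\d+)\(hw\)", line)
    return {name: (int(conf), int(hw)) for name, conf, hw in pairs if conf != hw}


line = (
    "slurmd: Node configuration differs from hardware: CPUs=6:6(hw) "
    "Boards=1:1(hw) SocketsPerBoard=6:1(hw) CoresPerSocket=1:6(hw) "
    "ThreadsPerCore=1:1(hw)"
)

for field, (conf, hw) in config_mismatches(line).items():
    print(f"{field}: slurm.conf says {conf}, hardware reports {hw}")
```

For this line it reports SocketsPerBoard (6 configured, 1 detected) and CoresPerSocket (1 configured, 6 detected). The other fields agree.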





sudo slurmctld -Dvvvvv gives the output below:

debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug3: Version in assoc_mgr_state header is 1
slurmctld: slurmctld version 2.6.5 started on cluster hybrid_cluster
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/select_linear.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No backup controller to shutdown
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  Reading slurm.conf file: /usr/local/etc/slurm.conf
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No DownNodes
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Version string in node_state header is VER006
slurmctld: Recovered state of 1 nodes
slurmctld: debug3: Version string in job_state header is VER014
slurmctld: debug3: Job id in job_state header is 1
slurmctld: debug3: Set job_id_sequence to 1
slurmctld: Recovered information about 0 jobs
slurmctld: debug:  Updating partition uid access list
slurmctld: debug3: Version string in resv_state header is VER004
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/priority_basic.so
slurmctld: debug:  Priority BASIC plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: _slurmctld_background pid = 9501
slurmctld: debug3: _slurmctld_rpc_mgr pid = 9501
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug:  Spawning registration agent for localhost 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to localhost
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:6818: Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug3: problems with localhost
slurmctld: debug2: Tree head got back 1
slurmctld: agent/is_node_resp: node:localhost rpc:1001 : Communication connection failure
slurmctld: debug:  backfill: beginning
slurmctld: debug:  backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
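
Both logs end in "Connection refused" on 127.0.0.1: port 6817 (slurmctld) in the slurmd output and port 6818 (slurmd) in the slurmctld output. That would be consistent with each daemon having been started in the foreground while the other one was not running. A hedged Python sketch for checking whether anything is listening on those ports; port_is_open is a hypothetical helper and the port numbers are taken from the logs above:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("slurmctld", 6817), ("slurmd", 6818)):
        state = "accepting connections" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} port {port}: {state}")
```

Running it while both daemons are up (for example in two separate terminals) should show both ports accepting connections.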

To run slurmd on all compute nodes, do we need to install Slurm 2.6.5
manually on every compute node?

I am new to Slurm, so please help me resolve these problems. I am
also attaching my slurm.conf file.


Thanks
Nagendra

Attachment: slurm.conf