Hi Brad,

Just to get a couple of easy questions out of the way:

0. When you say that "bioshock" is "the main machine," does that mean it's the master node?
1. Have the /etc/hosts definitions been propagated across the cluster?
2. Can you ping master from each of the clients, and vice versa?
3. Does "munge -n | unmunge" generate a successful result on each of the nodes?
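
If it helps, checks 1 through 3 could be scripted roughly as below and run from the master. This is only a sketch: the node names are the ones from your message, passwordless ssh to every node is an assumption, and the function names (has_host, status, munge_to, check_node, check_all) are made up for illustration. The 'hosts' column only inspects the local /etc/hosts; the 'ping_back' column is what shows whether each node can resolve and reach 'master'.

```shell
#!/bin/sh
# Sketch only: node names are taken from the original message, and
# passwordless ssh from this machine to every node is an assumption.
NODES="master cn1 cn2 cn3 cn4 cn5"
HOSTS_FILE=${HOSTS_FILE:-/etc/hosts}
SSH="ssh -o BatchMode=yes"   # BatchMode: fail instead of prompting for a password

# has_host NAME [FILE]: succeed if NAME appears as a hostname or alias
# on a non-comment line of FILE (default: $HOSTS_FILE).
has_host() {
    awk -v h="$1" '
        /^[ \t]*#/ { next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break      # rest of the line is a comment
                if ($i == h) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-$HOSTS_FILE}"
}

# status CMD...: run CMD quietly and print "ok" or "FAIL".
status() {
    if "$@" >/dev/null 2>&1; then echo ok; else echo FAIL; fi
}

# munge_to NODE: encode a credential locally and decode it on NODE.
munge_to() {
    munge -n | $SSH "$1" unmunge
}

# check_node NODE: print a one-line summary of all checks for NODE.
check_node() {
    node=$1
    echo "$node:" \
        "hosts=$(status has_host "$node")" \
        "ping=$(status ping -c 1 -W 2 "$node")" \
        "ping_back=$(status $SSH "$node" ping -c 1 -W 2 master)" \
        "munge=$(status munge_to "$node")"
}

# check_all: run check_node for every name in NODES.
check_all() {
    for n in $NODES; do check_node "$n"; done
}
```

After sourcing the file, 'check_all' prints one line per node; any FAIL column points at the check worth looking into first.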

Andy

On 11/15/2011 09:44 AM, Brad Reisfeld wrote:
Hi,

I am trying to use slurm on a small cluster (master node + 5 compute
nodes). I am just getting started with slurm, so please forgive me
for bringing up what are likely very basic issues and problems. I
couldn't find relevant solutions by looking in the mailing list
archive or by googling.

platform: Linux CentOS v5
slurm: installed from rpms based on slurm-2.3.1.tar.bz2.

I installed munge-0.5.10 and it appears to be working on the master
and all of the compute nodes.

I have the IP addresses of the master node ('master') and compute
nodes ('cn1',...,'cn5') in /etc/hosts. The main machine ('bioshock')
has two network interfaces and I can successfully ping the master
node and all of the compute nodes from it.

I have the line 'ControlMachine=master' in my slurm.conf file.

When starting slurm through slurmctld, I experience a couple of
issues as shown below my signature.

In these messages, I don't know what to make of
'Invalid RPC received 2030 while in standby mode'
and I don't understand why I get
'Neither primary nor backup controller responding, sleep and retry'
when I can successfully ping the primary controller (which I assume
is the same as ControlMachine).

Strangely, after I execute

$ /etc/init.d/slurm start

The system seems to show that the primary/backup are up:

$ scontrol ping
Slurmctld(primary/backup) at master/bioshock are UP/UP

At this stage, if I execute 'scontrol show config', the command just
hangs and has produced no output even after several minutes. The
command 'sinfo' also hangs.

If I then execute 'slurmctld' again, I get the same error messages
as shown below.


I'd appreciate any help or insights you can provide to help me
address these issues.

Thank you.

Kind regards,
Brad

==========

$ slurmctld -Dvvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
slurmctld: slurmctld version 2.3.1 started on cluster cluster
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: slurmctld running in background mode
slurmctld: debug3: _background_rpc_mgr pid = 32571
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmctld: debug3: Success.
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug:  Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug:  Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug:  Neither primary nor backup controller responding, sleep and retry
...

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP
