[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread Benjamin Redling

Hi,

are you both working on the same cluster as the OP?

On 10/25/2016 08:12, suprita.bot...@wipro.com wrote:
> I have installed slurm on a 2 node cluster.
> 
> On the master node, when I run the sinfo command, I get the output below.
[...]

> But on the compute node, the slurmd daemon is also running, yet it gives
> the error:
> 
> Unable to contact slurm controller (connect failure).

> I do not understand why this error occurs, although in the sinfo output
> on the master node the state of this node is reported as idle.


Have you copied the _exact_ slurm.conf from the master to the compute node?
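
One way to check this is to compare checksums of the two files. The following is only a minimal sketch: it assumes the compute node's slurm.conf has already been copied somewhere readable (for example with scp), and the function names and paths are made up for illustration, not part of Slurm.

```python
import hashlib
from pathlib import Path


def conf_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_conf(master_copy, node_copy):
    """True only if the two files are byte-for-byte identical."""
    return conf_digest(master_copy) == conf_digest(node_copy)


if __name__ == "__main__":
    # Hypothetical paths: the local slurm.conf and a copy fetched from the node.
    import sys
    if len(sys.argv) == 3:
        print("identical" if same_conf(sys.argv[1], sys.argv[2]) else "DIFFERENT")
```

A byte-level comparison is deliberately strict: even a whitespace difference is reported, which matches the "exact copy" requirement.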

Regards,
Benjamin

-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread Christopher Samuel

On 25/10/16 10:05, Peixin Qiao wrote:
> 
> I installed slurm-llnl on Debian on one computer. When I ran slurmctld
> and slurmd, I got the error:
> slurm_load_partitions: Unable to contact slurm controller (connect failure).

Check your firewall rules to ensure that those connections aren't
getting blocked, and also check that the hostname correctly resolves.
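
Both checks can be scripted from the node that reports the error. This is a rough sketch, not a Slurm tool: the function name is made up, and port 6817 is simply the SlurmctldPort from the slurm.conf posted in this thread. A successful TCP connect only shows that the port is reachable; it says nothing about whether slurmctld accepts the request.

```python
import socket


def check_controller(host, port=6817, timeout=3.0):
    """Return (resolved_ip_or_None, reachable) for the given host and port."""
    try:
        # Step 1: does the controller's hostname resolve at all?
        ip = socket.gethostbyname(host)
    except OSError:
        return None, False
    try:
        # Step 2: can a TCP connection be opened (i.e. not firewalled/refused)?
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


if __name__ == "__main__":
    # "debian" is the ControlMachine value from the posted slurm.conf.
    print(check_controller("debian"))
```

If the first element is None, the problem is name resolution (/etc/hosts or DNS); if an address is returned but the second element is False, look at the firewall or at whether slurmctld is actually listening.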

-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread champak dutta
Hi,
1) Disable SELinux.
2) Stop iptables.
3) Check the date and time on both machines; they should match.
4) Restart the slurm service on the controller and on the node.

Regards
Champak



[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread suprita.bothra
Hi,

I have installed slurm on a 2-node cluster.
On the master node, when I run the sinfo command, I get the output below.
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle punehpcdl[01-02]

But on the compute node, the slurmd daemon is also running, yet it gives the error:
Unable to contact slurm controller (connect failure).

I do not understand why this error occurs, although in the sinfo output on the
master node the state of this node is reported as idle.

Please help with this.

Thanks & Regards
Suprita Bothra


From: Peixin Qiao [mailto:pq...@hawk.iit.edu]
Sent: Tuesday, October 25, 2016 4:36 AM
To: slurm-dev 
Subject: [slurm-dev] slurm_load_partitions: Unable to contact slurm controller 
(connect failure)


Hello,

I installed slurm-llnl on Debian on one computer. When I ran slurmctld and 
slurmd, I got the error:
slurm_load_partitions: Unable to contact slurm controller (connect failure).

The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=debian
#ControlAddr=
#
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/openssl
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobCredentialPrivateKey = /usr/local/etc/slurm.key
JobCredentialPublicCertificate = /usr/local/etc/slurm.cert
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=debian CPUs=4 RealMemory=5837 Sockets=4
PartitionName=debug Nodes=debian Default=YES

Best Regards,
Peixin
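
For anyone debugging a config like the one above, it can help to confirm which host and port the client commands will try to reach. Below is a small sketch that reads those values out of slurm.conf text. The helper names are made up, the parser only handles simple Key=Value lines (a multi-field line such as NodeName=... is kept as one raw string), and the fallback from ControlAddr to ControlMachine and the default port 6817 are assumptions based on the slurm.conf man page rather than anything stated in this thread.

```python
def parse_slurm_conf(text):
    """Collect Key=Value settings from slurm.conf text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        settings[key.strip()] = value.strip()
    return settings


def controller_endpoint(settings, default_port=6817):
    """Return (host, port) that clients would use to reach slurmctld."""
    # Assumption: ControlAddr, when set, takes precedence over ControlMachine.
    host = settings.get("ControlAddr") or settings.get("ControlMachine")
    port = int(settings.get("SlurmctldPort", default_port))
    return host, port


if __name__ == "__main__":
    sample = "ControlMachine=debian\n#ControlAddr=\nSlurmctldPort=6817\n"
    print(controller_endpoint(parse_slurm_conf(sample)))
```

With the posted file this yields the host "debian" and port 6817, so the name "debian" has to resolve on every machine that runs Slurm commands.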