[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)
Hi,

are you both working on the same cluster as the OP?

On 10/25/2016 08:12, suprita.bot...@wipro.com wrote:
> I have installed slurm on a 2 node cluster.
>
> On the master node when I run sinfo command I get below output.
[...]
> But on compute node: Slurmd daemon is also running but it gives the error:
>
> Unable to contact slurm controller (connect failure).
> I am not able to understand the error, why this error exists. Although
> in the master node sinfo output state of this node is coming out to be idle.

Have you copied the _exact_ slurm.conf from the master to the compute node?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
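One way to confirm that two copies of slurm.conf are byte-identical is to compare their checksums. The following is only a minimal sketch in Python: `file_digest` and `configs_match` are hypothetical helper names, and it assumes the compute node's copy has already been fetched to the machine where the script runs (for example with scp).

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def configs_match(path_a, path_b):
    """Return True if both files have byte-identical contents."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling `configs_match` with the master's slurm.conf and the fetched copy returns False if even a single character differs, including whitespace.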
[slurm-dev] Re: slurm_load_partitions: Unable to contact slurm controller (connect failure)
On 25/10/16 10:05, Peixin Qiao wrote:
>
> I installed slurm-llnl on Debian on one computer. When I ran slurmctld
> and slurmd, I got the error:
> slurm_load_partitions: Unable to contact slurm controller (connect failure).

Check your firewall rules to ensure that those connections aren't getting blocked, and also check that the hostname correctly resolves.

--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
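Both checks (name resolution and a blocked connection) can be tried from the node that reports the error. This is a rough sketch in Python, not a Slurm tool: `check_controller` is a hypothetical helper, and the default port 6817 is taken from the SlurmctldPort value in the slurm.conf posted in this thread.

```python
import socket


def check_controller(host, port=6817, timeout=3.0):
    """Return (resolved_ip, reachable) for the controller host and port.

    resolved_ip is None if the host name does not resolve.
    reachable is True only if a TCP connection could be opened.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        # The name does not resolve: check /etc/hosts or DNS.
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Refused or timed out: slurmctld is not listening there,
        # or a firewall is blocking the port.
        return ip, False
```

For the configuration in this thread one would call `check_controller("debian")`. A result of `(None, False)` points to name resolution, while an IP address paired with `False` points to the firewall or to slurmctld not running.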
[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)
Hi,

1) Disable SELinux.
2) Stop iptables.
3) Check the date and time on both machines. They should show the same time.
4) Restart the slurm service on the controller and on the node.

Regards
Champak
[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)
Hi,

I have installed slurm on a 2 node cluster.

When I run the sinfo command on the master node, I get the output below.

sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug*    up    infinite  2     idle  punehpcdl[01-02]

On the compute node the slurmd daemon is also running, but it gives the error:

Unable to contact slurm controller (connect failure).

I do not understand why this error occurs, since the sinfo output on the master node shows this node as idle.

Please help on this.

Thanks & Regards
Suprita Bothra

From: Peixin Qiao [mailto:pq...@hawk.iit.edu]
Sent: Tuesday, October 25, 2016 4:36 AM
To: slurm-dev
Subject: [slurm-dev] slurm_load_partitions: Unable to contact slurm controller (connect failure)

Hello,

I installed slurm-llnl on Debian on one computer. When I ran slurmctld and slurmd, I got the error:
slurm_load_partitions: Unable to contact slurm controller (connect failure).

The slurm.conf is as follows:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=debian
#ControlAddr=
#
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/openssl
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobCredentialPrivateKey = /usr/local/etc/slurm.key
JobCredentialPublicCertificate = /usr/local/etc/slurm.cert
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=debian CPUs=4 RealMemory=5837 Sockets=4
PartitionName=debug Nodes=debian Default=YES

Best Regards,
Peixin
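When comparing configurations between machines, the two settings that decide where clients look for the controller are ControlMachine and SlurmctldPort. As a rough illustration only, the sketch below reads them out of the text of a slurm.conf. `parse_slurm_conf` is a hypothetical helper, not part of Slurm, and it deliberately simplifies: on lines that carry several pairs (NodeName, PartitionName) it records only the first key, with the rest of the line as its value.

```python
def parse_slurm_conf(text):
    """Parse 'Key=Value' lines, skipping blanks and '#' comments.

    Simplification: only the first key on each line is recorded,
    and the remainder of the line becomes its value.
    """
    settings = {}
    for raw in text.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings
```

Running it on the file contents from each machine and comparing the `ControlMachine` and `SlurmctldPort` entries shows quickly whether both sides point at the same controller.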