[slurm-dev] Mslurm
Hello all,

As we presented at the Slurm User Group Meeting (http://slurm.schedmd.com/SUG13/Mslurm.pdf), here at Leibniz-Rechenzentrum (LRZ) we work with several small to mid-size clusters managed by Slurm. In our case, all the Slurm control daemons and the Slurm database daemons run on the same master node. To do this we use a set of scripts called mslurm, which is now available for download from the Slurm web page. In the downloads section you will find an overview, the installation instructions, and a tgz with the scripts: http://slurm.schedmd.com/download.html

You can find more information in the SUG13 presentation (Multi cluster management): http://slurm.schedmd.com/SUG13/Mslurm.pdf

Please feel free to use it and modify it to fit your needs.

Regards,
Juan Pancorbo Armada
juan.panco...@lrz.de
http://www.lrz.de
Leibniz-Rechenzentrum
Abteilung: Hochleistungssysteme
Boltzmannstrasse 1, 85748 Garching
Telefon: +49 (0) 89 35831-8735
Fax: +49 (0) 89 35831-8535
[slurm-dev] Problem with reservations
Hi Guys,

I'm currently experiencing a problem with a reservation. The job has been submitted with the appropriate --reservation parameter, the reservation is active, and all nodes in the reservation are in the idle state. Despite these conditions, the job remains in the pending state. You can find the output from scontrol show below. Can you give me advice on where I can find the code responsible for running jobs in a reservation? I'm using the backfill scheduler.

root@zdog:~# scontrol show job 33301
JobId=33301 Name=bash
   UserId=um(5830) GroupId=icm-meteo(105)
   Priority=58 Account=root QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2013-11-05T09:35:32 EligibleTime=2013-11-05T09:35:32
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=hydra AllocNode:Sid=hpc:27470
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=meteo
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/icm/home/um

root@zdog:~# scontrol show res meteo
ReservationName=meteo StartTime=2013-10-31T14:56:10
   EndTime=2013-11-14T14:56:10 Duration=14-00:00:00
   Nodes=wn[2085,2091,2093,2095,2097] NodeCnt=5 CoreCnt=60
   Features=intelx5660 PartitionName=hydra Flags=
   Users=um Accounts=(null) Licenses=(null) State=ACTIVE

root@zdog:~# scontrol show node wn2085
NodeName=wn2085 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.04
   Features=intelx5660,westmere,ib,qcg,noht
   Gres=(null) NodeAddr=wn2085 NodeHostName=wn2085
   OS=Linux RealMemory=24146 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=442266 Weight=20
   BootTime=2013-10-31T14:26:42 SlurmdStartTime=2013-11-04T13:36:31
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

cheers,
marcin
[slurm-dev] RE: Fwd: Failed to contact primary controller : No route to host
Hi,

Could you try populating your /etc/hosts like this:

130.1.2.205 qdr1
130.1.2.206 qdr2

and try again:

$ nc -v qdr1 6818

Best Regards,
PREVOST Ludovic
NEC HPC Europe

From: Arjun J Rao [mailto:rectangle.k...@gmail.com]
Sent: Tuesday, 5 November 2013 08:54
To: slurm-dev
Subject: [slurm-dev] Fwd: Failed to contact primary controller : No route to host

I have SLURM installed on two nodes, qdr1 and qdr2, with IP addresses 130.1.2.205 and 130.1.2.206. I started slurmctld on qdr1, and slurmd on both qdr1 and qdr2. The slurmd on qdr1 is running fine, but the slurmd on qdr2 gives the following error message:

slurmd: debug2: _slurm_connect failed.: No route to host
slurmd: debug2: Error connecting slurm stream socket at 130.1.2.205:6817: No route to host
slurmd: debug: Failed to contact primary controller: No route to host

Now I have tried netstat -lnt on qdr1 (130.1.2.205) and it shows this:

Proto Recv-Q Send-Q Local Address   Foreign Address  State
tcp        0      0 0.0.0.0:6817    0.0.0.0:*        LISTEN
tcp        0      0 0.0.0.0:6818    0.0.0.0:*        LISTEN

This shows that both slurmctld and slurmd on qdr1 are listening and talking to each other. But doing nc -zv qdr1 6818 from qdr2 gives me the following error:

nc: Connect to qdr1 port 6818 (tcp) failed: No route to host

Edit: pinging from qdr2 to qdr1 and vice versa works fine.
[slurm-dev] RE: Fwd: Failed to contact primary controller : No route to host
My /etc/hosts already has those entries. And like I mentioned, I can ping from qdr2 to qdr1, but nc -v qdr1 6818 shows that there is no route.

On Tue, Nov 5, 2013 at 9:58 AM, Ludovic Prevost ludovic.prev...@emea.nec.com wrote:
Hi, Could you try to populate your /etc/hosts like : 130.1.2.205 qdr1 / 130.1.2.206 qdr2 And try again : $ nc -v qdr1 6818 [...]
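[Editor's note] As this exchange shows, a successful ping (ICMP) does not imply the Slurm ports (TCP) are reachable. For anyone without nc at hand, here is a minimal sketch of the same TCP reachability check in Python; the hostname and ports in the usage comment are the ones from this thread.

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    This is essentially what `nc -z host port` tests: whether the port
    is reachable and something is listening there, which a plain ping
    cannot tell you.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, no route, timeout, DNS failure, ...
        return False

# Usage (hostname/ports from the thread; adjust to your site):
#   tcp_reachable("qdr1", 6817)  # default slurmctld port
#   tcp_reachable("qdr1", 6818)  # default slurmd port
```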
[slurm-dev] Re: Problem with reservations
See the function schedule() in src/slurmctld/job_scheduler.c (main scheduling logic) and _attempt_backfill() in src/plugins/sched/backfill/backfill.c (backfill scheduler). Look in both places for calls to the function job_test_resv().

Quoting Marcin Stolarek stolarek.mar...@gmail.com:
Hi Guys, I'm currently experiencing a problem with a reservation. The job has been submitted with the appropriate --reservation parameter, the reservation is active, and all nodes in the reservation are in the idle state. Despite these conditions, the job remains in the pending state. [...]
[slurm-dev] Oversubscription of GPU resources
Dear list,

how can I oversubscribe a few of our GPU cards (general resources) so that a certain number of users can share the node AND the card for development purposes?

Thanks,
Ulf
--
___
Dr. Ulf Markwardt
Dresden University of Technology
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: (+49) 351/463-33640
WWW: http://www.tu-dresden.de/zih
[slurm-dev] Re: Slurm not working with gcc HARDENED -- was Re: Starting slurmd on Gentoo Linux
On 10/31/2013 06:33 PM, Olaf Leidinger wrote:

Hi Dan,

> The problems you're seeing are hardened-specific.

Yes, that's what I suspected. At your third link, one can read about it.

> I found your Gentoo bug [2]. Could you please attach the logs from the failed build (with hardened gcc) to the ticket so I can take a look at them?

It's attached now; however, the build doesn't fail. Only running fails. https://bugs.gentoo.org/attachment.cgi?id=362374

> I believe you're being bitten by Slurm loading plugins dynamically at runtime, and the build on hardened will need to be tweaked to accommodate this. (See "Issues arising from default NOW" on the hardened toolchain page. [3])

Sounds similar. I don't know what X and transcode do differently, but from the debug output I learn that Slurm uses dlopen, which is the standard way mentioned in the article, isn't it?

Regards, Olaf

Hey Olaf,

I noticed the Gentoo bug report was closed without a real resolution. (The dev suggesting users update their system toolchain to fix a problem like this is asinine.) I'm working on a patch that covers only the Makefiles where it's necessary, but you should have better luck if you run emerge with LDFLAGS=-Wl,-z,lazy. (You can also use the attached ebuild, which will use this flag only if a hardened gcc profile is selected.)

Regards,
Dan
--
Daniel M.
Weeks
Systems Programmer
Center for Computational Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458

# Copyright 1999-2013 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
# $Header: /var/cvsroot/gentoo-x86/sys-cluster/slurm/slurm-2.5.6.ebuild,v 1.1 2013/06/02 19:46:40 alexxy Exp $

EAPI=4

if [[ ${PV} == ** ]]; then
	EGIT_REPO_URI="git://github.com/SchedMD/slurm.git"
	INHERIT_GIT="git-2"
	SRC_URI=""
	KEYWORDS=""
else
	inherit versionator
	if [[ ${PV} == *pre* || ${PV} == *rc* ]]; then
		MY_PV=$(replace_version_separator 3 '-0.') # pre-releases or release-candidates
	else
		MY_PV=$(replace_version_separator 3 '-') # stable releases
	fi
	MY_P="${PN}-${MY_PV}"
	INHERIT_GIT=""
	SRC_URI="http://www.schedmd.com/download/total/${MY_P}.tar.bz2"
	KEYWORDS="~amd64 ~x86"
	S="${WORKDIR}/${MY_P}"
fi

inherit autotools base eutils pam perl-module user toolchain-funcs flag-o-matic ${INHERIT_GIT}

DESCRIPTION="SLURM: A Highly Scalable Resource Manager"
HOMEPAGE="http://www.schedmd.com"
LICENSE="GPL-2"
SLOT="0"
IUSE="lua maui multiple-slurmd +munge mysql pam perl postgres ssl static-libs torque ypbind"

DEPEND="
	!sys-cluster/torque
	!net-analyzer/slurm
	!net-analyzer/sinfo
	mysql? ( dev-db/mysql )
	munge? ( sys-auth/munge )
	ypbind? ( net-nds/ypbind )
	pam? ( virtual/pam )
	postgres? ( dev-db/postgresql-base )
	ssl? ( dev-libs/openssl )
	lua? ( dev-lang/lua )
	!lua? ( !dev-lang/lua )
	>=sys-apps/hwloc-1.1.1-r1"
RDEPEND="${DEPEND}
	dev-libs/libcgroup
	maui? ( sys-cluster/maui[slurm] )"
REQUIRED_USE="torque? ( perl )"

LIBSLURM_PERL_S="${WORKDIR}/${P}/contribs/perlapi/libslurm/perl"
LIBSLURMDB_PERL_S="${WORKDIR}/${P}/contribs/perlapi/libslurmdb/perl"

RESTRICT="primaryuri"

PATCHES=( "${FILESDIR}/${PN}-2.5.4-nogtk.patch" )

src_unpack() {
	if [[ ${PV} == ** ]]; then
		git-2_src_unpack
	else
		default
	fi
}

pkg_setup() {
	enewgroup slurm 500
	enewuser slurm 500 -1 /var/spool/slurm slurm
}

src_prepare() {
	# Gentoo uses /sys/fs/cgroup instead of /cgroup
	# FIXME: Can the ^/cgroup and \([ =\]\)/cgroup patterns be merged?
	sed \
		-e 's|\([ =\]\)/cgroup|\1/sys/fs/cgroup|g' \
		-e 's|^/cgroup|/sys/fs/cgroup|g' \
		-i "${S}/doc/man/man5/cgroup.conf.5" \
		-i "${S}/etc/cgroup.release_common.example" \
		-i "${S}/src/common/xcgroup_read_config.c" \
		|| die "Can't sed /cgroup for /sys/fs/cgroup"
	# and pids should go to /var/run/slurm
	sed -e 's:/var/run/slurmctld.pid:/var/run/slurm/slurmctld.pid:g' \
		-e 's:/var/run/slurmd.pid:/var/run/slurm/slurmd.pid:g' \
		-i "${S}/etc/slurm.conf.example" \
		|| die "Can't sed for /var/run/slurmctld.pid"
	# also state dirs are in /var/spool/slurm
	sed -e 's:StateSaveLocation=*.:StateSaveLocation=/var/spool/slurm:g' \
		-e 's:SlurmdSpoolDir=*.:SlurmdSpoolDir=/var/spool/slurm/slurmd:g' \
		-i "${S}/etc/slurm.conf.example" \
		|| die "Can't sed ${S}/etc/slurm.conf.example for StateSaveLocation=*. or SlurmdSpoolDir=*"
	# and tmp should go to /var/tmp/slurm
	sed -e 's:/tmp:/var/tmp:g' \
		-i "${S}/etc/slurm.conf.example" \
		|| die "Can't sed for StateSaveLocation=*./tmp"
	#
[slurm-dev] Re: Oversubscription of GPU resources
You would need to configure the GPU(s) multiple times in slurm.conf and gres.conf, but duplicate the name in the gres.conf File option like this:

# Configure GPU zero to be allocated twice
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia0

Quoting Ulf Markwardt ulf.markwa...@tu-dresden.de:
Dear list, how can I oversubscribe a few of our GPU cards (general resources) so that a certain number of users can share the node AND the card for development purposes? [...]
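[Editor's note] For completeness, a sketch of what the matching slurm.conf side of this might look like. The node name and CPU count here are hypothetical; the key point is that the node advertises as many gpu gres as there are File lines in gres.conf, so listing the device twice means advertising Gres=gpu:2:

```
# slurm.conf (hypothetical node): advertise two gpu gres even though
# only one physical card exists
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2 CPUs=12 State=UNKNOWN

# gres.conf on gpunode01: the same device file listed twice
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia0
```

Note that with this setup two jobs each requesting --gres=gpu:1 can land on the same physical card, which is exactly the development-time sharing asked about; nothing prevents the jobs from contending for GPU memory.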
[slurm-dev] Re: Failed to contact primary controller : No route to host
On 05/11/13 18:52, Arjun J Rao wrote:

> Now I have tried netstat -lnt on qdr1 (130.1.2.205) and it shows this:
>
> Proto Recv-Q Send-Q Local Address   Foreign Address  State
> tcp        0      0 0.0.0.0:6817    0.0.0.0:*        LISTEN
> tcp        0      0 0.0.0.0:6818    0.0.0.0:*        LISTEN
>
> This shows that both slurmctld and slurmd on qdr1 are listening and talking to each other.

No, it shows that the processes are listening; it does not show whether or not they are communicating, or can even connect.

> But doing nc -zv qdr1 6818 from qdr2 gives me the following error:
>
> nc: Connect to qdr1 port 6818 (tcp) failed: No route to host

That would tend to imply you've got firewall rules that are preventing the communication between the nodes; you'll need to check that.

Good luck!
Chris
--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
[slurm-dev] Re: Failed to contact primary controller : No route to host
Yes, it was the firewall rules on my Scientific Linux installation. I flushed the iptables rules using iptables -F, and now the slurm daemons talk to the slurm controller just fine.

On Tue, Nov 5, 2013 at 11:21 PM, Christopher Samuel sam...@unimelb.edu.au wrote:

> That would tend to imply you've got firewall rules that are preventing
> the communication between the nodes; you'll need to check that. [...]
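[Editor's note] Be aware that iptables -F removes all firewall rules, not just the offending ones, and the flush is lost on reboot anyway. A narrower fix is to permit only Slurm's ports. A sketch, assuming the default ports (6817 for slurmctld, 6818 for slurmd) and that the /24 around the thread's node addresses is a trusted cluster subnet (the subnet mask is an assumption; adjust to your network):

```shell
# Run as root on each node: allow Slurm traffic from the cluster subnet only
iptables -A INPUT -p tcp -s 130.1.2.0/24 --dport 6817:6818 -j ACCEPT
# Persist the rules across reboots on RHEL-family systems such as Scientific Linux
service iptables save
```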