[slurm-dev] Mslurm

2013-11-05 Thread Pancorbo, Juan
Hello all,

As we mentioned at the Slurm User Group Meeting 
(http://slurm.schedmd.com/SUG13/Mslurm.pdf), here at Leibniz-Rechenzentrum 
(LRZ) we are working with several small to medium-sized clusters managed by 
Slurm.

In our case, all of the Slurm control daemons and the Slurm database 
daemons run on the same master node.
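
The mslurm scripts take care of the bookkeeping for this; the underlying idea
is simply to give each cluster its own configuration tree and point each
daemon (and client command) at it. A minimal sketch of that idea, not the
mslurm scripts themselves, with made-up paths and ports:

# Each cluster gets its own config directory with distinct ports, e.g.
#   /etc/slurm/clusterA/slurm.conf  (SlurmctldPort=6817)
#   /etc/slurm/clusterB/slurm.conf  (SlurmctldPort=6917)
# Start one slurmctld per cluster on the shared master node:
SLURM_CONF=/etc/slurm/clusterA/slurm.conf slurmctld
SLURM_CONF=/etc/slurm/clusterB/slurm.conf slurmctld
# Client commands are pointed at a particular cluster the same way:
SLURM_CONF=/etc/slurm/clusterB/slurm.conf sinfo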

To do this we use a set of scripts called mslurm, which are now available for 
download on the Slurm web page.

In the downloads section you will find an overview, the installation 
instructions, and a tgz archive with the scripts.

http://slurm.schedmd.com/download.html



You can find more information in the SUG13 presentation (multi-cluster 
management):

http://slurm.schedmd.com/SUG13/Mslurm.pdf
Please feel free to use them and modify them to fit your needs.

Regards

Juan Pancorbo Armada
juan.panco...@lrz.de
http://www.lrz.de


Leibniz-Rechenzentrum
Department: Hochleistungssysteme (High Performance Systems)
Boltzmannstrasse 1, 85748 Garching
Phone: +49 (0) 89 35831-8735
Fax:   +49 (0) 89 35831-8535



[slurm-dev] Problem with reservations

2013-11-05 Thread Marcin Stolarek
Hi Guys,


I'm currently experiencing a problem with a reservation. The job has been
submitted with the appropriate --reservation parameter, the reservation is
active, and all nodes in the reservation are in the idle state.

Despite these conditions the job remains in the pending state. You can find
the output of the scontrol show commands below.

Can you advise me where I can find the code responsible for running jobs
in a reservation? I'm using the backfill scheduler.


root@zdog:~# scontrol show job 33301
JobId=33301 Name=bash
   UserId=um(5830) GroupId=icm-meteo(105)
   Priority=58 Account=root QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2013-11-05T09:35:32 EligibleTime=2013-11-05T09:35:32
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=hydra AllocNode:Sid=hpc:27470
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=meteo
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/icm/home/um

root@zdog:~# scontrol show res meteo
ReservationName=meteo StartTime=2013-10-31T14:56:10
EndTime=2013-11-14T14:56:10 Duration=14-00:00:00
   Nodes=wn[2085,2091,2093,2095,2097] NodeCnt=5 CoreCnt=60
Features=intelx5660 PartitionName=hydra Flags=
   Users=um Accounts=(null) Licenses=(null) State=ACTIVE

root@zdog:~# scontrol show node wn2085
NodeName=wn2085 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.04
Features=intelx5660,westmere,ib,qcg,noht
   Gres=(null)
   NodeAddr=wn2085 NodeHostName=wn2085
   OS=Linux RealMemory=24146 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=442266 Weight=20
   BootTime=2013-10-31T14:26:42 SlurmdStartTime=2013-11-04T13:36:31
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
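
In case it is useful, one generic way to watch what the controller decides
for this job is to raise the slurmctld log level temporarily (the log path
below is only a common default; adjust it to your SlurmctldLogFile setting):

# Bump slurmctld verbosity at runtime and watch the controller log:
scontrol setdebug debug2
tail -f /var/log/slurm/slurmctld.log
# Restore the normal level afterwards:
scontrol setdebug info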


cheers,
marcin


[slurm-dev] RE: Fwd: Failed to contact primary controller : No route to host

2013-11-05 Thread Ludovic Prevost
Hi,

 

Could you try populating your /etc/hosts like this:

 

130.1.2.205 qdr1

130.1.2.206 qdr2

 

And try again:

 

$ nc -v qdr1 6818

 

Best Regards,

PREVOST Ludovic

NEC HPC Europe

 

From: Arjun J Rao [mailto:rectangle.k...@gmail.com] 
Sent: Tuesday, 5 November 2013 08:54
To: slurm-dev
Subject: [slurm-dev] Fwd: Failed to contact primary controller : No route to host

 

I have SLURM installed on two nodes, qdr1 and qdr2, with IP addresses 
130.1.2.205 and 130.1.2.206. I started slurmctld on qdr1, and slurmd on both 
qdr1 and qdr2.

 

The slurmd on qdr1 is running fine, but the slurmd on qdr2 gives the following 
error messages:

slurmd: debug2: _slurm_connect failed.: No route to host

slurmd: debug2: Error connecting slurm stream socket at 130.1.2.205:6817: No route to host

slurmd: debug: Failed to contact primary controller: No route to host

 

Now I have tried netstat -lnt on qdr1 (130.1.2.205) and it shows this:

Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp   0      0      0.0.0.0:6817       0.0.0.0:*          LISTEN
tcp   0      0      0.0.0.0:6818       0.0.0.0:*          LISTEN

 

This shows that both slurmctld and slurmd on qdr1 are listening and talking to 
each other.

But doing nc -zv qdr1 6818 from qdr2 gives me the following error:

nc: Connect to qdr1 port 6818 (tcp) failed: No route to host

 

Edit: pinging from qdr2 to qdr1 and vice versa works fine. 

 

  





[slurm-dev] RE: Fwd: Failed to contact primary controller : No route to host

2013-11-05 Thread Arjun J Rao
My /etc/hosts already has those entries, and as I mentioned, I can ping
from qdr2 to qdr1. But nc -v qdr1 6818 shows that there is no route.


On Tue, Nov 5, 2013 at 9:58 AM, Ludovic Prevost 
ludovic.prev...@emea.nec.com wrote:

 Could you try populating your /etc/hosts like this:

 130.1.2.205 qdr1
 130.1.2.206 qdr2

 And try again:

 $ nc -v qdr1 6818



[slurm-dev] Re: Problem with reservations

2013-11-05 Thread Moe Jette


See the function schedule() in src/slurmctld/job_scheduler.c (main  
scheduling logic) and _attempt_backfill() in  
src/plugins/sched/backfill/backfill.c (backfill scheduler). Look in  
both places for calls to the function job_test_resv().
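
A quick way to locate those call sites in a checked-out Slurm source tree,
run from the top of the source directory:

# Show each reference to job_test_resv() in the two schedulers:
grep -n "job_test_resv" src/slurmctld/job_scheduler.c
grep -n "job_test_resv" src/plugins/sched/backfill/backfill.c
# Or search the whole tree for the definition and all callers:
grep -rn "job_test_resv" src/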


Quoting Marcin Stolarek stolarek.mar...@gmail.com:


Hi Guys,

Can you advise me where I can find the code responsible for running jobs
in a reservation? I'm using the backfill scheduler.

cheers,
marcin





[slurm-dev] Oversubscription of GPU resources

2013-11-05 Thread Ulf Markwardt

Dear list,

how can I oversubscribe a few of our GPU cards (generic resources) so 
that a certain number of users can share the node AND the card for 
development purposes?


Thanks,
Ulf


--
___
Dr. Ulf Markwardt

Dresden University of Technology
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640  WWW:  http://www.tu-dresden.de/zih





[slurm-dev] Re: Slurm not working with gcc HARDEND -- was Re: Starring slurmd on Gentoo Linux

2013-11-05 Thread Daniel M. Weeks
On 10/31/2013 06:33 PM, Olaf Leidinger wrote:
 
 Hi Dan,
 
 problems you're seeing are hardened-specific.
 
 Yes, that's what I suspected. At your third link, one can read:
 
 I found your Gentoo bug [2]. Could you please attach the logs from the
 failed build (with hardened gcc) to the ticket so I can take a look at them?
 
 It's attached now, however, the build doesn't fail. Only running fails.
 
 https://bugs.gentoo.org/attachment.cgi?id=362374
  
 I believe you're being bitten by Slurm loading plugins dynamically at
 runtime and the build on hardened will need to be tweaked to accommodate
 this. (See "Issues arising from default NOW" on the hardened toolchain
 page. [3])
 
 Sounds similar. I don't know what X and transcode do differently, but
 from the debug output I learn that slurm uses dlopen, which is the
 standard way mentioned in the article, isn't it?
 
 Regards,
 
 Olaf
 

Hey Olaf,

I noticed the Gentoo bug report was closed without a real resolution.
(The dev suggesting users update their system toolchain to fix a problem
like this is asinine.)

I'm working on a patch that covers only the Makefiles where it's
necessary, but you should have better luck if you run emerge with
LDFLAGS="-Wl,-z,lazy". (You can also use the attached ebuild, which will
use this flag only if a hardened gcc profile is selected.)
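
A minimal sketch of how such a conditional could look in an ebuild; this is
only my reading of the approach, not necessarily what the attached ebuild
does (gcc-specs-now from toolchain-funcs reports whether the active gcc
forces BIND_NOW, as hardened profiles do):

# Sketch: relax lazy binding only on hardened toolchains so that Slurm's
# dlopen()ed plugins can still resolve symbols lazily at runtime.
inherit flag-o-matic toolchain-funcs

src_configure() {
    if gcc-specs-now; then
        append-ldflags -Wl,-z,lazy
    fi
    default
}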

Regards,
Dan

-- 
Daniel M. Weeks
Systems Programmer
Center for Computational Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458
# Copyright 1999-2013 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
# $Header: /var/cvsroot/gentoo-x86/sys-cluster/slurm/slurm-2.5.6.ebuild,v 1.1 2013/06/02 19:46:40 alexxy Exp $

EAPI=4

if [[ ${PV} == *9999* ]]; then
    EGIT_REPO_URI="git://github.com/SchedMD/slurm.git"
    INHERIT_GIT="git-2"
    SRC_URI=""
    KEYWORDS=""
else
    inherit versionator
    if [[ ${PV} == *pre* || ${PV} == *rc* ]]; then
        MY_PV=$(replace_version_separator 3 '-0.') # pre-releases or release candidates
    else
        MY_PV=$(replace_version_separator 3 '-') # stable releases
    fi
    MY_P="${PN}-${MY_PV}"
    INHERIT_GIT=""
    SRC_URI="http://www.schedmd.com/download/total/${MY_P}.tar.bz2"
    KEYWORDS="~amd64 ~x86"
    S="${WORKDIR}/${MY_P}"
fi

inherit autotools base eutils pam perl-module user toolchain-funcs flag-o-matic ${INHERIT_GIT}

DESCRIPTION="SLURM: A Highly Scalable Resource Manager"
HOMEPAGE="http://www.schedmd.com"

LICENSE="GPL-2"
SLOT="0"
IUSE="lua maui multiple-slurmd +munge mysql pam perl postgres ssl static-libs torque ypbind"

DEPEND="
    !sys-cluster/torque
    !net-analyzer/slurm
    !net-analyzer/sinfo
    mysql? ( dev-db/mysql )
    munge? ( sys-auth/munge )
    ypbind? ( net-nds/ypbind )
    pam? ( virtual/pam )
    postgres? ( dev-db/postgresql-base )
    ssl? ( dev-libs/openssl )
    lua? ( dev-lang/lua )
    !lua? ( !dev-lang/lua )
    >=sys-apps/hwloc-1.1.1-r1"
RDEPEND="${DEPEND}
    dev-libs/libcgroup
    maui? ( sys-cluster/maui[slurm] )"

REQUIRED_USE="torque? ( perl )"

LIBSLURM_PERL_S="${WORKDIR}/${P}/contribs/perlapi/libslurm/perl"
LIBSLURMDB_PERL_S="${WORKDIR}/${P}/contribs/perlapi/libslurmdb/perl"

RESTRICT="primaryuri"

PATCHES=(
    "${FILESDIR}"/${PN}-2.5.4-nogtk.patch
)

src_unpack() {
    if [[ ${PV} == *9999* ]]; then
        git-2_src_unpack
    else
        default
    fi
}

pkg_setup() {
    enewgroup slurm 500
    enewuser slurm 500 -1 /var/spool/slurm slurm
}

src_prepare() {
    # Gentoo uses /sys/fs/cgroup instead of /cgroup
    # FIXME: Can the ^/cgroup and \([ =\]\)/cgroup patterns be merged?
    sed \
        -e 's|\([ =\]\)/cgroup|\1/sys/fs/cgroup|g' \
        -e 's|^/cgroup|/sys/fs/cgroup|g' \
        -i "${S}"/doc/man/man5/cgroup.conf.5 \
        -i "${S}"/etc/cgroup.release_common.example \
        -i "${S}"/src/common/xcgroup_read_config.c \
        || die "Can't sed /cgroup for /sys/fs/cgroup"
    # and pids should go to /var/run/slurm
    sed -e 's:/var/run/slurmctld.pid:/var/run/slurm/slurmctld.pid:g' \
        -e 's:/var/run/slurmd.pid:/var/run/slurm/slurmd.pid:g' \
        -i "${S}"/etc/slurm.conf.example \
        || die "Can't sed for /var/run/slurmctld.pid"
    # also state dirs are in /var/spool/slurm
    sed -e 's:StateSaveLocation=*.:StateSaveLocation=/var/spool/slurm:g' \
        -e 's:SlurmdSpoolDir=*.:SlurmdSpoolDir=/var/spool/slurm/slurmd:g' \
        -i "${S}"/etc/slurm.conf.example \
        || die "Can't sed ${S}/etc/slurm.conf.example for StateSaveLocation=*. or SlurmdSpoolDir=*"
    # and tmp should go to /var/tmp/slurm
    sed -e 's:/tmp:/var/tmp:g' \
        -i "${S}"/etc/slurm.conf.example \
        || die "Can't sed for StateSaveLocation=*./tmp"
# 

[slurm-dev] Re: Oversubscription of GPU resources

2013-11-05 Thread Moe Jette


You would need to configure the GPU(s) multiple times in slurm.conf  
and gres.conf, but duplicate the name in the gres.conf File option  
like this:


# Configure GPU zero to be allocated twice
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia0
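
For completeness, the slurm.conf side would then advertise as many gpu GRES
on the node as there are File lines in gres.conf; a sketch with a placeholder
node name:

# slurm.conf (sketch): enable the GRES type and let the node report two
# allocatable gpu resources, both backed by /dev/nvidia0 above.
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2   # plus the node's usual CPU/memory parameters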


Quoting Ulf Markwardt ulf.markwa...@tu-dresden.de:


Dear list,

how can I oversubscribe a few of our GPU cards (generic resources) so
that a certain number of users might share the node AND the card for
development purposes?

Thanks,
Ulf






[slurm-dev] Re: Failed to contact primary controller : No route to host

2013-11-05 Thread Christopher Samuel


On 05/11/13 18:52, Arjun J Rao wrote:

 Now I have tried netstat -lnt on qdr1 (130.1.2.205) and it shows this:

 Proto Recv-Q Send-Q Local Address      Foreign Address    State
 tcp   0      0      0.0.0.0:6817       0.0.0.0:*          LISTEN
 tcp   0      0      0.0.0.0:6818       0.0.0.0:*          LISTEN

 This shows that both slurmctld and slurmd on qdr1 are listening
 and talking to each other.

No, it shows that the processes are listening; it does not show
whether or not they are communicating, or can even connect.

 But doing nc -zv qdr1 6818 from qdr2 gives me the following error:

 nc: Connect to qdr1 port 6818 (tcp) failed: No route to host

That would tend to imply you've got firewall rules that are preventing
the communication between the nodes; you'll need to check that.
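
A quick way to confirm and fix that on the receiving node; the rule below is
only an illustration (the subnet is taken from the addresses quoted above,
and where the rule belongs depends on your local policy):

# List the current INPUT rules to spot a REJECT/DROP matching ports 6817/6818:
iptables -L INPUT -n --line-numbers
# Allow the Slurm ports (slurmctld 6817, slurmd 6818) from the cluster network:
iptables -I INPUT -p tcp -s 130.1.2.0/24 --dport 6817:6818 -j ACCEPT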

Good luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] Re: Failed to contact primary controller : No route to host

2013-11-05 Thread Arjun J Rao
Yes, it was the firewall rules on my Scientific Linux installation. I flushed
the iptables rules using iptables -F and now the slurm daemons talk with the
slurm controller just fine.


On Tue, Nov 5, 2013 at 11:21 PM, Christopher Samuel
sam...@unimelb.edu.au wrote:

 That would tend to imply you've got firewall rules that are preventing
 the communication between the nodes; you'll need to check that.

 Good luck!
 Chris