[slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

2017-10-10 Thread Uwe Sauter
Hi, see the man page for slurm.conf: TmpFS Fully qualified pathname of the file system available to user jobs for temporary storage. This parameter is used in establishing a node's TmpDisk space. The default value is "/tmp". So it is using /tmp. You need to change that parameter to
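For reference, a minimal slurm.conf sketch of the change being described; the path, node range and size below are illustrative and not taken from the thread:

    # slurm.conf (same file on all nodes)
    TmpFS=/scratch/local                      # file system whose space is reported as TmpDisk
    NodeName=node[01-16] TmpDisk=200000 ...   # expected size of that file system, in MB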

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter
> Yes, this is possible, but I would say it's discouraged to do so. > With RHEL/CentOS 7 you really should be using firewalld, and forget about the > old iptables. Here's a nice introduction: > https://www.certdepot.net/rhel7-get-started-firewalld/ > > Having worked with firewalld for a while

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter
Alternatively you can systemctl disable firewalld.service systemctl mask firewalld.service yum install iptables-services systemctl enable iptables.service ip6tables.service and configure iptables in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables, then systemctl
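Spelled out, the switch-over might look like this on RHEL/CentOS 7; the subnet in the sample rule is a placeholder, and 6817/6818 are only the default slurmctld/slurmd ports (check your slurm.conf):

    systemctl stop firewalld.service
    systemctl disable firewalld.service
    systemctl mask firewalld.service
    yum install iptables-services
    systemctl enable iptables.service ip6tables.service
    # /etc/sysconfig/iptables, fragment allowing Slurm traffic from the cluster network:
    #   -A INPUT -s 10.0.0.0/24 -p tcp -m multiport --dports 6817,6818 -j ACCEPT
    systemctl start iptables.service ip6tables.service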

[slurm-dev] squeue bug / documentation error

2017-05-03 Thread Uwe Sauter
- tition name then within a given partition by increasing step id). Regards, Uwe Sauter

[slurm-dev] Re: jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Uwe Sauter
: Is there a time limit set on the queue (rather than the user)? On 04/26/2017 12:57 PM, Uwe Sauter wrote: Hi all, I have a mysterious situation where a user's job is killed after 24h though he specified "-t 7-00:00:00" on submission. This happened to several jobs of this user in the las

[slurm-dev] jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Uwe Sauter
Hi all, I have a mysterious situation where a user's job is killed after 24h though he specified "-t 7-00:00:00" on submission. This happened to several jobs of this user in the last few days. The account he's using has MaxWall set to 7-00:00:00. There is no QoS used. In

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Uwe Sauter
On modern systems, nscd or nslcd should have been replaced by sssd. sssd has much better caching than the older services. Am 11.04.2017 um 17:17 schrieb Benjamin Redling: > > AFAIK most requests never hit LDAP servers. > In production there is always a cache on the client side -- nscd might >

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Uwe Sauter
Ray, if you're going with the easy "copy" method just be sure that the nodes are all in the same state (user management-wise) before you do your first copy. Otherwise you might accidentally delete already existing users. I also encourage you to have a look into Ansible which makes it easy to

[slurm-dev] Re: LDAP required?

2017-04-10 Thread Uwe Sauter
For someone with no experience in LDAP deployment, yes, LDAP is a big issue. And depending on the cluster size, there are different possibilities. From a different point of view: tools like Salt/Ansible/… will almost always require some kind of local storage (local installation of OS)

[slurm-dev] Re: Strange hostlist/malloc error

2016-12-19 Thread Uwe Sauter
Do you have limits (per partition / group), QoS (with limits per user), etc configured? Am 19.12.2016 um 15:52 schrieb Wiegand, Paul: > Greetings, > > > We were running slurm 16.05.0 and just upgraded to 16.05.7 during our Fall > maintenance cycle along with other changes. > Now we are

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Uwe Sauter
15.12.2016 um 11:26 schrieb Stefan Doerr: > But this doesn't answer my question why it reports 10 times as much memory > usage as it is actually using, no? > > On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-14 Thread Uwe Sauter
There are only two memory related options "--mem" and "--mem-per-cpu". --mem tells slurm the memory requirement of the job (if used with sbatch) or the step (if used with srun). But not the requirement of each process. --mem-per-cpu is used in combination with --ntasks and --cpus-per-task. If
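Two hedged sbatch invocations illustrating the difference; the script name and sizes are made up, and values are in MB:

    # --mem: real memory per node, shared by whatever tasks land on that node
    sbatch --ntasks=4 --mem=16000 job.sh
    # --mem-per-cpu: memory per allocated CPU; here each of the 4 tasks gets 2 CPUs
    sbatch --ntasks=4 --cpus-per-task=2 --mem-per-cpu=2000 job.sh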

[slurm-dev] Jobs get killed after upgrade from 15.08.12 to 16.05.6 and config changes

2016-12-03 Thread Uwe Sauter
Dear list, this week I updated from 15.08.12 to 16.05.6. Together with this upgrade I also changed some of the configuration options to allow a shared usage (user exclusive) of nodes. Since then some of my users report that their jobs get killed when they allocate more than half of the

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-12 Thread Uwe Sauter
Also, CPUs=32 is wrong. You need Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Am 12.09.2016 um 16:02 schrieb alex straza: > hello, > > We have some slurm nodes that have 32 CPUS - two 8-core processors with > hyperthreading - and are trying to run some > "embarrassingly parallel" jobs.

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-12 Thread Uwe Sauter
Try SelectTypeParameters=CR_Core instead of CR_CPU http://slurm.schedmd.com/cons_res.html Am 12.09.2016 um 16:02 schrieb alex straza: > hello, > > We have some slurm nodes that have 32 CPUS - two 8-core processors with > hyperthreading - and are trying to run some > "embarrassingly parallel"
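Combining the two replies above, the relevant slurm.conf lines for such nodes could look roughly like this (node names are placeholders):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=node[01-04] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN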

[slurm-dev] Re: scontrol reboot won't reboot reserved nodes?

2016-03-01 Thread Uwe Sauter
; On Mon, 29 Feb 2016 11:50:18 PM Uwe Sauter wrote: > >> Did you configure the RebootProgram parameter in slurm.conf and is that >> script working? Remember: this script is run on the compute node, therefore >> it must be available on the compute node and must be executable. >

[slurm-dev] Re: scontrol reboot won't reboot reserved nodes?

2016-02-29 Thread Uwe Sauter
Did you configure the RebootProgram parameter in slurm.conf and is that script working? Remember: this script is run on the compute node, therefore it must be available on the compute node and must be executable. Am 01.03.2016 um 01:54 schrieb Christopher Samuel: > > Hi folks, > > We're at

[slurm-dev] Re: select/cons_res, memory limitation and cgroups

2016-02-18 Thread Uwe Sauter
as well). Regards Uwe Am 16.02.2016 um 09:22 schrieb Diego Zuccato: > > Il 15/02/2016 12:39, Uwe Sauter ha scritto: > >> I am unsure how this can be implemented. If I call "ulimit -d >> $((SLURM_MEM_PER_CPU * SLURM_NTASKS_PER_NODE * 1024))" in the >> Pr

[slurm-dev] select/cons_res, memory limitation and cgroups

2016-02-15 Thread Uwe Sauter
Hi all, last paragraph of http://slurm.schedmd.com/cons_res_share.html states that enforcement of memory allocation limits needs to be done by setting appropriate system limits (I assume by using "ulimit"). I am unsure how this can be implemented. If I call "ulimit -d $((SLURM_MEM_PER_CPU *
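For what it's worth, the arithmetic in question as a shell fragment; whether a prolog is the right place to run it is exactly what this thread is asking, since ulimit only affects the shell it runs in and its children, and it assumes both environment variables are set:

    # SLURM_MEM_PER_CPU is in MB, ulimit -d expects kB
    limit_kb=$(( SLURM_MEM_PER_CPU * SLURM_NTASKS_PER_NODE * 1024 ))
    ulimit -d "$limit_kb"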

[slurm-dev] Re: slurm in the nodes not working

2015-12-21 Thread Uwe Sauter
scontrol update state=IDLE nodename= Am 21.12.2015 um 21:40 schrieb Fany Pagés Díaz: > When I start the server, the nodes were down. I start /etc/init.d/slurm on the server and it's fine, but the nodes > are down. I restarted the nodes again and nothing. Any idea? > > > > *De:*Carlos
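The full form of that command, with a hypothetical node name; State=RESUME is the usual way to return a DOWN or DRAINED node to service:

    scontrol update NodeName=node01 State=IDLE     # node01 is a placeholder
    # or, more commonly, to clear a DOWN/DRAIN state:
    scontrol update NodeName=node01 State=RESUME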

[slurm-dev] Re: Drain reason overwritten

2015-12-18 Thread Uwe Sauter
Hi, depending on what you do with those nodes it might be a good idea to create a maintenance reservation. scontrol create reservation=Wartung flags=MAINT or you can set the node to DOWN before stopping slurmd. Regards, Uwe Am 18.12.2015 um 11:18 schrieb Danny Rotscher: >
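A fuller version of such a reservation, using the explicit name keyword; start time, duration, node list and user below are illustrative:

    scontrol create reservation ReservationName=Wartung Flags=MAINT \
        StartTime=now Duration=04:00:00 Nodes=node[01-16] Users=root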

[slurm-dev] Re: [slurm-devel] update SLURM 2.6.7 to SLURM 15.0.8.4

2015-11-15 Thread Uwe Sauter
As far as I can recall I had to recompile my MPI libs when I upgraded from 14.03 to 14.11. I think one of the issues was with changes in PMI(2) interface. Regards, Uwe Am 15.11.2015 um 01:09 schrieb Apolinar Martinez Melchor: > Hi, > > We want to update SLURM 2.6.7 to SLURM

[slurm-dev] Re: Slurm version 15.08.4 is now available

2015-11-14 Thread Uwe Sauter
Hi, will there be an official backport of the mentioned commits for the 14.11 branch? Regards, Uwe Am 13.11.2015 um 23:59 schrieb Danny Auble: > > Slurm version 15.08.4 is now available it includes about 25 bug fixes > developed over the past couple of weeks. > > One notable fix

[slurm-dev] mem_bind: manpage needs enhancement

2015-10-05 Thread Uwe Sauter
Hi, the manpage to sbatch states for option --mem_bind: […] The following informational environment variables are set when --mem_bind is in use: SLURM_MEM_BIND_VERBOSE SLURM_MEM_BIND_TYPE SLURM_MEM_BIND_LIST See the ENVIRONMENT VARIABLES section for a more detailed

[slurm-dev] Re: sreport inconsistency

2015-07-24 Thread Uwe Sauter
If he used 9000 cores for 24h… Kidding aside, you need to give us more info to work with. Regards, Uwe Am 24.07.2015 um 17:06 schrieb Martin, Eric: Hi, I need help determining what's going on here. The output of sreport says user1 has used 12920941 minutes (215349 hours).

[slurm-dev] Re: This cluster 'cluster' hasn't registered yet, but we have jobs that ran?

2015-07-15 Thread Uwe Sauter
Hi, because accounting in Slurm does more than just accounting (it also allows limiting users, etc.), you need to tell Slurm about your users. Having them available on your systems (locally or via LDAP) is not enough. Docs are here: http://slurm.schedmd.com/documentation.html

[slurm-dev] Re: slurm-dev Comment logging

2015-06-26 Thread Uwe Sauter
I'm not aware of an option that allows the comment to appear in the slurmctld log file but if you are already using accounting take a look at the AccountingStoreJobComment option in slurm.conf. Regards, Uwe Am 26.06.2015 um 23:04 schrieb Cooper, Adam: Hi, Is there some way for
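The corresponding slurm.conf setting, plus one way to read the comment back later; the job ID is hypothetical and the Comment field assumes a reasonably recent sacct:

    # slurm.conf
    AccountingStoreJobComment=YES
    # retrieve the stored comment from the accounting database later:
    sacct -j 12345 --format=JobID,JobName,Comment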

[slurm-dev] Re: Problem running OpenMPI over slurm

2015-06-18 Thread Uwe Sauter
I had to install it manually when I wanted to use PMI2. Might be that this is not necessary for the older PMI (version 1). Am 18.06.2015 um 02:02 schrieb Christopher Samuel: On 18/06/15 03:26, Uwe Sauter wrote: just a dumb question but did you actually build Slurm's PMI plugin

[slurm-dev] Re: Problem running OpenMPI over slurm

2015-06-17 Thread Uwe Sauter
Hi, just a dumb question but did you actually build Slurm's PMI plugin? As it is considered additional you have to manually compile and install it… Regards, Uwe Am 17.06.2015 um 18:52 schrieb Wiegand, Paul: Rémi, This got me a bit farther, thanks. The stack trace stuck in

[slurm-dev] Re: Question : difference between sbatch and srun ?

2015-06-12 Thread Uwe Sauter
sbatch is used to submit job scripts to the scheduler. srun is used in job scripts (or interactively) for each job step that should run on more than one core. See the documentation: http://slurm.schedmd.com/quickstart.html http://slurm.schedmd.com/man_index.html
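A minimal sketch of how the two fit together; the script name, program and resource numbers are made up:

    #!/bin/bash
    #SBATCH -N 2
    #SBATCH -n 8
    #SBATCH -t 01:00:00
    srun ./my_app          # a job step spread over the 8 allocated tasks

    # submitted once from the login node:
    sbatch job.sh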

[slurm-dev] RE: Request information for our current tasks - Need inputs - IMPORTANT

2015-06-02 Thread Uwe Sauter
The impact of encryption very much depends on the instruction set of your CPU (Intel AES-NI) and whether your library will use those. If you have a recent enough CPU you won't see much difference between normal SSH and HPN-SSH… Am 02.06.2015 um 20:10 schrieb John Lockman: This is a bit off

[slurm-dev] Re: SLURM on commodity Linux cluster;

2015-05-14 Thread Uwe Sauter
Before being able to answer your questions, you should probably tell us what you want to achieve with a workload manager such as Slurm. Am 14.05.2015 um 17:58 schrieb Pradeep Bisht: [Resending as I don't see my message in the archives] I'm looking at SLURM and it

[slurm-dev] Re: Pulling program results from nodes

2015-05-07 Thread Uwe Sauter
Trevor, I don't know what your intent is or the machine you are preparing yourself for but in general login nodes and compute nodes share a common filesystem, making the need to move data around (inside of the cluster) unnecessary. If you really need to move data from node local space back to

[slurm-dev] Re: Pulling program results from nodes

2015-05-07 Thread Uwe Sauter
documentation mentions Lustre and NFS but was just curious because I have no experience with either. Thanks, Trevor On May 7, 2015, at 7:28 PM, Uwe Sauter uwe.sauter...@gmail.com wrote: Trevor, I don't know what your intent is or the machine you are preparing yourself for but in general

[slurm-dev] Re: slurmd on first node not responding and is running

2015-05-07 Thread Uwe Sauter
launch for job 352 failed: Invalid job credential”. Any idea what might be causing this error? I’ve turned my SlurmdDebug up to 7 but the log file essentially says the exact same thing as the stdout when I try and submit a job. Thanks, Trevor On May 6, 2015, at 9:54 PM, Uwe Sauter uwe.sauter

[slurm-dev] Re: Error while loading shared libraries

2015-05-06 Thread Uwe Sauter
Check the file permissions for libmpichcxx.so.1.2 as well as the permissions on the parent directories. Might be that you are not allowed to access the folder structure as the user you're running your application as. Am 06.05.2015 um 07:57 schrieb Fany Pagés Díaz: Hello, When I throw an MPI

[slurm-dev] Re: Slurm and PMI2

2015-04-27 Thread Uwe Sauter
Perhaps this thread from last week helps: first post: https://groups.google.com/forum/#!topic/slurm-devel/Z6-tnIzI1IE my question about PMI2: https://groups.google.com/d/msg/slurm-devel/Z6-tnIzI1IE/2nfrwocTNF4J Regards, Uwe Am 27.04.2015 um 21:11 schrieb Ulf Markwardt: Dear

[slurm-dev] Re: MVAPICH2 2.1 and SLURM docs

2015-04-21 Thread Uwe Sauter
Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu On Mon, Apr 20, 2015 at 2:13 PM, Uwe Sauter uwe.sauter...@gmail.com

[slurm-dev] Re: Is there a scontrol command to restart slurm daemons for all nodes?

2015-04-20 Thread Uwe Sauter
Hi, no command that I'm aware of. I'm using pdsh for such occasions. Regards, Uwe Am 20.04.2015 um 15:43 schrieb jupiter: Hi, If I have a centralized slurm.conf linked by all nodes, and if I change the slurm.conf, I need to restart slurm daemons of all nodes. Since slurm
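For example, assuming pdsh is already set up; host range and unit names are placeholders, and sites still on init scripts would use "service slurm restart" instead:

    pdsh -w node[01-99] 'systemctl restart slurmd'
    systemctl restart slurmctld     # on the controller itself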

[slurm-dev] Re: MVAPICH2 2.1 and SLURM docs

2015-04-20 Thread Uwe Sauter
Hi Trey, is that a "will NOT just work" or a "will just work"? Regards, Uwe Am 20.04.2015 um 20:14 schrieb Trey Dockendorf: Just a heads up to anyone who uses MVAPICH2 with srun. The 2.1 docs for MVAPICH2 have new configure flag values since 2.1 supports PMI-2 with SLURM. If you

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Uwe Sauter
Hi, I have the case that OpenMPI was built against Slurm 14.03 (which provided libslurm.so.27). Since upgrading to 14.11 I get errors like: [controller:35605] mca: base: component_find: unable to open /opt/apps/openmpi/1.8.1/gcc/4.9/0/lib/openmpi/mca_ess_pmi: libslurm.so.27: cannot open shared

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Uwe Sauter
and/or dependencies. I'm afraid that you do indeed need to recompile OMPI in that case. You probably need to rerun configure as well, just to be safe. Sorry - outside OMPI's control :-/ On Thu, Apr 16, 2015 at 5:22 AM, Uwe Sauter uwe.sauter...@gmail.com wrote

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Uwe Sauter
setup. On Thu, Apr 16, 2015 at 5:32 AM, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi Ralph, beside the mentioned libslurm.so.28 there is also a libslurm.so pointing to the same libslurm.so.28.0.0 file. Perhaps OpenMPI could use this link

[slurm-dev] Re: 'sbatch : node count specification invalid' if more than 1 node in --nodelist

2015-04-05 Thread Uwe Sauter
Hi, I think that you need to specify the number of nodes when supplying a nodelist containing more than one node. Regards, Uwe Am 05.04.2015 um 09:21 schrieb Edrisse Chermak: Dear Slurm Developers and Users, I get an 'sbatch error: Batch job submission failed: Node count
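i.e. something along these lines, with hypothetical host names and the node count matching the length of the list:

    sbatch -N 2 --nodelist=node01,node02 job.sh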

[slurm-dev] Re: Problems running job

2015-03-31 Thread Uwe Sauter
Yes! There are problems if the clean-up scripts for cgroups reside on NFSv4. Nodes will lock-up when they try to remove a job's cgroup. Am 31.03.2015 um 17:06 schrieb Jeff Layton: That's what I've done. Everything is in NFSv4 except for a few bits: /etc/slurm.conf /etc/init.d/slurm

[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Uwe Sauter
It would be helpful to see how you submitted the job. And the output from scontrol show job 20. Regards, Uwe Am 30.03.2015 um 19:49 schrieb Carl E. Fields: Hello, I have installed slurm version version 14.11.4 on a RHEL server with the following specs: Architecture:

[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Uwe Sauter
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12] [SlurmUser@sod264 services]$ Thanks, Carl On Mon, Mar 30, 2015 at 11:01 AM, Uwe Sauter uwe.sauter...@gmail.com

[slurm-dev] Re: SLURMCTLD ERROR

2015-03-25 Thread Uwe Sauter
Please provide more information: Which OS? Which Slurm version? Installed via package or from source? Regards, Uwe Am 25.03.2015 um 13:09 schrieb suprita.bot...@wipro.com: Hi, can someone please help me find out why slurmctld is getting killed after only a few seconds.

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Uwe Sauter
And if you are planning on using cgroups, don't use NFSv4. There are problems that cause the NFS client process to freeze (and with that freeze the node) when the cgroup removal script is called. Regards, Uwe Sauter Am 24.03.2015 um 20:50 schrieb Paul Edmon: Yup, that's exactly

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Uwe Sauter
to communicate and to drain the node. Uwe Am 24.03.2015 um 21:12 schrieb Paul Edmon: Interesting. Yeah we use v3 here. Hadn't tried out v4, and good thing we didn't then. -Paul Edmon- On 03/24/2015 04:05 PM, Uwe Sauter wrote: And if you are planning on using cgroups, don't

[slurm-dev] Re: How to debug a job that won't start

2015-03-13 Thread Uwe Sauter
think of right now. I'll have another espresso soon enough and will reply if anything else comes to mind. I hope this helps! John DeSantis 2015-03-12 4:59 GMT-04:00 Uwe Sauter uwe.sauter...@gmail.com: No one able to give a hint? Am 10.03.2015 um 17:05 schrieb Uwe Sauter: Hi, I have

[slurm-dev] Re: How to debug a job that won't start

2015-03-12 Thread Uwe Sauter
No one able to give a hint? Am 10.03.2015 um 17:05 schrieb Uwe Sauter: Hi, I have an account production configured with limitations GrpNodes=18, MaxNodes=18, MaxWall=7-00:00:00, an associated user with limitations MaxNodes=18, MaxWall=7-00:00:00 and a QoS with limitations Priority=10

[slurm-dev] Bug: different priorities from scontrol and sprio

2015-03-12 Thread Uwe Sauter
Hi, there is a difference in the output of scontrol show job and sprio (14.11.4). I have two jobs, one was submitted before slurmctld was restarted, the other one after the restart. sprio -l shows: JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE 14115

[slurm-dev] Re: node getting again and again to drain or down state

2015-03-10 Thread Uwe Sauter
Subject: [slurm-dev] Re: node getting again and again to drain or down state What is the output of sinfo -R for this node ? Le 10/03/2015 10:08, Uwe Sauter a écrit : Check that your node resources in slurm.conf represent your actual configuration, e.g. that the amount of memory in your

[slurm-dev] How to debug a job that won't start

2015-03-10 Thread Uwe Sauter
Hi, I have an account production configured with limitations GrpNodes=18, MaxNodes=18, MaxWall=7-00:00:00, an associated user with limitations MaxNodes=18, MaxWall=7-00:00:00 and a QoS with limitations Priority=10, GraceTime=00:00:00, PreemptMode=cluster, Flags=DenyOnLimit, UsageFactor=1.0,

[slurm-dev] Re: node getting again and again to drain or down state

2015-03-10 Thread Uwe Sauter
Check that your node resources in slurm.conf represent your actual configuration, e.g. that the amount of memory configured in slurm.conf is equal to or less than what the node actually has. Am 10.03.2015 um 10:05 schrieb suprita.bot...@wipro.com: Hi Please help me if anyone can. I am running

[slurm-dev] Re: umask for output files

2015-03-05 Thread Uwe Sauter
If you know the name of your output file you could probably do something like this: touch output chmod 0666 output chown user:group output srun a.out Am 05.03.2015 um 22:11 schrieb Slurm User: Hi I have a bash script which makes a call to srun The srun command calls a simple a.out

[slurm-dev] RE: Odd problem with CPU totals

2015-03-05 Thread Uwe Sauter
I think the problem lies in your configuration. Having both CPUs=4 and (SocketsPerBoard=2 CoresPerSocket=2) is redundant. Please try with one or the other, preferably with SocketsPerBoard=2 CoresPerSocket=2 as this provides information for CPU pinning. Am 06.03.2015 um 00:02 schrieb Sarlo,

[slurm-dev] Re: Upgrade Slurm to latest version

2015-02-11 Thread Uwe Sauter
the new version? -Original Message- From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] Sent: Wednesday, February 11, 2015 9:29 AM To: slurm-dev Subject: [slurm-dev] Re: Upgrade Slurm to latest version Hi, as far as I know (and someone please correct me if I'm wrong) Slurm

[slurm-dev] Re: Upgrade Slurm to latest version

2015-02-11 Thread Uwe Sauter
Message- From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] Sent: Wednesday, February 11, 2015 9:29 AM To: slurm-dev Subject: [slurm-dev] Re: Upgrade Slurm to latest version Hi, as far as I know (and someone please correct me if I'm wrong) Slurm will support the two latest minor

[slurm-dev] Re: Upgrade Slurm to latest version

2015-02-11 Thread Uwe Sauter
Message- From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] Sent: Tuesday, February 10, 2015 4:01 PM To: slurm-dev Subject: [slurm-dev] Re: Upgrade Slurm to latest version Hi Mark, most of your questions are answered in the upgrade section of this page: http://slurm.schedmd.com

[slurm-dev] Re: Upgrade Slurm to latest version

2015-02-10 Thread Uwe Sauter
Hi Mark, most of your questions are answered in the upgrade section of this page: http://slurm.schedmd.com/quickstart_admin.html If you have more questions after reading this, feel free to come back. Regards, Uwe Am 10.02.2015 um 20:47 schrieb Los, Mark J: I have been asked to

[slurm-dev] Re: slurm-dev release schedule - 14.11.4 -- Re: Re: Small bug in scontrol output

2015-02-03 Thread Uwe Sauter
Is it really necessary to re-link MPI for every bug fix release? I had some trouble with MPI after the upgrade 14.03 -> 14.11 but I haven't seen problems between bug fix releases so far… Could someone from SchedMD enlighten us? Regards, Uwe Am 03.02.2015 um 01:36 schrieb Kevin Abbey:

[slurm-dev] Re: Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread Uwe Sauter
Might be worth looking into node features (see http://slurm.schedmd.com/slurm.conf.html). Regards, Uwe Am 03.02.2015 um 18:36 schrieb John Desantis: Hello all, Unfortunately, I have some confusion regarding how to achieve a global and single partition for our users with several

[slurm-dev] Small bug in scontrol output

2015-01-28 Thread Uwe Sauter
Hi all, there seems to be a small bug in scontrol show job output when using stderr redirection. %j is not substituted with the jobID. This is on 14.11.3. Submit like: #sbatch -o test-o%j.txt -e test-e%j.txt -N2 -n4 -A admins -t 300 ./test.sh Submitted batch job 12032 #scontrol show job

[slurm-dev] Re: Lock ups with NFSv4 [was: Connection Refused with job cancel]

2015-01-26 Thread Uwe Sauter
Hi all, re-configuring my cluster to use NFSv3 instead of v4 makes the situation go away. I'll leave it that way for now… Thanks for the tip, Uwe Am 19.01.2015 um 23:29 schrieb Christopher Samuel: On 19/01/15 19:46, Uwe Sauter wrote: yes, going back to Scientific 6.5 make

[slurm-dev] Re: Lock ups with NFSv4 [was: Connection Refused with job cancel]

2015-01-19 Thread Uwe Sauter
Hi Trey Am 19.01.2015 um 01:52 schrieb Trey Dockendorf: Uwe, Sorry for delayed response, for some reason messages from slurm-dev are not making it to my inbox so had to find the response via google groups page. Don't worry, there is weekend all around the globe… We also had numerous

[slurm-dev] Environment in prolog/epilog

2015-01-19 Thread Uwe Sauter
Hi, is there a list of SLURM environment variables which I can access in the different prolog/epilog scripts? Specifically is it possible to get a list of nodes for a job in the PrologSlurmctld script although this runs on the controller host? Perhaps this information could be added to

[slurm-dev] Re: Environment in prolog/epilog

2015-01-19 Thread Uwe Sauter
Stolarek: 2015-01-19 17:35 GMT+01:00 Uwe Sauter uwe.sauter...@gmail.com: Hi, is there a list of SLURM environment variables which I can access in the different prolog/epilog scripts? Specifically is it possible to get a list of nodes

[slurm-dev] Lock ups with NFSv4 [was: Connection Refused with job cancel]

2015-01-17 Thread Uwe Sauter
Hi Trey, Christopher, I am running into a lock up situation since updating from Scientific Linux 6.5 to 6.6 in mid-December (2.6.32-431.29.2 to 2.6.32-504.3.3), running Slurm 14.11.2. My cluster runs from NFSv4 root but has local disks for TMP. Jobs that don't use local TMP much run fine

[slurm-dev] Re: Restricting number of jobs per user in partition

2015-01-14 Thread Uwe Sauter
Hi, have a look into Slurm accounting/QoS. There are options to limit jobs per user, jobs per group, etc. pp. http://slurm.schedmd.com/accounting.html http://slurm.schedmd.com/qos.html Regards, Uwe Am 14.01.2015 um 10:14 schrieb Loris Bennett: Hi, I have a test partition in
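As a rough sketch of the QoS route; the QoS name, limit and user below are illustrative, and exact limit names depend on your Slurm version:

    sacctmgr add qos limited
    sacctmgr modify qos where name=limited set MaxJobsPerUser=2
    sacctmgr modify user where name=alice set qos+=limited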

[slurm-dev] Re: Restricting number of jobs per user in partition

2015-01-14 Thread Uwe Sauter
branches, this is the point where the tree analogy fails to apply. Regards, Uwe Am 14.01.2015 um 14:47 schrieb Loris Bennett: Hi, Uwe Sauter uwe.sauter...@gmail.com writes: Hi, an association is the combination of * QoS * partition * account * cluster If I understand

[slurm-dev] Re: Restricting number of jobs per user in partition

2015-01-14 Thread Uwe Sauter
, Uwe Sauter uwe.sauter...@gmail.com writes: Hi, have a look into Slurm accounting/QoS. There are options to limit jobs per user, jobs per group, etc. pp. http://slurm.schedmd.com/accounting.html http://slurm.schedmd.com/qos.html I saw that a QOS can restrict the number of jobs per

[slurm-dev] Re: Restricting number of jobs per user in partition

2015-01-14 Thread Uwe Sauter
configuration to restrict number of procs. Per user using associations. Could you please suggest any command.. or Configuration related stuff..? Thanks, Tejas -Original Message- From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] Sent: Wednesday, January 14, 2015 5:27 PM To: slurm

[slurm-dev] Re: Allocation will fail if selecting partition

2014-12-11 Thread Uwe Sauter
Am 10.12.2014 um 17:52 schrieb Uwe Sauter: Hi all, Friday afternoon I accidentally upgraded from 14.03.9 to 14.11.1 (just wanted to compile but then a symlink was changed and the new version was started). My users were still using the older version of the tools. Since Monday

[slurm-dev] Re: Allocation will fail if selecting partition

2014-12-11 Thread Uwe Sauter
to have something like although enough nodes are available, the job cannot run due to [reason]. Regards, Uwe Am 11.12.2014 um 10:25 schrieb Uwe Sauter: Hi all, I flushed my database and downgraded to 14.03.10 (with --enable-debug) but the problem still exists. What confuses me most

[slurm-dev] Error in sacctmgr usage message

2014-12-11 Thread Uwe Sauter
Hi all, there is an error in the usage message of sacctmgr. # sacctmgr --help [...] One can get an number of characters by following the field option with a %NUMBER option. i.e. format=name%30 will print 30 chars of field name. Account- Account, CoordinatorList,

[slurm-dev] Re: Allocation will fail if selecting partition

2014-12-11 Thread Uwe Sauter
Thanks Moe, that'll make it easier to see why jobs are in pending state though there are enough nodes available. Regards, Uwe Am 11.12.2014 um 18:37 schrieb je...@schedmd.com: Quoting Uwe Sauter uwe.sauter...@gmail.com: Hi all, I was able to resolve this issue. The problem

[slurm-dev] Allocation will fail if selecting partition

2014-12-10 Thread Uwe Sauter
Hi all, Friday afternoon I accidentally upgraded from 14.03.9 to 14.11.1 (just wanted to compile but then a symlink was changed and the new version was started). My users were still using the older version of the tools. Since Monday (but probably since the update) users weren't able to submit

[slurm-dev] Re: linux users and slurm

2014-11-05 Thread Uwe Sauter
Hi Anna, I'm sorry to inform you that you have to have the user information on all nodes. You cannot run jobs with UIDs from users the local system does not know. If you don't want to distribute your /etc/passwd, /etc/shadow and /etc/group every time a user is added or removed the best option

[slurm-dev] Re: linux users and slurm

2014-11-05 Thread Uwe Sauter
Kostikova: Dear Uwe, Thanks a lot for your quick help and explanation. Indeed, we use openldap right now, but was wondering whether another solution is possible. So, it seems like the best (only) solution is LDAP indeed. Thanks a lot again, Anna On 5 Nov 2014, at 20:37, Uwe

[slurm-dev] Documentation mismatch: man pages / html

2014-10-20 Thread Uwe Sauter
Hi all, I'm trying to configure the scheduling parameter max_switch_wait on 14.03.8 but a) there seems to be a mismatch between the html documentation and the salloc/sbatch/srun man pages. b) Slurm doesn't seem to know the parameters referenced in the documentation. Regarding a) Manpages

[slurm-dev] Re: Documentation mismatch: man pages / html

2014-10-20 Thread Uwe Sauter
=max_switch_wait=864000,... Regards, Carles Fenoy Barcelona Supercomputing Center On Mon, Oct 20, 2014 at 10:27 AM, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi all, I'm trying to configure the scheduling parameter max_switch_wait on 14.03.8

[slurm-dev] Re: SLURM experience with high throughout of short-running jobs?

2014-10-14 Thread Uwe Sauter
Hi Chris, you're right that the job array size is limited to 64k in 14.03 and before. With the upcoming 14.11 this limit is raised to 4M IIRC. You could check this year's SLURM user group presentations (http://slurm.schedmd.com/publications.html) where this was mentioned. As far as I know there

[slurm-dev] plugins: documentation - code mismatch

2014-10-09 Thread Uwe Sauter
Hi all, could someone please confirm that the variable const uint32_t plugin_legacy found in http://slurm.schedmd.com/plugins.html - Data Objects was replaced with const uint32_t min_plug_version found in several code files in src/plugins/* sometime in the past? If this is correct, could

[slurm-dev] Re: Error with slurmctld

2014-10-09 Thread Uwe Sauter
Hi Monica, Am 09.10.2014 19:59, schrieb Monica Marathe: Hi Uwe, Thanks for your help on the slurm error. I created a new slurm.conf using the easy configurator but am still facing the following error: [root@control-machine Monica]# slurmctld -D -vv slurmctld: pidfile not locked,

[slurm-dev] Re: Sample slurm.conf

2014-10-09 Thread Uwe Sauter
And port 6817 as well Am 09.10.2014 21:04, schrieb Monica Marathe: Hey Michael, I did build my configuration file: # slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. #

[slurm-dev] Re: slurm and grid capabilities

2014-10-03 Thread Uwe Sauter
Hi, there are some limited capabilities for the tools to send/query clusters they don't belong to (search for the -M option). Tools belong to the cluster that is configured in the slurm.conf they use. And there is some work taking place at CSCS (Switzerland) that was presented last week on the

[slurm-dev] Re: Error with slurmctld

2014-10-01 Thread Uwe Sauter
Hi Monica, Am 01.10.2014 21:55, schrieb Monica Marathe: Hey, It's my first time using SLURM and I'm getting the following error when I run slurmctld: [root@localhost ~]# slurmctld -D -vv slurmctld: debug2: No last_config_lite file (/tmp/last_config_lite) to recover slurmctld:

[slurm-dev] sinfo manpage error?

2014-09-16 Thread Uwe Sauter
Hi all, taken from the current 14.03.7 version of the sinfo manpage: snip -o output_format, --format=output_format Specify the information to be displayed using an sinfo format string. Format strings transparently used by sinfo when running with various options are

[slurm-dev] Re: sinfo manpage error?

2014-09-16 Thread Uwe Sauter
Hi Moe, thank you. Can this format specifier also be used in the job name field? Best, Uwe Am 16.09.2014 18:27, schrieb je...@schedmd.com: Documentation updated. see: https://github.com/SchedMD/slurm/commit/3e5864b6486bbd95ceacd695a503f85b3c0c4b8c Quoting Uwe Sauter

[slurm-dev] Re: remote job submission

2014-09-08 Thread Uwe Sauter
Hi Erica, you need munge running and slurm installed. The local slurm.conf needs to point to the control server (ControlAddr and/or ControlMachine). Easiest way is to use the same config file as for the cluster. Regards, Uwe Am 08.09.2014 19:26, schrieb Erica Riello: Hi, I have
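In other words, on the submission host you'd have something like the following; the host name and address are placeholders:

    # /etc/slurm/slurm.conf on the submit host (ideally the same file as on the cluster)
    ControlMachine=slurm-master
    ControlAddr=192.168.1.10
    # plus munged running with the same munge.key as the cluster nodes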

[slurm-dev] Re: Power save config option: BatchStartTimeout

2014-09-05 Thread Uwe Sauter
Hi all, *bump* I can't believe no one has an explanation for this parameter... Regards, Uwe Am 02.09.2014 um 16:30 schrieb Uwe Sauter: Hi all, I'm a bit confused by the explanation of the BatchStartTimeout option. It states: Specifies how long to wait after a batch job

[slurm-dev] Power save config option: BatchStartTimeout

2014-09-02 Thread Uwe Sauter
Hi all, I'm a bit confused by the explanation of the BatchStartTimeout option. It states: Specifies how long to wait after a batch job start request is issued before we expect the batch job to be running on the compute node. Depending upon how nodes are returned to service, this value may need

[slurm-dev] Re: Power save support (not working?)

2014-09-01 Thread Uwe Sauter
show hostnames $1` for host in $hosts do echo sudo /share/system/bin/node_poweroff $host /var/log/power_save.log sudo /share/system/bin/node_poweroff $host /var/log/power_save.log done On Fri, 2014-08-29 at 02:36 -0700, Uwe Sauter wrote: Hi, thanks for the suggestion

[slurm-dev] Re: Power save support (not working?)

2014-08-29 Thread Uwe Sauter
with scontrol and checking the log file. On 28 Aug 2014 19:18, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi all, (configuration and scripts below text) I have configured SLURM to power down idle nodes but it probably is misconfigured. I aim for a configuration where after a certain period

[slurm-dev] Re: Upgrading and not losing jobs

2014-08-24 Thread Uwe Sauter
Hi Dennis, I started using SLURM only a few weeks ago but I suspect that an update from 2.4.x to 14.03.x in a single step is not possible because of too many changes in internal structures (both job state information and database). There is an entry in the FAQ

[slurm-dev] Re: How to size the controller systems

2014-08-18 Thread Uwe Sauter
Hi Louis, depending on the usage scenario of your cluster you will have different requirements. You can find general information about SLURM configuration on the SchedMD website: http://slurm.schedmd.com/ There you will also find more specific subpages regarding * cluster configuration for

[slurm-dev] Dynamic partitions on Linux cluster

2014-08-14 Thread Uwe Sauter
? Or is there another way that I don't see? Best regards, Uwe Sauter

[slurm-dev] Re: Dynamic partitions on Linux cluster

2014-08-14 Thread Uwe Sauter
others are not using it. Bill. -- Bill Barth, Ph.D., Director, HPC bba...@tacc.utexas.edu| Phone: (512) 232-7069 Office: ROC 1.435 | Fax: (512) 475-9445 On 8/14/14, 4:11 AM, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi all, I got a question about
