Hi,
Testing a new slurm cluster (14.11.4) with 1k nodes.
Several things we've tried:
Increase slurmctld threads (8-port range)
Increase munge threads (threads=10)
Increase MessageTimeout to 30
We are using accounting (db on different server)
Thanks for any help
FYI
If using an unpatched gnu binutils to link slurm, it will not build as
the symbols are undefined in libncurses.
Adding -ltinfo should solve the issue.
Other option is to use -Wl,--copy-dt-needed-entries, as that option's
default has changed in recent GNU ld. Redhat in fedora provide their
... error: undefined reference to 'stdscr'
smap.c:319: error: undefined reference to 'nodelay'
smap.c:366: error: undefined reference to 'curs_set'
collect2: ld returned 1 exit status
and stdscr reference:
$ readelf -Ws /usr/lib64/libncurses.so | grep stdscr
12:
"; then
- NCURSES="-lncurses"
+ NCURSES="-ltinfo -lncurses"
NCURSES_HEADER="ncurses.h"
ac_have_some_curses="yes"
elif test "$ac_have_curses" = "yes"; then
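For reference, a quick way to check where the curses symbols actually ended
up, and a workaround that avoids patching configure (library name and paths
below are what I'd expect on a RHEL-ish box, adjust as needed):

$ readelf -Ws /usr/lib64/libtinfo.so.5 | grep -w stdscr   # the symbol should show up here, not in libncurses
$ LDFLAGS="-ltinfo" ./configure && make                   # or LDFLAGS="-Wl,--copy-dt-needed-entries"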
On 05/09/2015 11:32 AM, Daniel Letai wrote:
Fo
then
- NCURSES="-lncurses"
+ NCURSES="-ltinfo -lncurses"
NCURSES_HEADER="ncurses.h"
ac_have_some_curses="yes"
elif test "$ac_have_curses" = "yes"; then
On 05/09/2015 12:31 PM, Daniel Letai wrote:
Here's a sample patch:
Have you looked into epilog as a means to start your analysis automatically?
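A bare-bones sketch of what I mean (script path and the notification method
are just placeholders, adapt to whatever you want to trigger):

In slurm.conf:
Epilog=/etc/slurm/job_epilog.sh

and /etc/slurm/job_epilog.sh:
#!/bin/bash
# runs on every allocated node when the job ends; slurmd exports
# SLURM_JOB_ID and SLURM_JOB_USER to the script's environment
logger -t slurm-epilog "job $SLURM_JOB_ID of user $SLURM_JOB_USER finished on $(hostname)"
exit 0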
On 05/13/2015 05:33 PM, Trevor Gale wrote:
Hey all,
I was just wondering if there is any mechanism built into slurm to signal to
the user when jobs are done (other than email). I’m making a script to run a
series of
?
Thanks,
Trevor
> On May 13, 2015, at 3:21 PM, Daniel Letai wrote:
>
>
> Have you looked into epilog as a means to start your analysis
automatically?
>
> On 05/13/2015 05:33 PM, Trevor Gale wrote:
>> Hey all,
>>
>>
I know this was previously discussed but I just ran into the same issue
today - 3 tables do not have a primary key:
*_last_ran_table (I use hourly_rollup)
*_suspend_table (I use job_db_inx)
and
jobcomp_table
I'm thinking of using jobid, but would like your input.
Thanks in advance,
--Dani_L.
I had to use composite pk as jobid is not unique (requeue recreates it).
I used jobid,starttime,endtime.
Is there a better solution?
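For completeness, the statement I have in mind is roughly the following
(the database name is an assumption - use whatever your JobCompLoc points
at, and back up first):

mysql slurm_jobcomp_db -e "ALTER TABLE jobcomp_table ADD PRIMARY KEY (jobid, starttime, endtime);"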
On 05/27/2015 06:13 PM, Daniel Letai wrote:
I know this was previously discussed but I just ran into the same
issue today - 3 tables do not have a primary
Are you sure you set SelectType=select/cons_res?
It seems from your description that slurm allocates entire nodes for jobs.
On 06/17/2015 10:28 AM, Saerda Halifu wrote:
Hi,
I just updated to slurm 14.11.7 , and having following issue.
I have nodes with 32 cores, they all have 1 core job alloca
What should be the default?
Suppose I have 10G mem nodes, each with a 10 core socket.
Should I allocate 1G==1000M in DefMemPerCPU, or just 1M to force users
to actually use --mem in their job submissions?
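Just to make the two options concrete (numbers only fit the 10G/10-core
example above):

# slurm.conf, option 1 - default of roughly 1G per allocated core:
DefMemPerCPU=1000
# option 2 - near-zero default, so users must pass --mem/--mem-per-cpu explicitly:
#DefMemPerCPU=1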
On 06/09/2015 06:12 PM, Moe Jette wrote:
Slurm has always allocated all of the memory o
Without looking at the code, I'd assume the slurmctld is responsible for
allocating job ids.
Consider submitting arrays - the individual elements are not assigned
job id until they enter the running state.
Considering that, I'd look for job id assignment in the slurmctld code
(e.g. job_scheduler
Currently I have 2 types of nodes:
old = 2 sockets, 4 cores per socket, 64GB mem
new = 2 sockets, 6 cores per socket, 128GB mem
Since I'm using select/cons_res and CR_CPU_Memory, I thought I'd
assign as default the relative amount of memory per core,
old - DefMemPerCPU = 8000
new - DefMem
On Wed, Jun 17, 2015 at 9:22 AM, Daniel Letai wrote:
Currently I have 2 types of nodes:
old = 2 sockets, 4 cores per socket, 64GB mem
new = 2 sockets, 6 cores per socket, 128GB mem
Since I'm using sele
Hi,
I've noticed configure checks for json parser availability, however in
rhel6 based systems the json-c-devel rpm from epel(6) installs to
/usr/include/json while the configure check is for /usr/include/json-c
(configure:19424)
BTW, when building without packaging (i.e. tar xf
slurm-15
On 09/09/2015 01:53 AM, Moe Jette wrote:
Quoting Daniel Letai :
Hi,
I've noticed configure checks for json parser availability, however
in rhel6 based systems the json-c-devel rpm from epel(6) installs to
/usr/include/json while the configure check is for
/usr/include/json-c (conf
Sorry, was sent to list by mistake, please delete
On 09/10/2015 04:23 PM, Daniel Letai wrote:
On 09/09/2015 01:53 AM, Moe Jette wrote:
Quoting Daniel Letai :
Hi,
I've noticed configure checks for json parser availability, however
in rhel6 based systems the json-c-devel rpm from e
Basically, set up socket-based scheduling (using sockets instead of
cores), and build a gres configuration for the GPUs, 2 lines - 1 with
CPUs=0-9, the other with CPUs=10-19
see http://slurm.schedmd.com/gres.html
and http://slurm.schedmd.com/slurm.conf.html (search for CR_Socket)
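Roughly, I mean something like this in gres.conf (device files are assumed,
adjust to the actual GPUs):

Name=gpu File=/dev/nvidia0 CPUs=0-9
Name=gpu File=/dev/nvidia1 CPUs=10-19

together with SelectTypeParameters=CR_Socket (or CR_Socket_Memory) in slurm.conf.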
I'm not sure
It would be easy if there was a way to force TRES
allocation/reconfiguration, e.g.
Add the swap as GRES/swap, and on suspend transfer the allocation from
TRES=mem=64,GRES/swap=0 to TRES=mem=0,GRES/swap=64. Then you could start
the new job which requires available mem.
Will it be possible to
I'm curious - what would be the point of such scheduling?
I tried to think about a scenario in which such a setting would gain me
anything significant and came up with nothing. What is the advantage of
this distribution?
On 11/18/2015 08:37 AM, cips bmkg wrote:
Fwd: SLURM : how to have a round-
Hi,
Suppose I have a 100 node cluster with ~5% nodes down at any given time
(maintenance/hw failure/...).
One of the projects requires exclusive use of 5 nodes, and must be able to
use the entire cluster when available (when other projects aren't running).
I can do this easily if I maintain a stati
-Paul Edmon-
On 11/19/2015 04:49 AM, Daniel Letai wrote:
Hi,
Suppose I have a 100 node cluster with ~5% nodes down at any given
time (maintenance/hw failure/...).
One of the projects requires exclusive use of 5 nodes, and be able to
use entire cluster when available (when other projects a
It's less seamless to the
users as they will have to consciously monitor what is going on.
-Paul Edmon-
On 11/19/2015 10:50 AM, Daniel Letai wrote:
Can you elaborate a little? I'm not sure what kind of QoS will help,
nor how to implement one that will satisfy the requirements.
so that
the project cannot dominate the partition; Reservations could be used
too, but you'd need to define at a minimum a start time and duration -
and when not in use the hardware would be idle and unavailable to
other users.
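For example (reservation name, user and size are placeholders, and the
REPLACE flag needs a recent enough Slurm), a standing reservation along
these lines would keep 5 working nodes aside and swap out any that go down:

scontrol create reservation ReservationName=proj5 Users=projuser \
    StartTime=now Duration=UNLIMITED NodeCnt=5 Flags=REPLACE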
John DeSantis
2015-11-19 13:31 GMT-05:00 Daniel Letai :
ail nodes in special, and the second would run
1 element on special. Would it then use public for the other 3 elements
(provided public has some idle nodes)?
HTH!
John DeSantis
Thanks for your input, it's very helpful :)
--Dani_L.
2015-11-21 11:29 GMT-05:00 Daniel Letai :
John,
That's cor
Hi,
I've run into the same issue with slurm-15.08.3, OS: RHEL 6.5 x64.
slurmctld is reading the SlurmUser setting and starts as user slurm,
however slurmd doesn't respect the SlurmdUser config.
If SlurmdUser is commented out, slurmd starts as user root (in
accordance with documentation) - co
Hi,
I'm trying to build slurm rpms using:
rpmbuild -ta slurm-15.08.6.tar.bz2
And I'm getting errors in netloc_to_topology.c (from contrib/sgi)
I've tried
rpmbuild -ta --define '%_without_netloc 1' ... but I got the same errors
(not surprising, as the spec file has no provision for netloc in th
On 01/05/2016 05:25 PM, GOLPAYEGANI, NAVID (GSFC-6190) wrote:
Thank you for the quick response. See below for my reply.
On 1/4/16, 6:25 PM, "je...@schedmd.com" wrote:
Quoting "GOLPAYEGANI, NAVID (GSFC-6190)" :
Hi,
SLURM newbie here. Anybody have suggestions on how to do the
scheduling
f
MS AD to control Linux nodes? Too much overhead and susceptible to errors.
If the headnode is a VM, you must time-sync the host, not the guest, as
in most paravirt environments the guest will automatically adjust its
clock to the host's.
On 01/04/2016 10:54 PM, Dennis Mungai wrote:
Hello Fany,
Just one comment regarding openmpi building:
https://wiki.fysik.dtu.dk/niflheim/SLURM#mpi-setup - At least with
regard to openmpi, it should be built --with-pmi
On 01/05/2016 01:26 PM, Ole Holm Nielsen wrote:
On 01/05/2016 12:12 PM, Randy Bin Lin wrote:
I was wondering if anyone has a more
Another way to do this would be with features.
Then mpi jobs must specify the feature=OSstring to run, and others can
run without the feature.
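Roughly (the feature name is only a placeholder):

# slurm.conf
NodeName=node[001-100] Feature=os_a ...
# MPI jobs then request it explicitly:
sbatch --constraint=os_a mpi_job.sh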
I'd use partitions to differentiate HW, not OS, but that's just my
personal bias.
The only issue you'd have is building the slurm rpms 3 times each ti
What's the nsswitch like on the node?
from the node, can you do:
# getent passwd | grep
On 01/05/2016 08:31 PM, Koji Tanaka wrote:
Hello Slurm Community,
I get following errors when I run a job as a LDAP user. However, as a
local user, everything works fine.
$ srun -N1 hostname
srun: error:
Just a couple of observations
1) Naively you can create a skeleton sbatch template which normal jobs
would use by default, and another that multi-thread jobs must
specifically request. Populate the templates via a wrapper script or
submit plugin - use only sbatch, not srun. Other option would b
Your MaxJobCount/MinJobAge combo might be too high, and the slurmctld is
exhausting physical memory, resorting to swap, which slows it down enough
to exceed its scheduling loop time window.
You might wish to increase the scheduling loop duration as per
http://slurm.schedmd.com/slurm.conf.html#OP
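Something along these lines is what I'd experiment with (values are
arbitrary, and check the man page for your version before trusting the
parameter names):

MaxJobCount=50000
MinJobAge=300
SchedulerParameters=max_sched_time=4,sched_interval=60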
A similar question has been asked before (not by me), without an answer:
https://groups.google.com/forum/?hl=en#!topic/slurm-devel/4xkvs0dgYu8
Specifically - suppose I have a gpu cluster, 2 gpus per node, where some
gpus might or might not function correctly (due to heat/fw
issues/malfunction/
1. Regarding QOS - did you set slurm.conf to enforce qos limits?
AccountingStorageEnforce=limits,qos
2. Regarding your original question "limit the
number of nodes allocated at the same time to a partition" -
I'm not sure what you mean b
On 02/19/2016 01:44 AM,
je...@schedmd.com wrote:
Slurm version 15.08.8 is now available and includes about 30 bug
fixes
developed over the past four weeks.
Slurm version 16.05.0-pre1 is also available and includes new
develo
Hi,
I have recently opened a bug in https://bugs.schedmd.com/ but I have
now discovered github https://github.com/SchedMD/slurm also seems
quite active - should I have opened the issue in github?
What's the correct/preferred way to report
Please disregard - slurm on github has "issues" disabled. My
mistake.
On 03/01/2016 05:39 PM, Daniel Letai
wrote:
Where should I report bugs/issues ?
Hi,
I hav
Correct me if I'm wrong, but I don't see any NUMA-based reservation
of the CPUs - do you ensure that each reserved CPU is from a
different socket, and that GPU jobs' affinity is to the correct NUMA node?
On 03/02/2016 12:30 AM, Lachele Foley
wrote:
We went a different route - a healthcheck agent script will modify
the node's features, adding "functional" once all pertinent metrics
are met, including gpfs mount.
in slurm.conf the feature doesn't exist, and sbatch template has
--constrain
Another vote for xCAT here - been using it for ~3 years now, on
installations ranging from 8 to 1k+ nodes.
Once you get to know xCAT it's quite easy to manage, although
familiarity with perl will help in any troubleshooting or
customization (
This is somewhat convoluted, but you might achieve this with
gres.conf similar to
Name=gpu CPUs=0,1
Name=gpu CPUs=10,11
Name=cpu CPUs=2-9,12-19 count=16
and when submitting a job
sbatch --gres=gpu:1
Or
sbatch --gres=cpu:16
Or
sbatch
Looking at your patch, and without reviewing the code, I have one
question - is it possible for core 'c' not to be in core_map, nor in
part_core_map? I'm only asking because that case doesn't seem to be
covered by your patch (A private case wo
Does the socket file exist?
What's in your /etc/my.cnf (or my.cnf.d/some other config file) under
[mysqld]?
[mysqld]
socket=/path/to/datadir/mysql/mysql.sock
If a socket value doesn't exist, either create one, or create a link
between the actual socket file and /var/run/mysqld/mysqld.sock
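i.e. something like (the socket path is whatever your my.cnf actually says):

ln -s /path/to/datadir/mysql/mysql.sock /var/run/mysqld/mysqld.sock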
BT
his ?
Thank you in advance.
Regards,
Husen
On Sat, May 21,
2016 at 6:28 PM, Daniel Letai
wrote:
Does the socket file exist?
What's in your /etc/my.cnf (or my.cnf.d/s
The tar file contains a spec, so it's easy to just
rpmbuild -ta slurm-XXX.tar.bz2
and createrepo on the rpms.
If you require any special options during build, this is preferable
to using pre-built rpms, as it's quite easy to use defines
How about setting them as multiple clusters, instead of multiple
partitions?
Use sbatch -M cluster1,cluster2 to try to submit to both clusters;
the first cluster that accepts the job cancels it on the other.
I don't think it's possible to
Other option might be to use constraints
sbatch -C "part1|part2" will ensure the job runs only on one of the
partitions, and then use node weights as normal, without topology.
On 05/26/2016 12:22 PM, Stuart Franks
wrote:
Hi There,
Forgot the square brackets for the constraints option:
sbatch -C "[part1|part2]"
On 05/26/2016 12:22 PM, Stuart Franks
wrote:
Hi There,
I've recently setup SLURM at our office and have been struggling
to get weights to w
You could just retag it 16.06 and remove some of the "deadline"
pressure ;-)
On 05/30/2016 09:03 AM, Lachlan
Musicman wrote:
Re: [slurm-dev] Re: 16.05?
Fantastic - thanks. I am also going to presume that
Tuesday is PDT (
IIRC slurmdb-direct is only used when accounting doesn't use dbd,
and the slurmctld accesses a db directly.
I might be wrong, though.
On 06/02/2016 04:06 AM, Lachlan
Musicman wrote:
Re: Building SLURM
Actually, I've now also n
As a workaround - can you test
srun --cpu_bind=verbose,map_cpu:
mpirun -slot-list $SBATCH_CPU_BIND_LIST
I'm thinking -slot-list doesn't handle cpu masks, and slurm should
provide an explicit list of IDs.
On 06/07/2016 04:14 PM, Jason Bac
On 06/07/2016 05:24 PM, Steffen
Grunewald wrote:
On Tue, 2016-06-07 at 05:43:19 -0700, Steffen Grunewald wrote:
Good afternoon,
I'm looking for a (simple) set of preemption rules for the following planned
setup:
- three partitions: "urgent"
Possibly with `--multi-prog` as per
http://slurm.schedmd.com/srun.html#SECTION_MULTIPLE-PROGRAM-CONFIGURATION
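A minimal sketch of what that looks like (program names and task counts are
placeholders; %t expands to the task rank):

$ cat multi.conf
0    ./master
1-3  ./worker %t
$ srun -n 4 --multi-prog multi.conf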
On 06/13/2016 06:26 PM, Akhilesh Mishra
wrote:
ibrun's substitute on SLURM
Dear Developers and SLURM users,
Accounting, users, associations and Partitions
Seems like it should:
http://slurm.schedmd.com/sacctmgr.html#SECTION_SPECIFICATIONS-FOR-USERS
Add a 'partition=' to the user creation.
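i.e. something along the lines of (user/account/partition names assumed):

sacctmgr add user name=alice account=research partition=gpu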
On 06/29/2016 10:03 AM, Lachlan Musicman wrote:
Is it possible to set a Default Partition against a
how to monitor CPU/RAM usage on each node of a slurm job? python API?
You should use HDF5
http://slurm.schedmd.com/hdf5_profile_user_guide.html
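In short you'd need something like the following (sampling interval is just
an example):

# slurm.conf
AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherFrequency=30
# plus ProfileHDF5Dir=... in acct_gather.conf, then per job:
sbatch --profile=task job.sh
# and merge/inspect the per-node files with sh5util afterwards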
On 09/19/2016 03:41 AM, Igor Yakushin wrote:
Hi All,
I'd like to be able to see for a given jobid how much resources are
used b
On 09/19/2016 08:28 AM, Daniel Letai wrote:
Re: [slurm-dev] how to monitor CPU/RAM usage on each node of a slurm
job? python API?
You should use HDF5
http://slurm.schedmd.com/hdf5_profile_user_guide.html
On 09/19/2016 03:41 AM, Igor Yakushin wrote:
how to monitor CPU
One simple thing to do is enable
http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram
and use a simple script along the lines of:
#!/bin/bash
# drain the node if it fails to sync its clock against the site NTP server
ntpdate -u ntpsserver.cluster.local ; rc=$?
[[ $rc -ne 0 ]] && scontrol update NodeName=$HOSTNAME State=drain Reason=ntp_
In the gres.conf man page it's mentioned that "If generic resource
counts are set by the gres plugin function node_config_load(), this file
may be optional."
When looking at http://slurm.schedmd.com/gres_plugins.html I can't
figure out from the description for node_config_load() how to remove
Re: [slurm-dev] Re: How to account how many cpus/gpus per node has
been allocated to a specific job?
You should be able to do this with profiling data:
https://slurm.schedmd.com/hdf5_profile_user_guide.html
Just use the jobacct_gather plugin.
This is very probably an overkill
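If all you need is the allocation rather than the measured usage, a much
lighter option (the job id below is a placeholder) is:

scontrol -d show job <jobid>
# the detailed output lists, per node, the allocated CPU_IDs, memory and gres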
$ git diff
diff --git a/slurm.spec b/slurm.spec
index 941b360..6bb3014 100644
--- a/slurm.spec
+++ b/slurm.spec
@@ -346,6 +346,7 @@ Includes the Slurm proctrack/lua and job_submit/lua plugin
Summary: Perl tool to print Slurm job state information
Group: Development/System
Looks like an XY question.
What do you wish the scheduler to do, specifically?
SLURM is CLI centric, but it's possible to use a web
portal through 3rd party extensions
On 04/20/2017 01:22 PM, Parag Khuraswar wrote:
Hi All,
Hello all,
I'm having some trouble with no_consume gres.
Specifically, I'm trying to use a certain gres as a feature, but
prefer it to be gres so I can track its usage/consumption.
relevant portion of slurm.conf:
GresTypes=b
AccountingStorageTRES=gres/b
NodeName=n01
Re: [slurm-dev] slurm.conf for single node
Here's a complete slurm.conf which I often use for
testing/debugging.
You can safely drop the GresTypes,DebugFlags lines, or add any
other from man slurm.conf
ControlMachine=localhost
AuthType=au
Re: [slurm-dev] Re: Change in srun ?
Did you rebuild mpi with flag "--with-pmi" pointing at slurm's
include dir, after making sure slurm has pmi.h there? I usually
build with both --with-pmi and --with-slurm, although the last one
should be enabled by default.
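i.e. roughly (the Slurm install prefix is just an example):

./configure --with-slurm --with-pmi=/usr/local/slurm && make && make install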
Re: [slurm-dev] Re: Change in srun ?
Take a look at
https://www.cyberciti.biz/faq/rebuilding-ubuntu-debian-linux-binary-package/
Make sure slurm is already installed before rebuilding the
package, and pass the correct configure flags (--with-pmi= --with-slurm
Re: [slurm-dev] Re: Exceeded job memory limit problem
Do you have enough memory on your nodes?
What's the output of:
sinfo -N -O nodelist,memory:20
You might not have enough memory on the nodes for the dataset.
On 09/06/2017 10:36 AM, Sema Atasever
Re: [slurm-dev] defaults, passwd and data
Hello,
On 09/24/2017 08:35 AM, Nadav Toledo
wrote:
defaults, passwd and data
Hey all,
We are trying to setup a Slurm cluster for both cpu and gpu
partition