Hi there, perhaps the best way is to build the RPMs yourself from the
source code.
Please follow the instructions on this page:
http://www.schedmd.com/slurmdocs/quickstart_admin.html
David
On 01/16/2013 09:51 PM, Richard Casey wrote:
Hi,
On the webpage
Hi, do the spawned processes also change process group? The process
tracker should track all processes sharing the same process group ID.
Isn't there an option in ansys to tell it to wait for all its children?
Something like:
#!/bin/sh
sleep 3600
wait
David
On 01/17/2013 02:30 AM, Sten Wolf
Hi,
you have to restart slurmctld for the database to be created.
slurmdbd must be running before
you start slurmctld.
David
On 01/17/2013 06:55 PM, Richard Casey wrote:
Hi,
We installed slurm 2.5.1 with
rpmbuild -ta slurm*.tar.bz2
rpm --install the rpm files
All the
Damien,
None of your nodes in your output are currently down. Anything with an
endtime is something that has been accounted for. If there are events
without an endtime (Outside of actual downed nodes or Cluster processor
count lines) then those are the ones to look at.
It is interesting
Alessandro, if your jobs leave output files on the execution hosts you
can copy them back to the submission host or any other host using the
task epilog which runs after each job step.
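A sketch of such an epilog; the helper name and paths are my own assumptions (Slurm itself only provides environment variables such as SLURM_JOB_ID and SLURM_SUBMIT_HOST in the epilog environment):

```shell
# copy_back SRC DEST: stage output files left on the execution host
# back to shared storage. In a real task epilog you would build SRC
# from e.g. $SLURM_JOB_ID and point DEST at the submission host's
# filesystem; both names here are illustrative.
copy_back() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -r "$src" "$dest"/
}
```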
-
#!/bin/sh
#SBATCH --partition=compute
#SBATCH --nodes=2
sbcast -f shello /tmp/shello
I think the idea is that given a script like this one:
-
cat myenv
#!/bin/sh
hostname
ulimit -a
env|sort
echo done: `date`
-
run it as:
ssh myhost myenv > LOG.ssh
and as
srun -p mypartition -w myhost myenv > LOG.srun
then
Hi, upon startup slurmctld changes its working directory to where the log
file is.
If the log file is:
SlurmctldLogFile=/var/tmp/slurm/slurmctld.log
the working directory is /var/tmp/slurm. Assuming your slurmctld core dumps
for whatever reason, the core file should be there.
The directory should be
By default Slurm allocates nodes in exclusive mode. You have to use
Consumable Resources to achieve what you want.
http://schedmd.com/slurmdocs/cons_res.html
/David
On Wed, Feb 6, 2013 at 10:47 AM, Niels Rothermel
niels.rother...@googlemail.com wrote:
Hello,
I think my problem is pretty
I think you have done the steps correctly. What was the error that happened?
/David
On Wed, Feb 6, 2013 at 12:50 PM, Mario Kadastik mario.kadas...@cern.chwrote:
Hi,
today I was adding a few nodes to slurm so I added them to slurm.conf
nodes definition and then restarted slurm controller.
not responding
[2013-02-19T17:42:20] error: Nodes lufer121 not responding
[2013-02-19T17:43:59] error: Nodes lufer121 not responding, setting DOWN
On 02/19/2013 04:50 PM, David Bigagli wrote:
Hi, have a look at the slurmd.log on the non-responding hosts. Does it
show any relevant information?
they are the same. If this is
expected, ignore it and set DebugFlags=NO_CONF_HASH in your slurm.conf.
/David
On Thu, Feb 21, 2013 at 10:16 AM, Marcin Stolarek stolarek.mar...@gmail.com
wrote:
2013/2/20 David Bigagli da...@schedmd.com
Hi, do you share your slurm.conf or each node has its own? All
Hi, are those files created by a job? In such case you can use the job
epilog to copy the data when the job is done.
http://www.schedmd.com/slurmdocs/prolog_epilog.html
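For reference, the job epilog is wired up in slurm.conf roughly like this (the path is an example only):

```
# run on every allocated node, as root, after the job completes
Epilog=/etc/slurm/epilog.sh
```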
/David
On Sun, Feb 24, 2013 at 6:58 PM, Giulio V. de Musso g...@libero.it wrote:
Hi
I've setup a SLURM cluster made up
Hi,
have a look at ReturnToService in 'man slurm.conf'.
/David
On Fri, Mar 1, 2013 at 9:12 PM, Tim caphrim...@gmail.com wrote:
Hi folks,
I'm having some concerns with some of my slurm nodes in that the slightest
thing seems to make them go into a down state. Low CPUs, node unexpectedly
Have a look at 'man sacct'.
/David
On Mon, Mar 4, 2013 at 8:18 PM, Giulio V. de Musso g...@libero.it wrote:
Hi
I run some sbatches on my SLURM cluster. I need to know the execution time
for each job. So if for example the job ID is 34 SLURM should create a
file named 34.time in which
Have you updated the slurmdb daemon first as described here:
http://schedmd.com/slurmdocs/quickstart_admin.html
/David
On Wed, Mar 6, 2013 at 5:58 PM, Lennart Karlsson
lennart.karls...@it.uu.sewrote:
Hi,
Today I upgraded SLURM from v 2.4.3 to v 2.5.3.
It seems like a mistake, because
This is the way select() works regardless of the version of redhat or any
other distribution.
The fd_set is a bit array of __FD_SETSIZE bits, declared in sys/select.h;
__FD_SETSIZE is defined as 1024 in bits/typesizes.h.
/David
On Tue, Mar 12, 2013 at 11:30 AM, Hongjia Cao hj...@nudt.edu.cn wrote:
When
Hi, the problem of memory over-subscription is discussed in 'man
slurm.conf'. Have a look at
DefMemPerCPU, DefMemPerNode and the suggested configuration when
using CR_CPU_Memory.
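A minimal sketch of that configuration; the parameter names are from slurm.conf, but the memory value is an example only and must be tuned to your nodes:

```
# grant 2048 MB per allocated CPU when a job requests no memory (example value)
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=2048
```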
/David
On Tue, Mar 12, 2013 at 3:15 PM, Joo-Kyung Kim supa...@gmail.com wrote:
Hi,
I am using
Is it possible the job runs on several nodes, say -N 3, then one node is
lost so it ends up running on 2 nodes only? Such a job should have been
submitted with --no-kill.
/David
On Fri, Mar 22, 2013 at 4:06 PM, Michael Colonno mcolo...@stanford.eduwrote:
Actually did mean node below.
Try to start it so it does not daemonize, I think the option is -D but better
check the man page; see if it core dumps.
sent from galaxy nexus
On Mar 27, 2013 9:31 AM, Pablo Sanz Mercado pablo.s...@uam.es wrote:
Hi Alejandro,
Sorry, the messages we obtain about the couldn't
The available slurm documentation can be found here:
http://slurm.schedmd.com
/David
On Thu, Apr 25, 2013 at 11:41 AM, David Bigagli da...@schedmd.com wrote:
Hi, slurm.conf has the following parameter as documented in the
slurm.conf man page:
max_job_bf
Hi,
the gstack command will show you the activities of each thread in the
slurmctld process.
This is an example:
david@prometeo ~$ gstack 14432
Thread 8 (Thread 0x7fa9c9190700 (LWP 14433)):
#0 0x0035b90acb8d in nanosleep () from /lib64/libc.so.6
#1 0x0035b90aca00 in sleep () from
Indeed currently there is no integration between Flexlm and SLURM, but some
ideas are being passed around about what to do about it. I am one of the
original designers and developers of Platform License Scheduler.
The item 1) you mentioned is certainly the first step but consider even
that may not be
of anything different, I, for one, would be happy to know.
Gary D. Brown
On Tue, Jul 2, 2013 at 8:38 AM, David Bigagli da...@schedmd.com wrote:
Indeed currently there is no integration between Flexlm and SLURM, but
some ideas are being passed around what to do about it. I am one of the
original
Hello,
it is a requirement to specify --mpi=pmi2, otherwise srun will
not load the pmi2 library
implementing the server-side pmi2 functionality.
There was an error in contribs/pmi2/pmi2_api.c causing the 'no value
for req' message; this was the
-1488 remaining_len -=
Hi,
this issue has been fixed in the 2.6.2 release.
On 09/10/2013 09:01 AM, Michael Gutteridge wrote:
We allow jobs to overrun their wall time via OverTimeLimit. We've
noticed that jobs that complete successfully but go over the wall time
are reported as having JobState=TIMEOUT in the job
Sounds good. :-) Thanks for the patch it is going to be in Slurm 2.6.3.
On 10/02/2013 12:28 AM, Mark Nelson wrote:
Hi All,
It does look like there is a bug in time_str2secs():
If we give it a time of format: days-0:min, we exit the for loop with
days set, but with min set to our hours value
Hi,
I don't know the details of the segfault but the code in question is
correct. If you decrease the length then the variable cmdlen:
cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
will not be correct and the wrong length will be sent to the pmi2 server.
This code is taken verbatim from
to reflect the reduced size of the c buffer.
On Oct 3, 2013, at 9:44 AM, David Bigagli da...@schedmd.com wrote:
Hi,
I don't know the details of the segfault but the code in question is
correct. If you decrease the length then the variable cmdlen:
cmdlen = PMII_MAX_COMMAND_LEN - remaining_len
Fixed. I chose the method proposed by Michael, subtract first, add
later. :-)
On 10/03/2013 10:29 AM, Ralph Castain wrote:
On Oct 3, 2013, at 10:16 AM, David Bigagli da...@schedmd.com wrote:
I am not saying that remaining_len is correct or that mpich is bugless :-) I am
only saying
,
/David/Bigagli
www.schedmd.com
voice: +1 415 320 2776
offline (-o), reset (-r), clear (-c), and -N (set
note/reason) command line options
* adds the -l (brief list) and -n (list with notes) command line options
* format the output in the default verbose list mode more like TORQUE's
pbsnodes does
--Troy
--
Thanks,
/David/Bigagli
is getting the job done. Is there a reason
that slurm_kill_job shouldn't be used?
Thanks
Michael
--
Thanks,
/David/Bigagli
www.schedmd.com
. The multithread patch (commit
17449c066af69441b741110ef51fc2f534272871) does not help. Replacing
hostlist_push with hostlist_push_host (commit
1b0b135f9579e253ddd5bf680d2ea70ad12f9bda) fixes the problem of sinfo,
but I think the root cause is in xmalloc.
;
for (i = 0; i < COUNT; i++) {
ptr = malloc(MAX_RANGES * sizeof(struct _range));
free(ptr);
}
END_TIMER;
info("malloc usecs: %lld", delta_t);
return 0;
}
On Thu, 2014-03-20 at 10:42 -0700, David Bigagli wrote:
Hi,
do you have any data
,
-J
output. There are very few format chars left so I just
picked a free one.
Thanks
Martins
numbered cores on a node only.
DTILICENSE_ONLY/IDD
: aborting, io error with
slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Launching with salloc/sbatch works.
- Anthony
Correction: the core file is in the log directory.
On 04/11/2014 12:08 PM, David Bigagli wrote:
Hi,
this Slurm bug has been fixed and it will be available in 14.03.1
which will be released soon. Otherwise it is available in the HEAD.
You should find a core file of slurmstepd
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=slurm_pass
StorageUser=slurm
StorageLoc=slurm_acct_db
on its own and not within an salloc, is no longer supported and expected
to fail?
Thanks
Martins
On 5/20/14 1:23 PM, David Bigagli wrote:
In 14.03 you should use the SallocDefaultCommand as documented in
http://slurm.schedmd.com/slurm.conf.html
to srun with the --pty option.
On 05/19/2014 10
}
dir_name = <value optimized out>
Remembering some previous problems I suspect that some uninitialised
variable in some structure (which represents some omitted option in
slurmd.conf) may cause such effect. Could someone please give me some
hints?
Thanks!
you do to run an interactive job array ? By interactive, I
mean that the command only exit at the end of the array ?
Regards,
Julien
showing that he effectively used 162 cpu seconds (81 x 2).
Thank you,
Robert
On 8/5/2014 9:53 AM, David Bigagli wrote:
Hi Robert,
the first line is the allocation and the second the batch
step, the batch step runs on one cpu. I am not sure what are the
columns with values 162 and 81
of
the batch step higher than either of the two job steps?
Thank you,
Robert
On 8/5/2014 1:41 PM, David Bigagli wrote:
Yes that is correct. The first entry is the allocation which has 2
cpus, -n 2 was specified, the second entry is the batch step that run
for 81 seconds, so the total cpu time used
Slurm version and same kernel.
Cheers,
Slurm User Group Meeting
September 23-24, Lugano, Switzerland
Find out more http://slurm.schedmd.com/slurm_ug_agenda.html
www.schedmd.com
Interesting indeed. Let me have a look at it and experiment with it a bit.
On 08/13/2014 04:16 PM, Kilian Cavalotti wrote:
On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli da...@schedmd.com wrote:
For some reason at the first attempt rmdir(2) returns EBUSY.
Would writing
Unfortunately the article refers to the memory sub system which gets
removed without problem. The issue happens on the freezer, however it is
just an error message without consequences.
On 08/13/2014 04:16 PM, Kilian Cavalotti wrote:
On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli da
. The slurm.conf line reads:
JobSubmitPlugins=default
in compliance with the documentation.
Thanks for your help
Eva
Köln, Von-der-Wettern-Strasse 27
Telefon: +49 (0) 2203 305-0
Telefax: +49 (0) 2203 305-1699
http://www.bull.de
Bull, an Atos company
** Folgen Sie uns auf Twitter: http://twitter.com/bull_de
** Bull Firmenprofil bei XING: https://www.xing.com/companies/bullgmbh
[2014-09-24T13:47:52.151] error: cannot find job_submit plugin for
job_submit/defaults
[2014-09-24T13:47:52.151] error: cannot create job_submit context for
job_submit/defaults
[2014-09-24T13:47:52.151] fatal: failed to initialize job_submit plugin
On Tue, 16 Sep 2014, David Bigagli wrote
process
resides given a known job_id/step_id?
Thanks,
Andrew
I think the question was about the submission node, the node where the
srun/sbatch was executed from.
On 10/10/2014 04:14 PM, Franco Broi wrote:
Are we talking about alloc_node? You can retrieve it using the perl api.
On 11 Oct 2014 06:53, David Bigagli da...@schedmd.com wrote:
Hi
of the database? I have been grepping through
the source tree, but I haven't stumbled on the script that creates the
tables and columns needed.
~Charles~
:
On 10/31/2014 12:49 PM, David Bigagli wrote:
The database is created by the slurmdbd daemon. Have you granted
access to the database to the slurm user?
Yes, I did a
grant all on slurm_acct_db.* TO 'slurm'@'localhost';
srun directly as we get poor scaling, the next
thing in the list (after SC14) is to migrate to Open-MPI 1.8.4 which
is due out shortly which should address this.
cheers,
Chris
I agree. Done in commit c8f34560c87cfbbf.
On 11/06/2014 06:46 PM, Christopher Samuel wrote:
On 07/11/14 11:53, David Bigagli wrote:
Hi,
Hiya David,
it used to be logged at debug level in 2.6 and now it is an error. This
seems to be an issue with cgroups which does not allow that path
() is a stringly typed interface, and it would be nice if
I could use an interface with stronger types.
Cheers,
Walter Landry
,
Department for Research Computing, University of Oslo
what amount of time within a QOS.
sacct can give me information on an account level but I can't seem to
get it to report on a QOS level on a user by user bases.
Thanks
Jackie
for us that have working cgroups memory limits.
Best regards,
Magnus Jonsson
)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=./test.sh
WorkDir=/home/adm17
StdErr=/home/adm17/test-e%j.txt - here %j is not expanded
StdIn=/dev/null
StdOut=/home/adm17/test-o12032.txt
Regards,
Uwe
Yes it works with or without --unbuffered. I don't think data are
buffered inside of Slurm.
On 03/09/2015 10:15 AM, Lipari, Don wrote:
-Original Message-
From: David Bigagli [mailto:da...@schedmd.com]
Sent: Thursday, March 05, 2015 10:49 AM
To: slurm-dev
Subject: [slurm-dev] Re
unloaded.
Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
The slurm.spec file decides whether to install the init.d scripts or the
systemd units.
On 03/24/2015 07:24 PM, Fred Liu wrote:
-Original Message-
From: David Bigagli [mailto:da...@schedmd.com]
Sent: Wednesday, March 25, 2015 1:19
To: slurm-dev
Subject: [slurm-dev] Re: successful systemd
people have reported strange interactions between Slurm
being on an NFSv4 mount (NFSv3 is fine).
Good luck!
Chris
by a
comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
Restarted jobs will have the environment variable
*SLURM_RESTART_COUNT* set to the number of times the job has been
restarted.
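A batch script can branch on that variable; the function name and messages below are made up, only SLURM_RESTART_COUNT itself comes from Slurm:

```shell
# Sketch: decide whether this invocation is a fresh start or a requeue.
# Slurm sets SLURM_RESTART_COUNT on requeued jobs; it is unset (or 0)
# on the first run.
restart_msg() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "restart number ${SLURM_RESTART_COUNT}"
    else
        echo "first run"
    fi
}
```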
-Paul Edmon-
trying to figure out why it sent them into a held
state as opposed to just simply requeueing as normal. Thoughts?
-Paul Edmon-
On 03/03/2015 12:11 PM, David Bigagli wrote:
There are no default values for these parameters, you have to
configure your own. In your case does the prolog fail or the node
where it couldn't resolve
user id's. So right after the job tried to launch it failed and
requeued. We just let the scheduler do what it will when it lists
Node_fail.
-Paul Edmon-
On 03/03/2015 01:20 PM, David Bigagli wrote:
How do you set your node down? If I run a job and then issue
fine).
Problem number 2:
'scontrol show jobs' shows jobs in state RUNNING that don't actually
appear to exist. Some of these are days old. What might be going on here?
--
Jon Nelson
Dyn / Senior Software Engineer
p. +1 (603) 263-8029
of these issues.
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS NBE
+358503841576 tel:%2B358503841576 || janne.blomqv...@aalto.fi
mailto:janne.blomqv...@aalto.fi
but to search for the
library name looks like a reasonable start to me.
I would hope you can help me with this.
Thanks,
Ulf
, if I go and create a
new qos, all users instantly can utilize this qos. This is very strange,
I wonder if some setting has been munged in the database somewhere
mistakenly? Any ideas? Thanks.
Best,
Chris
the file actually being there. I've
even set extended ACL's on the directory so that the SlurmUser can see
all of the files (sudo -u slurm ls -lR failed with permission denied).
Could anyone tell me why the slurm_script file cannot be read via prolog?
Thank you!
John DeSantis
with any of the changes I made. Is this
a known problem with some kind of workaround, or should I file a bug on it?
Eric
Is there a way to limit paging outside of Slurm? There are memory limits in
Slurm but no paging limit.
There is a backup controller in Slurm, you can read about it here:
http://slurm.schedmd.com/slurm.conf.html
Thanks
/David/Bigagli
da...@schedmd.com
module is not a command in /bin. To make it work you have to source the module
startup file in your script, for example:
. /usr/local/Modules/3.2.10/init/bash
then you can use the module file.
On Jun 29, 2015, at 4:43 PM, Antonia Mey antonia@gmail.com wrote:
Dear all,
I may have a
I think that Hydra should keep 0,1,2 open and dup them to /dev/null so that
any children's file descriptors will be greater than 2. This is the standard
Unix way.
Fixed in 15.08.2.
commit 30a5d6778fc86f8799cefc4fbea4f9ae7eac8d92
Author: Hongjia Cao <hj...@nudt.edu.cn>
Date: Wed Oct 7 15:05:24 2015 +0200
Thanks for your contribution.
On 10/07/2015 12:15 PM, Hongjia Cao wrote:
attached.
Those pmi2 files are the server side of the pmi2 protocol implemented in
slurmstepd; those are always installed. It is the client side that gets
installed from the contribs directory.
Can you show us the stack using gdb ?
===
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html
> On 08 Sep 2015, at 16:45, Mar
y to 4. The minimum index
value is 0. The maximum value is one less than the configuration
parameter MaxArraySize.
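For example (the script name is hypothetical):

```
# a four-task array, valid indices 0-3
sbatch --array=0-3 job.sh
# each task reads its own index from SLURM_ARRAY_TASK_ID
```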