I have made a patch for SPANK that lets my SPANK plugin fetch
SLURM_RESTART_COUNT.
The patch is attached (against 2.6.6).
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff a/slurm/spank.h b/slurm/spank.h
--- a/slurm/spank.h
+++ b/slurm/spank.h
?
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
smime.p7s
Description: S/MIME Cryptographic Signature
sed on events that actually affect the
scheduling of the queue?
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
On 2014-05-20 14:54, Tommi T wrote:
On Tuesday, May 20, 2014 1:51 PM, Magnus Jonsson wrote:
Hi!
While investigating another matter, I found that lots of running jobs
with short job steps kill backfill scheduling very effectively.
Hi,
Do you use the bf_continue flag?
[2] http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.2.6.4
[3] http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.2.6.4
[4] http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.14.03.7
[5] http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.14.03.7
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
Is no one else affected by this?
/Magnus
On 2014-09-11 14:46, Magnus Jonsson wrote:
Hi!
A user found a "strange" new behaviour when using --exclusive with srun.
I have an example submit-script[1] that shows this.
I have tested this on 2.6.4 with the output [2] & [3] (stderr)
problem after that then
indeed something else is the matter. Perhaps routing tables or
something else.
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
,
Magnus Jonsson
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 29f730d..cb85598 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1046,6 +1046,9 @@ Exclude shared memory from accounting.
.TP
was changed, but I'm pretty sure it worked in
2.6 (that was when we developed our tmpdir SPANK plugin).
"SLURM_RESTART_COUNT" is available in the job user environment.
/Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
spank_get_item(sp, S_SLURM_RESTART_COUNT, &restart_count);
Hope that helps!
Sent from my iPhone
On Feb 27, 2015, at 8:14 AM, Magnus Jonsson wrote:
It seems that the restart count in SPANK (prolog) is missing in recent versions
of Slurm.
It always returns 0 even if the job has restarted.
hange
the priority at all?
The user can use the 'nice' option to alter the priority of a job within
a small limit that does not alter the priority as defined above.
Please let me be wrong :-)
/Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
le jobscript for
#SBATCH statements?
Assume I have the following jobscript "job1.sh":
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=job1
srun -l echo "slurm jobid $SLURM_JOB_ID named: $SLURM_JOB_NAME"
cat > job2.sh <
--
Magnus Jonsson, Develo
use "srun cp ${PATH_TO_FILES}/* $TMPDIR/"
Also this has the side effect that the prolog on the node is not run
until you actually send a job to the node.
I.e. you can send data to a node with sbcast before the prolog has run;
this might not be expected or wanted behaviour.
Best,
Magnus
--
perspective I wouldn’t want this, because users could misuse this
feature. But from a user perspective I could genuinely have some
dependencies that I would like to have addressed before beginning my
batch of thousands of jobs.
Any help here is greatly appreciated.
Regards,
Amit
--
Magnus Jon
t now)?
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
0
sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
     7364851   00:00:00         16          0
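For reference, the sacct documentation describes CPUTimeRAW as elapsed time multiplied by the CPU count, in CPU-seconds, so a row with Elapsed 00:00:00 reports 0 no matter how many CPUs were allocated. The sketch below only illustrates that arithmetic; the helper names are hypothetical and this is not sacct's code. It assumes an "HH:MM:SS" Elapsed string and does not handle the "D-HH:MM:SS" form.

```c
#include <stdio.h>

/* Parse an sacct-style Elapsed string "HH:MM:SS" into seconds.
 * Returns -1 on malformed input; "D-HH:MM:SS" is not handled here. */
static long long elapsed_to_seconds(const char *elapsed)
{
    int h, m, s;
    if (sscanf(elapsed, "%d:%d:%d", &h, &m, &s) != 3 || h < 0 || m < 0 || s < 0)
        return -1;
    return (long long)h * 3600 + (long long)m * 60 + s;
}

/* CPUTimeRAW as documented: elapsed seconds times allocated CPUs. */
static long long cpu_time_raw(const char *elapsed, unsigned alloc_cpus)
{
    long long secs = elapsed_to_seconds(elapsed);
    return secs < 0 ? -1 : secs * (long long)alloc_cpus;
}
```

With the values from the output above, `cpu_time_raw("00:00:00", 16)` is 0, while one hour on 16 CPUs would give 57600.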
/Magnus
On 2016-03-23 09:29, Magnus Jonsson wrote:
Hi!
From this simple example could someone explain to me if this is the
expected behaviour or a bug?
$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for reso
rators can submit jobs to those
certain nodes to perform some tests, which might be disturbed by users
submitting their jobs to those nodes. Various search engines didn't
offer answers to my question, which is why I'm writing to you here.
Looking forward to some answers!
Best,
Felix Willenborg
I can fix this somehow; otherwise I
can bug-test patches.
I have a small part of our cluster available for testing right now
(2 nodes, 8 sockets/node, 6 cores/socket).
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
Err... Wrong...
On 2013-01-18 13:59, Magnus Jonsson wrote:
Hi!
I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird
behaviour.
Currently running git/master but have seen the same behaviour on 2.4.3
with the #define.
My slurm.conf:
SelectType=select/con
This patch fixes the behaviour with allocating 2 cores instead of one
with --ntasks-per-socket=1.
/Magnus
I have CR_ALLOCATE_FULL_SOCKET working correctly on block allocation.
I will fix cyclic after the weekend and supply a patch.
Best regards,
Magnus
src/common/slurm_errno.c). We have also
discussed adding a mechanism to return an arbitrary string to the
user, but this is not possible today.
Quoting Magnus Jonsson :
Hi!
I'm looking for a way to look at users' submitted parameters and, if
they are using them in a "bad" way, inform them that this
,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/src/common/read_config.c b/src/common/read_config.c
index 2a54f69..b8d981b 100644
--- a/src/common/read_config.c
+++ b/src/common/read_config.c
@@ -903,6 +903,7 @@ static int _parse_partitionname(void **dest
d historic reason for that.
/Magnus
On 2013-02-07 16:42, Aaron Knister wrote:
That's awesome! (How) does it handle the case of nodes in multiple partitions?
Sent from my iPhone
On Feb 7, 2013, at 8:24 AM, Magnus Jonsson wrote:
Hi everybody!
Attached is a patch that enables
ay have practical reasons, but that's not why we
do it"
-- Richard P. Feynman
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
es: eval_nodes:0 consec c=48 n=1 b=0
e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass - idle
resources found
[2013-02-12T14:54:48+01:00] no job_resources info for job 241
[2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints
8<---
--
Magnus Jonsso
ock?
Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default?
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
Hi!
This does not make a difference.
And according to the man page, I don't think it should either.
/Magnus
On 2013-02-15 18:32, Moe Jette wrote:
Have you tried the --ntasks-per-socket or --ntasks-per-core options?
Quoting Magnus Jonsson :
Hi!
I have noticed strange behaviour in the task
fig generator.
Any pointers would be greatly appreciated; I'm out of ideas...
Thanks
Michael
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
--
Have you tried the --ntasks-per-socket or --ntasks-per-core options?
Quoting Magnus Jonsson :
> Hi!
>
> I have noticed strange behaviour in the task/affinity plugin if I
> use --cpu_bind=socket and -c > 1.
>
> My tasks are distributed one on each sock
Hi!
I just found a bug in Slurm that causes a buffer overflow when you
run 'scontrol show config'.
Patch attached to fix the problem.
/Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/src/common/slurm_protocol_defs.c b/src/common/slurm_protocol_de
idle nodes, short jobs will not
start because of this.
I have made a patch for backfill with a configuration option
(bf_continue) to let backfill continue from the last JobID of the last
cycle.
This will make backfill look at the whole queue eventually.
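To illustrate why resuming helps, here is a deliberately simplified model, not the backfill plugin's code. It assumes each cycle can examine only a fixed `budget` of jobs before it is interrupted (standing in for the frequent job-step activity described earlier) and tracks a queue position rather than the JobID that the actual patch resumes from. `backfill_cycle` and `count_seen` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* One simulated backfill cycle over a queue of `queue_len` jobs in
 * priority order. At most `budget` jobs are examined. Without the
 * continue flag, every cycle starts at the head of the queue; with it,
 * the cycle resumes where the previous one stopped. */
static void backfill_cycle(size_t queue_len, size_t budget, bool continue_flag,
                           size_t *resume_idx, bool *seen)
{
    size_t i = continue_flag ? *resume_idx : 0;
    size_t examined = 0;

    while (examined < budget && i < queue_len) {
        seen[i] = true;   /* this job was considered for backfill */
        i++;
        examined++;
    }
    /* After reaching the end of the queue, wrap to the head next time. */
    *resume_idx = (i >= queue_len) ? 0 : i;
}

/* How many distinct jobs have been examined so far. */
static size_t count_seen(const bool *seen, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (seen[i])
            c++;
    return c;
}
```

With a 10-job queue and a budget of 3, restarting from the head means only the first 3 jobs are ever examined, however many cycles run. Resuming means every job is reached within 4 cycles.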
Best regards,
Magnus
--
Magnus Jo
speed with that.
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
d again).
Does anybody have experience with the case where job (or some script)
checks some condition periodically and stays in the queue if the condition
has not been met yet?
--
Taras
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
, this solution probably will work for us as well.
Also, when a user does not use the -L option, then this could be checked (I
believe) in contribs/lua/job_submit.lua with a few lines of code (in the
slurm_job_submit function).
--
Taras
On 03/08/2013 09:37 AM, Magnus Jonsson wrote:
We have solved this by u
bitstring to count the number of set bits in
a range (bit_set_count_range), and made a minor improvement to
bit_set_count while reviewing the range version.
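The general technique for counting set bits in a range can be sketched as follows. This is a generic word-at-a-time version, not Slurm's bit_set_count_range(): it assumes a plain array of 64-bit words with bit i stored in word i / 64 at position i % 64, and `count_bits_range` is a hypothetical name. Whole words are counted directly and only the partial first and last words are masked.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable population count: each step clears the lowest set bit. */
static int popcount64(uint64_t w)
{
    int n = 0;
    while (w) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Count set bits with index in [start, end). */
static int count_bits_range(const uint64_t *words, size_t start, size_t end)
{
    int count = 0;
    if (start >= end)
        return 0;

    size_t first = start / 64, last = (end - 1) / 64;
    for (size_t w = first; w <= last; w++) {
        uint64_t word = words[w];
        if (w == first)                       /* drop bits below start */
            word &= ~UINT64_C(0) << (start % 64);
        if (w == last && end % 64 != 0)       /* drop bits at or above end */
            word &= ~UINT64_C(0) >> (64 - end % 64);
        count += popcount64(word);
    }
    return count;
}
```

On GCC or Clang, `popcount64` could be replaced with `__builtin_popcountll`. The loop version is used here only to keep the sketch portable.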
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff -ru site/src/common/bitstring.c amd64_ubuntu1004/src/common
nk_ldom sh -c "hwloc-bind --get |
./hex2bin"
results in:
00 00 01 01 01 01 01 01 = 0x41041041
This also looks like the bitmask that task/affinity gets from Slurm.
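The mask 0x41041041 has bits 0, 6, 12, 18, 24 and 30 set. The sketch below groups a mask into per-socket fields so that pattern is easier to read. It assumes the 8 sockets × 6 cores nodes mentioned elsewhere in this thread and socket-major CPU numbering (CPUs 0-5 on socket 0, 6-11 on socket 1, ...), which may not match a node's actual hwloc numbering. `format_mask_by_socket` is a hypothetical stand-in, not the hex2bin script used above, whose exact output format is not shown.

```c
#include <stdint.h>

/* Write `mask` as `sockets` groups of `cores_per_socket` binary digits,
 * highest-numbered CPU first, separated by spaces. `out` must hold
 * sockets * (cores_per_socket + 1) bytes. At most 64 CPUs are handled. */
static void format_mask_by_socket(uint64_t mask, int sockets,
                                  int cores_per_socket, char *out)
{
    char *p = out;

    if (sockets <= 0 || cores_per_socket <= 0 ||
        sockets * cores_per_socket > 64) {
        *out = '\0';
        return;
    }
    for (int s = sockets - 1; s >= 0; s--) {
        for (int c = cores_per_socket - 1; c >= 0; c--) {
            int bit = s * cores_per_socket + c;   /* socket-major numbering */
            *p++ = ((mask >> bit) & 1) ? '1' : '0';
        }
        *p++ = (s > 0) ? ' ' : '\0';   /* separator, or final terminator */
    }
}
```

Under these assumptions, 0x41041041 formats as `000000 000000 000001 000001 000001 000001 000001 000001`, that is, the first core of each of sockets 0 to 5.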
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
As I understand it, this will give the wrong input to the fair-share
scheduler and result in the wrong priority (too high) for the user.
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
buffer. But this might need to be looked
into in more depth.
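A common pitfall with strftime() is that it returns 0 when the formatted result, including its terminating NUL, does not fit in the buffer, and the buffer contents are then indeterminate. The sketch below shows one defensive pattern under stated assumptions. It is a hypothetical helper, not slurm_make_time_str() or the attached patch, and it uses gmtime() with a fixed ISO-style format only to keep the example deterministic.

```c
#include <string.h>
#include <time.h>

/* Format `t` (as UTC) into `buf`. If strftime() reports that the result
 * did not fit (return value 0), do not trust the buffer; write a
 * fallback string instead, truncated to `size` if necessary. */
static void make_time_str_safe(time_t t, char *buf, size_t size)
{
    struct tm *tm;

    if (size == 0)
        return;
    tm = gmtime(&t);
    if (tm == NULL || strftime(buf, size, "%Y-%m-%dT%H:%M:%S", tm) == 0) {
        strncpy(buf, "Unknown", size - 1);
        buf[size - 1] = '\0';   /* strncpy may not terminate */
    }
}
```

With a 32-byte buffer, time 0 becomes "1970-01-01T00:00:00". With an 8-byte buffer, the 19-character result does not fit, so the fallback string is written instead of leaving whatever strftime() happened to produce.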
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff -ru site/src/common/parse_time.c amd64_ubuntu1004/src/common/parse_time.c
--- site/src/common/parse_time.c 2013-06-05 21:43:00.0 +0200
+++ amd
be a better solution than changing the code in various places. See
attached variation of your patch.
Moe
Quoting Magnus Jonsson :
Hi!
We found an issue in sacct that we pinned down to a strftime call in
'src/common/parse_time.c' (slurm_make_time_str).
Reproducible with (in 2.5.{
Why does slurm complain? I HAVE set CR_Core and I even checked the
spelling.
Thanks
Eva
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
uired.
After reverting the params.max_cpus code, I get the expected behaviour.
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
e expected start times
that are that far in the future should not be allowed to appear.
If the start time is more than a week or two in the future, the
start time will probably not be that accurate anyway.
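The suggested policy could look roughly like this. It is a hypothetical sketch, not existing Slurm code: `should_report_start` and its `horizon_secs` parameter are illustrative names, and the two-week horizon in the usage note comes from the post rather than from any Slurm default.

```c
#include <stdbool.h>
#include <time.h>

/* Report an expected start time only if it lies within `horizon_secs`
 * of `now`; estimates further out are treated as too unreliable. */
static bool should_report_start(time_t now, time_t expected_start,
                                double horizon_secs)
{
    if (expected_start <= now)
        return true;   /* already due or starting now */
    return difftime(expected_start, now) <= horizon_secs;
}
```

For example, with a horizon of `14.0 * 24 * 3600` seconds, an estimate one hour away would be shown, while one 30 days away would be suppressed.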
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
mctld[28426]: _slurm_rpc_job_step_create for
job 1313514: More processors requested than permitted
Oct 4 15:23:03 t-mn02 slurmctld[28426]: completing job 1313514
Oct 4 15:23:03 t-mn02 slurmctld[28426]: sched: job_complete for
JobId=1313514 successful, exit code=256
--
Magnus Jonsson, Developer, H
but this might also confuse the users.
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
r more information about how the job was submitted.
We are currently running version 2.6.3.
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
JobId=1603907 Name=submit_e
UserId=magnus(2066) GroupId=folk(3001)
Priority=658834 Account=sysop QOS=normal
JobState=R
case
srun -l -N2 --tasks-per-node=2 hostname
1: trek0
0: trek0
2: trek1
3: trek1
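The output above places tasks 0 and 1 on trek0 and tasks 2 and 3 on trek1, which is the simple block layout: consecutive tasks fill one node before moving to the next. The sketch below models only that even case, with a cyclic mapping for comparison. It is not Slurm's task-distribution code, and the function names are hypothetical.

```c
/* Block layout: task -> node when each node gets `tasks_per_node`
 * consecutive tasks. */
static int block_node_for_task(int task, int tasks_per_node)
{
    return task / tasks_per_node;
}

/* Cyclic layout for comparison: tasks are dealt round-robin over
 * `nnodes` nodes. */
static int cyclic_node_for_task(int task, int nnodes)
{
    return task % nnodes;
}
```

With 2 tasks per node, the block mapping sends tasks 0 and 1 to node 0 (trek0) and tasks 2 and 3 to node 1 (trek1), matching the output. A cyclic layout would instead alternate trek0, trek1, trek0, trek1.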
-Original Message-
From: Magnus Jonsson [mailto:mag...@hpc2n.umu.se]
Sent: Wednesday, February 19, 2014 1:28 AM
To: slurm-dev
Subject: [slurm-dev] --exclusive together with --ntasks-per-node not worki