[slurm-dev] slurmctld causes slurmdbd to seg fault

2017-10-17 Thread Loris Bennett

Hi,

We have been having some problems with NFS mounts via Infiniband getting
dropped by nodes.  We ended up switching our main admin server, which
provides NFS and Slurm, from one machine to another.

Now, however, if slurmdbd is started, as soon as slurmctld starts,
slurmdbd seg faults.  In the slurmdbd.log we have

  slurmdbd: error: We have more allocated time than is possible (7724741 > 7012800) for cluster soroban(1948) from 2017-10-17T16:00:00 - 2017-10-17T17:00:00 tres 1
  slurmdbd: error: We have more time than is possible (7012800+36720+0)(7049520) > 7012800 for cluster soroban(1948) from 2017-10-17T16:00:00 - 2017-10-17T17:00:00 tres 1
  slurmdbd: Warning: Note very large processing time from hourly_rollup for soroban: usec=46390426 began=17:08:17.777
  Segmentation fault (core dumped)

and the corresponding output of strace is

  fstat(3, {st_mode=S_IFREG|0600, st_size=871270, ...}) = 0
  write(3, "[2017-10-17T17:09:04.168] Warnin"..., 132) = 132
  +++ killed by SIGSEGV (core dumped) +++

We're running 17.02.7.  Any ideas?
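If it would help, I can also try to get a backtrace out of the core dump,
something along these lines (the binary and core file paths are only guesses
for a typical install):

  gdb /usr/sbin/slurmdbd ./core
  (gdb) bt full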

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: file and directory permissions

2017-10-11 Thread Loris Bennett

Hi Marcus,

Marcus Wagner <wag...@itc.rwth-aachen.de> writes:

> Hello, everyone.
>
> I'm also fairly new to slurm, still in a conceptual rather than a test or
> productive phase. Currently I am still trying to find out where to create 
> which
> files and directories, on the host or in a network directory.
> I'm a little confused about the description in the manpage of slurm.conf.
> For example, the JobCheckpointDir should be accessible from both the primary 
> and
> backup controller. Now it is clear (at least I believe) that this has to be 
> done
> in the NCCR, for example. If the primary controller goes down, the backup
> controller must be able to access it.
> On the other hand, SlurmctldPidFile should also be available on both the 
> primary
> and backup controller. Since it is usually in /var/run, I assume that this
> should be a local path. It should also be unique on every controller.
> The manpage is not quite clear in its description.

Your understanding is correct.

> What about the SlurmctldLogFile, for example? Theoretically, both
> could write to the same file.

We have everything local except for the config files and the statesave location.
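Roughly speaking, that looks like this (the paths below are illustrative
rather than our actual ones):

  # local on each controller:
  SlurmctldPidFile=/var/run/slurmctld.pid
  SlurmctldLogFile=/var/log/slurm/slurmctld.log
  # shared, must be reachable from both controllers:
  StateSaveLocation=/shared/slurm/statesave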

> If anyone has an advice or would like to tell me how it was solved on your 
> site,
> I would be very happy.
>
>
> best
> Marcus

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Loris Bennett

Hi Elisabetta,

Elisabetta Falivene <e.faliv...@ilabroma.com> writes:

> Upgrading Slurm 
>
> Thank you all for useful advices!
>
> So the 'jump' should not be a problem if there are no running jobs
> (which is my case as you guessed). Surely I'll report how it went
> doing it. I would like to do some test on a virtual machine, but
> really can't imagine how to replicate the exact situation of a 7Tb
> cluster locally...
>
> Just some other questions. How would you do the upgrade in the safest
> way? Letting aptitude do its job? Would you go to Debian 9? And must the
> nodes be upgraded in the same way, one by one?

If no jobs are running, I would just let aptitude get on with it.

If there are no other reasons not to, I would upgrade to Debian 9.  In
this case, your version of Slurm will be 16.05 and thus not too old.

> Let's think about the worst case: the upgrade nukes Slurm. I don't really
> know this machine's configuration well. Would you back up something
> else besides the database before upgrading?

The only other thing I back up is the statesave directory, but this is only
interesting if you are upgrading while jobs are running.  In your case,
only the database is worth backing up, and even then, that's only really
interesting if you need the old data for statistical purposes, or you
need to maintain, say, fairshare information across the upgrade.
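For the record, dumping the database is just something like the following
(database name assumed to be the default slurm_acct_db):

  mysqldump -u slurm -p slurm_acct_db > slurm_acct_db_$(date +%F).sql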

In bocca al lupo!

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Loris Bennett

Hi Elisabetta,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> On 10/03/2017 03:29 PM, Elisabetta Falivene wrote:
>> I've been asked to upgrade our slurm installation. I have a slurm 2.3.4 on a
>> Debian 7.0 wheezy cluster (1 master + 8 nodes). I've not installed it so I'm 
>> a
>> bit confused about how to do this and how to proceed without destroying
>> anything.
>>
>> I was thinking to upgrade at least to Jessie (Debian 8) but what about Slurm?
>> I've read carefully the upgrading section
>> (https://slurm.schedmd.com/quickstart_admin.html) of the doc, reading that 
>> the
>> upgrade must be done incrementally and not jumping from 2.3.4 to 17, for
>> example.
>
> Yes, you may jump max 2 versions per upgrade.
> Quoting https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
>> Slurm daemons will support RPCs and state files from the two previous minor
>> releases (e.g. a version 16.05.x SlurmDBD will support slurmctld daemons and
>> commands with a version of 16.05.x, 15.08.x or 14.11.x). 
>
>
>> Still it is not clear to me precisely how to do this. How would you proceed if
>> asked to upgrade a cluster you just don't know nothing about? What would you
>> check? What version of o.s. and slurm would you choose? What would you 
>> backup?
>> And how would you proceed?
>>
>> Any info is gold! Thank you
>
> My 2 cents of information:
>
> My Slurm Wiki explains how to upgrade Slurm on CentOS 7:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> Probably the general method is the same for Debian.

Ole's pages on Slurm are indeed very useful (Thanks, Ole!).  I just
thought I'd point out that the limitation on only upgrading by 2 major
versions is for the case that you are upgrading a production system and
don't want to lose any running jobs.  If you are upgrading the whole
operating system, you are probably planning a downtime anyway and so
there won't be any such jobs.  In this case, there shouldn't in theory
be a problem - although I must admit that I wouldn't be that surprised
if converting the database from 2.3.4 to, say, 17.02.7 didn't go 100%
smoothly.  However, Debian users who just rely on Debian packages are
always going to face this problem of large version jumps between Debian
releases, and so it would be useful for the community to know how well
this works.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Job stuck in CONFIGURING, node is 'mix~'

2017-09-26 Thread Loris Bennett

Marcin Stolarek <stolarek.mar...@gmail.com> writes:

> Re: [slurm-dev] Re: Job stuck in CONFIGURING, node is 'mix~' 
>
> I think that all you needed was to set the node state to DOWN/FAIL and
> then RESUME without actually rebooting the node. Did you try this? I
> remember that in FAQ this was used for jobs stacked in CG state.

I think I may have tried this, but without success.  However, due to
another issue I was forced to reboot the server running slurmctld and
the problem is now resolved.  Incidentally, it also solved my problem
that idle nodes were not being powered down.

So I guess

  "Have you tried turning it off and then on again?"

is still often a valid suggestion.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Accounting using LDAP ?

2017-09-20 Thread Loris Bennett

Hi Chris,

Christopher Samuel <sam...@unimelb.edu.au> writes:

> On 20/09/17 15:53, Loris Bennett wrote:
>
>> Having said that, the only scenario I can see being easily automated is
>> one where each user only has one association, namely with their Unix
>> group, and everyone has equal shares.  This is our set up, but as soon
>> as you have, say, users with multiple associations and/or membership in some
>> associations confers more shares, automation becomes very difficult.
>
> The user management system we use adds/removes users to accounts (which
> map to projects in our lingo) whenever a user is added/removed to a
> project as well as creating/deleting them.  Users can change their
> default project which changes their default account in Slurm.

Is the user management system homegrown or something more generally
available?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Accounting using LDAP ?

2017-09-19 Thread Loris Bennett

Christopher Samuel <sam...@unimelb.edu.au> writes:

> On 20/09/17 03:03, Carlos Lijeron wrote:
>
>> I'm trying to enable accounting on our SLURM configuration, but our
>> cluster is managed by Bright Management which has its own LDAP for users
>> and groups.   When setting up SLURM accounting, I don't know how to make
>> the connection between the users and groups from the LDAP as opposed to
>> the local UNIX.
>
> Slurm just uses the host's NSS config for that, so as long as the OS can
> see the users and groups then slurmdbd will be able to see them too.
>
> *However*, you _still_ need to manually create users in slurmdbd to
> ensure that they can run jobs, but that's a separate issue to whether
> slurmdbd can resolve users in LDAP.
>
> I would hope that Bright would have the ability to do that for you
> rather than having you handle it manually, but that's a question for Bright.

Our version, Bright Cluster Manager 5.2, doesn't have any features to help set up
accounting in Slurm, but then again it's a pretty old version and things
may have changed.

Having said that, the only scenario I can see being easily automated is
one where each user only has one association, namely with their Unix
group, and everyone has equal shares.  This is our set up, but as soon
as you have, say, users with multiple associations and/or membership in some
associations confers more shares, automation becomes very difficult.
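Just to be concrete, in that simple case the per-user automation boils
down to little more than the following (account and user names invented):

  sacctmgr -i add account dept01 Description="Department 01"
  sacctmgr -i add user alice Account=dept01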

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Does powering down as suspend action still work?

2017-09-19 Thread Loris Bennett

Hi,

Can any one confirm that powering off nodes still works as a suspend
action in 16.05 and/or 17.02?

Cheers,

Loris

BTW, the example of slurmctld logging contains the line:

  [May 02 15:31:25] Power save mode 0 nodes

Given that the code now reads

  if (((now - last_log) > 600) && (susp_total > 0)) {
          info("Power save mode: %d nodes", susp_total);

I assume that the line shown above can no longer appear.

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Suspend stopped working - debug flag?

2017-09-19 Thread Loris Bennett

Hi,

We have been powering down idle nodes for many years now.  However, at
some point recently, this seems to have stopped working.  I can't
pinpoint exactly when the problem started, as the cluster is usually
full and so the situation in which nodes should be powered down doesn't
occur very often.

To try to debug the problem I have set

  DebugFlags=Power

but don't get any logging information about node suspension.  The man
page for 'slurm.conf' says that this provides debugging for

  Power management plugin

but does this refer to the mechanism for capping the power used by
nodes?  If so, what debug flags should I be using?
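For context, the relevant part of our slurm.conf looks roughly like this
(the programs and timings are illustrative rather than our exact values):

  SuspendProgram=/usr/local/sbin/node_poweroff
  ResumeProgram=/usr/local/sbin/node_poweron
  SuspendTime=600
  SuspendTimeout=120
  ResumeTimeout=600
  DebugFlags=Power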

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Job stuck in CONFIGURING, node is 'mix~'

2017-09-19 Thread Loris Bennett

Loris Bennett <loris.benn...@fu-berlin.de> writes:

> Hi,
>
> I have a node which is powered on and to which I have sent a job.  The
> output of sinfo is
>
>   PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
>   test   up 7-00:00:00  1   mix~ node001
>
> The output of squeue is
>
> JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
>   1795993  test 7_single loris CF  24:29  1 node001
>
> I don't understand the node state 'mix~'.  If at all, I would only
> expect it to exist very briefly between 'idle~' and 'mix#'.  The '~' is
> certainly incorrect, as the node is not in a power-saving state, which
> in our case is powered-off.
>
> This problem may have existed in 16.05.10-2, but currently we are using
> 17.02.7. All other nodes in the cluster apart from one are functioning
> normally.
>
> Does anyone have any idea what we might be doing wrong?

I still don't know what the problem was, but I got the node back into a
sensible state by setting the state to FAIL, rebooting the node, and
then setting the state to RESUME.
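For the record, the sequence was essentially (node name as in the original
post):

  scontrol update NodeName=node001 State=FAIL Reason="stuck in mix~"
  # reboot node001, then:
  scontrol update NodeName=node001 State=RESUME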

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Behaviour of Partition setting MaxTime

2017-09-18 Thread Loris Bennett

Hi Greg,

Greg Wickham <greg.wick...@kaust.edu.sa> writes:

> Hi,
>
> What is the behaviour when either root or the SlurmUser update the
> duration of an unprivileged user's running job to exceed the "MaxTime"
> setting of the partition?
>
> The documentation includes the text "This limit does not apply to jobs
> executed by SlurmUser or user root." however what about jobs executed
> by normal users that are extended by SlurmUser or root?

You don't say where the text you quote is from, but I have often updated
the 'TimeLimit' of a user job to extend its run-time beyond the
'MaxTime' of the partition.
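That is, something along the lines of the following, run as root or
SlurmUser (the job ID and limit are just examples):

  scontrol update JobId=1234567 TimeLimit=10-00:00:00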

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Job stuck in CONFIGURING, node is 'mix~'

2017-09-12 Thread Loris Bennett

Hi Lyn,

Unfortunately, rebooting the node makes no difference to the state of
the node.  The job gets re-queued and the node goes back to 'mix~'.
What baffles me is that there is obviously some sort of communication
problem between the slurmctld on the admin node and the slurmd on the
compute node, but I can't find anything in the log files to indicate
what's going wrong.

Cheers,

Loris

Lyn Gerner <schedulerqu...@gmail.com> writes:

> Re: [slurm-dev] Job stuck in CONFIGURING, node is 'mix~' 
>
> Hi Loris,
>
> At least with earlier releases, I've not found a way to act directly upon the 
> job. However, if it's possible to down the node, that should requeue (or 
> cancel) the job.
>
> Best,
> Lyn
>
> On Tue, Sep 12, 2017 at 3:40 AM, Loris Bennett <loris.benn...@fu-berlin.de> 
> wrote:
>
>  Hi,
>
>  I have a node which is powered on and to which I have sent a job. The
>  output of sinfo is
>
>  PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>  test up 7-00:00:00 1 mix~ node001
>
>  The output of squeue is
>
>  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>  1795993 test 7_single loris CF 24:29 1 node001
>
>  I don't understand the node state 'mix~'. If at all, I would only
>  expect it to exist very briefly between 'idle~' and 'mix#'. The '~' is
>  certainly incorrect, as the node is not in a power-saving state, which
>  in our case is powered-off.
>
>  This problem may have existed in 16.05.10-2, but currently we are using
>  17.02.7. All other nodes in the cluster apart from one are functioning
>  normally.
>
>  Does anyone have any idea what we might be doing wrong?
>
>  Cheers,
>
>  Loris
>
>  --
>  Dr. Loris Bennett (Mr.)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Job stuck in CONFIGURING, node is 'mix~'

2017-09-12 Thread Loris Bennett

Hi,

I have a node which is powered on and to which I have sent a job.  The
output of sinfo is

  PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
  test   up 7-00:00:00  1   mix~ node001

The output of squeue is

JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
  1795993  test 7_single loris CF  24:29  1 node001

I don't understand the node state 'mix~'.  If at all, I would only
expect it to exist very briefly between 'idle~' and 'mix#'.  The '~' is
certainly incorrect, as the node is not in a power-saving state, which
in our case is powered-off.

This problem may have existed in 16.05.10-2, but currently we are using
17.02.7. All other nodes in the cluster apart from one are functioning
normally.

Does anyone have any idea what we might be doing wrong?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] sreport job sizesbyaccount over all accounts?

2017-07-25 Thread Loris Bennett

Hi,

I can do the following:

  $ sreport job sizesbyaccount -T mem -t hours start=2017-07-24 end=2017-07-24 grouping=18,42,90
  

  Job Sizes 2017-07-24T00:00:00 - 2017-07-24T00:59:59 (3600 secs)
  TRES type is mem
  Time reported in Hours
  

  Cluster   Account    0-17 TRES  18-41 TRES  42-89 TRES  >= 90 TRES  % of cluster
  --------- --------- ----------- ----------- ----------- ----------- ------------
  cluster   dept01        1281631       19200      122880           0       58.30%
  cluster   dept02              7           0           0           0        2.87%
  cluster   dept03         849353           0           0           0       34.78%
  cluster   dept04          88752           0           0           0        3.63%
  cluster   dept05          10240           0           0           0        0.42%

However I'd really just like to have the sums for the various TRES
groups over all departments (and then compare this with the values for
other time periods).

I'm going to read the data into R, so I can do the roll-up there, but I
wondered whether I can get the information directly from Slurm.
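For what it's worth, I suppose the roll-up could also be done with the
parsable output and a little awk; a rough sketch (the column positions
are an assumption which I haven't verified):

  $ sreport -nP job sizesbyaccount -T mem -t hours start=2017-07-24 \
        end=2017-07-24 grouping=18,42,90 | \
    awk -F'|' '{ for (i = 3; i <= 6; i++) sum[i] += $i }
               END { printf "0-17: %d  18-41: %d  42-89: %d  >=90: %d\n",
                            sum[3], sum[4], sum[5], sum[6] }'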

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Elapsed time for slurm job

2017-07-24 Thread Loris Bennett

Dear Sema,

You need to set up accounting first:

  https://slurm.schedmd.com/accounting.html

You obviously won't have data for jobs which ran before accounting was
set up.  When you have done this, you will be able to do something like

  sacct -j 123456 -o jobid,elapsed

for subsequent jobs.

Read 'man sacct' for more info.

Regards

Loris

Sema Atasever <s.atase...@gmail.com> writes:

> Re: [slurm-dev] Re: Elapsed time for slurm job 
>
> Dear Loris,
>
> When i try this command (sacct -o 2893,elapsed) i get this error message 
> unfortunately:
>
> SLURM accounting storage is disabled
>
> How to solve this problem?
>
> Regards, Sema.
>
> On Mon, Jul 24, 2017 at 4:25 PM, Loris Bennett <loris.benn...@fu-berlin.de> 
> wrote:
>
>  Sema Atasever <s.atase...@gmail.com> writes:
>
>  > Elapsed time for slurm job
>  >
>  > Dear Friends,
>  >
>  > How can i retrieve elapsed time if the slurm job has completed?
>  >
>  > Thanks in advance.
>
>  sacct -o jobid,elapsed
>
>  See 'man sacct' or 'sacct -e' for the full list of fields.
>
>  Cheers,
>
>  Loris
>
>  --
>  Dr. Loris Bennett (Mr.)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Elapsed time for slurm job

2017-07-24 Thread Loris Bennett

Sema Atasever <s.atase...@gmail.com> writes:

> Elapsed time for slurm job 
>
> Dear Friends,
>
> How can i retrieve elapsed time if the slurm job has completed?
>
> Thanks in advance.

sacct -o jobid,elapsed

See 'man sacct' or 'sacct -e' for the full list of fields.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Loris Bennett

Hi Ole,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> As a small contribution to the Slurm community, I've moved my collection of
> Slurm tools to GitHub at https://github.com/OleHolmNielsen/Slurm_tools.  These
> are tools which I feel makes the daily cluster monitoring and management a
> little easier.
>
> The following Slurm tools are available:
>
> * pestat Prints a Slurm cluster nodes status with 1 line per node and job 
> info.
>
> * slurmreportmonth Generate monthly accounting statistics from Slurm using the
> sreport command.
>
> * showuserjobs Print the current node status and batch jobs status broken down
> into userids.
>
> * slurmibtopology Infiniband topology tool for Slurm.
>
> * Slurm triggers scripts.
>
> * Scripts for managing nodes.
>
> * Scripts for managing jobs.
>
> The tools "pestat" and "slurmibtopology" have previously been announced to 
> this
> list, but future updates will be on GitHub only.
>
> I would also like to mention our Slurm deployment HowTo guide at
> https://wiki.fysik.dtu.dk/niflheim/SLURM
>
> /Ole

Thanks for sharing your tools.  Here are some brief comments

- psjob/psnode
  - The USERLIST variable makes the commands a bit brittle, since ps
will fail if you pass an unknown username.
- showuserjobs
  - Doesn't handle usernames longer than 8-chars (we have longer names)
  - The grouping doesn't seem quite correct.  As shown in the example
below, not all the users of the group appear under the group total
for the appropriate group:
  
Username     Jobs  CPUs   Jobs  CPUs  Group     Further info
===========  ====  ====   ====  ====  ========  ============
GRAND_TOTAL  168  1089 55   451  ALL   running+idle=1540 CPUs 29 users
GROUP_TOTAL   56   349 10   119  group01   running+idle=468 CPUs 8 users
user0127   324  452  group02   One, User
GROUP_TOTAL   27   324  452  group02   running+idle=376 CPUs 1 users
user0229   174  1 6  group01   Two, User
GROUP_TOTAL5   148 18   208  group03   running+idle=356 CPUs 4 users
user03 3   120 16   176  group03   Three, User
user041196  348  group01   Four, User
...

In general, maybe it would be good to have a common config file, where things
such as paths to binaries, USERLIST and username lengths are defined.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Loris Bennett

Hi Dennis,

Dennis Tants <dennis.ta...@zarm.uni-bremen.de> writes:

> Hello Loris,
>
> Am 10.07.2017 um 07:39 schrieb Loris Bennett:
>> Hi Dennis,
>>
>> Dennis Tants <dennis.ta...@zarm.uni-bremen.de> writes:
>>
>>> Hi list,
>>>
>>> I am a little bit lost right now and would appreciate your help.
>>> We have a little cluster with 16 nodes running with SLURM and it is
>>> doing everything we want, except a few
>>> little things I want to improve.
>>>
>>> So that is why I wanted to upgrade our old SLURM 15.X (don't know the
>>> exact version) to 17.02.4 on my test machine.
>>> I just deleted the old version completely with 'yum erase slurm-*'
>>> (CentOS 7 btw.) and build the new version with rpmbuild.
>>> Everything went fine so I started configuring a new slurm[dbd].conf.
>>> This time I also wanted to integrate backfill instead of FIFO
>>> and also use accounting (just to know which person uses the most
>>> resources). Because we had no databases yet I started
>>> slurmdbd and slurmctld without problems.
>>>
>>> Everything seemed fine with a simple mpi hello world test on one and two
>>> nodes.
>>> Now I wanted to enhance the script a bit more and include working in the
>>> local directory of the nodes which is /work.
>>> To get everything up and running I used the script which I attached for
>>> you (it also includes the output after running the script).
>>> It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
>>> before doing the mpi hello world.
>>> But it seems that srun does not know $SLURM_JOB_NAME even though it is
>>> there.
>>> /work/tants belongs to the correct user and has rwx permissions.
>>>
>>> So did I just configure something wrong or what happened here? Nearly
>>> the same example is working on our cluster with
>>> 15.X. The script is only for testing purposes, thats why there are so
>>> many echo commands in there.
>>> If you see any mistake or can recommend better configurations I would
>>> glady hear them.
>>> Should you need any more information I will provide them.
>>> Thank you for your time!
>> Shouldn't the variable be $SBATCH_JOB_NAME?
>>
>> Cheers,
>>
>> Loris
>>
>
> when I use "echo $SLURM_JOB_NAME" it will tell me the name I specified
> with #SBATCH -J.
> It is not working with srun in this version (it was working in 15.x).
>
> However, when I now use "echo $SBATCH_JOB_NAME" it is just a blank
> variable. As told by someone from the list,
> I used the command "env" to verify which variables are available. This
> list includes SLURM_JOB_NAME
> with the name I specified. So $SLURM_JOB_NAME shouldn't be a problem.
>
> Thank you for your suggestion though.
> Any other hints?
>
> Best regards,
> Dennis

The manpage of srun says the following:

  SLURM_JOB_NAME    Same as -J, --job-name except within an existing
                    allocation, in which case it is ignored to avoid
                    using the batch job's name as the name of each
                    job step.

This sounds like it might mean that if you submit a job script via
sbatch and in this script call srun, the variable might not be defined.
However, the wording is a bit unclear and I have never tried this
myself.
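A minimal way to test it might be a script like this (untested sketch):

  #!/bin/bash
  #SBATCH -J varcheck
  # Compare what the batch shell and the job step each see:
  echo "batch script sees: $SLURM_JOB_NAME"
  srun bash -c 'echo "job step sees:     $SLURM_JOB_NAME"'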

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: slurm 17.2.06 min memory problem

2017-07-09 Thread Loris Bennett

Hi Roy,

Roe Zohar <roezo...@gmail.com> writes:

> slurm 17.2.06 min memory problem 
>
> Hi all,
> I have installed the last Slurm version and I have noticed a strange behavior 
> with the memory allocated for jobs.
> In my slurm conf I am having:
> SelectTypeParameters=CR_LLN,CR_CPU_Memory
>
> Now, when I am sending a new job with out giving it a --mem amount, it 
> automatically assign it all the server memory, which mean I am getting only 
> one job per server.
>
> I had to add DefMemPerCPU in order to get around that.
>
> Any body know why is that?
>
> Thanks,
> Roy

What value of SelectType are you using?  Note also that CR_LLN schedules
jobs to the least loaded nodes, so until all nodes have one job, you
will not get more than one job per node.  See 'man slurm.conf'.
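For comparison, something along these lines (values purely illustrative)
gives per-CPU scheduling with a sensible memory default, so that jobs
submitted without --mem no longer block a whole node:

  SelectType=select/cons_res
  SelectTypeParameters=CR_CPU_Memory
  DefMemPerCPU=2048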

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-09 Thread Loris Bennett

Hi Dennis,

Dennis Tants <dennis.ta...@zarm.uni-bremen.de> writes:

> Hi list,
>
> I am a little bit lost right now and would appreciate your help.
> We have a little cluster with 16 nodes running with SLURM and it is
> doing everything we want, except a few
> little things I want to improve.
>
> So that is why I wanted to upgrade our old SLURM 15.X (don't know the
> exact version) to 17.02.4 on my test machine.
> I just deleted the old version completely with 'yum erase slurm-*'
> (CentOS 7 btw.) and build the new version with rpmbuild.
> Everything went fine so I started configuring a new slurm[dbd].conf.
> This time I also wanted to integrate backfill instead of FIFO
> and also use accounting (just to know which person uses the most
> resources). Because we had no databases yet I started
> slurmdbd and slurmctld without problems.
>
> Everything seemed fine with a simple mpi hello world test on one and two
> nodes.
> Now I wanted to enhance the script a bit more and include working in the
> local directory of the nodes which is /work.
> To get everything up and running I used the script which I attached for
> you (it also includes the output after running the script).
> It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
> before doing the mpi hello world.
> But it seems that srun does not know $SLURM_JOB_NAME even though it is
> there.
> /work/tants belongs to the correct user and has rwx permissions.
>
> So did I just configure something wrong or what happened here? Nearly
> the same example is working on our cluster with
> 15.X. The script is only for testing purposes, thats why there are so
> many echo commands in there.
> If you see any mistake or can recommend better configurations I would
> glady hear them.
> Should you need any more information I will provide them.
> Thank you for your time!

Shouldn't the variable be $SBATCH_JOB_NAME?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Length of possible SlurmDBD without HA

2017-07-06 Thread Loris Bennett

Hi,

On the Slurm FAQ page

  https://slurm.schedmd.com/faq.html

it says the following:

  52. How critical is configuring high availability for my database?

  Consider if you really need mysql failover. Short outage of
  slurmdbd is not a problem, because slurmctld will store all data
  in memory and send it to slurmdbd when it's back operating. The
  slurmctld daemon will also cache all user limits and fair share
  information.

I was wondering how long a "short outage" can be.  Presumably this is
determined by the amount of free memory on the server running slurmctld,
the number of jobs, and the amount of memory required per job.

So roughly how much memory will be required per job?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Slurm query

2017-07-04 Thread Loris Bennett

"suprita.bot...@wipro.com" <suprita.bot...@wipro.com> writes:

> Hi ,
>
> Just wanted to know, what is the meaning of * in the partition name.
>
> When we type the following command:
>
> Sinfo :
>
> The o/p comes as:
>
> [root@punehpcdl01 ~]# sinfo
>
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>
> debug* up infinite 1 idle punehpcdl01

It means that 'debug' is the default partition.  See 'man sinfo'.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-04 Thread Loris Bennett

Hi Sourabh,

sourabh shinde <sourabhshinde2...@gmail.com> writes:

> Re: [slurm-dev] Re: Multifactor Priority Plugin for Small clusters 
>
> Thank you guys for your reply. 
>
> @Loris : Yes, I had a look at Gang Scheduling, which does not fit my
> requirements.  In my case, a job which is scheduled should complete
> its execution and then the next job should start. This is what I need.
>
> @Chris : I had already set up accounting. but the resource limits was
> new. I have set limits on QOS and Users now, constraints are working
> well.
>
> @Ole : Wiki page was really helpful. :)
>
> Also, would modifying the multi-factor logic really help in my
> case?  If not, what else can I do in order to get at least close to
> what I need to achieve (the example referred to in my previous post)?

You may need to rethink what you are trying to achieve.  You seem to
expect that the priority of a job intrinsically has something to do
with the resources allocated to the job.  This may be the case if you
define your factors appropriately, but primarily the two are not
connected and the priority just determines which order jobs should be
started in at a given point in time.

From your example, what you seem to want is that three users, each with
a different degree of what I'll call "importance" can all start jobs at
the same time, but, depending on the amount of "importance", they can
use different numbers of nodes.  With multifactor fairshare, the
priorities would have to be essentially equal and you would have to
restrict the number of nodes for the different degrees of "importance".
This brings with it other problems, such as what happens if only the
user with the lowest "importance" has jobs in the queue.  Can he or she
use all the nodes, or do some remain idle in case a more "important" job
comes along?  
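Just to illustrate the kind of node restriction I mean, it would
presumably be something like the following per QOS (names and limits
invented, and I haven't tested this):

  sacctmgr -i modify qos low  set MaxTRESPerUser=node=4
  sacctmgr -i modify qos high set MaxTRESPerUser=node=8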

I think if you and your users can accept the idea of fairshare over a
period rather than at every point in time, you might save yourself a
great deal of time and trouble with Slurm.

Regards

Loris

> Thanks and Regards
> Sourabh 
>
> Regards,
> Sourabh Shinde
> +49 176 4569 5546
> sourabhshinde.cf
>
> On Mon, Jul 3, 2017 at 8:02 AM, Loris Bennett <loris.benn...@fu-berlin.de> 
> wrote:
>
>  Hi Sourabh,
>
>  sourabh shinde <sourabhshinde2...@gmail.com> writes:
>
>  > Multifactor Priority Plugin for Small clusters
>  >
>  > Hello Everyone,
>  >
>  > I am new to SLURM and trying to run it locally on my PC. I am using
>  > Multifactor plugin to assign priorities for the job. The problem is
>  > multi factor doesn’t work as needed on small clusters. I tried
>  > assigning weightage to the factors as per my need but the scheduler
>  > always schedule the job on FIFO basis.
>  >
>  > I am trying to find some alternative where making changes to the
>  > priority plugin code could make it work on small clusters.
>  >
>  > for e.g
>  >
>  > If I have 12 nodes on my cluster, and if 3 users A,B and C with QOS
>  > low, normal and high respectively submit their job for execution. I
>  > want that SLURM should assign not all nodes to the User A. Atleast 1
>  > node should be assigned to the users B and C which are having low and
>  > normal priority. how can I achieve this ?
>  >
>  > PS: Gang scheduling and preemption are not possible in my case.
>  >
>  > Any help would be appreciated.
>  >
>  > Thanks in advance.
>  >
>  > Regards,
>  > Sourabh Shinde
>
>  I don't think you can achieve what you want with Fairshare and
>  Multifactor Priority. Fairshare looks at distributing resources fairly
>  between users over a *period* of time. At any *point* in time it is
>  perfectly possible for all the resources to be allocated to one user.
>  It is only over time that the allocation of resources will average out
>  to correspond to how you have configured the shares.
>
>  If you only have a small amount of resources and a small number of
>  users, this may not work very well. Have you looked at Gang scheduling
>  without premption?
>
>  Cheers,
>
>  Loris
>
>  --
>  Dr. Loris Bennett (Mr.)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] sacct: --unit applied to NNodes

2017-07-04 Thread Loris Bennett

Hi,

With version 16.05.10-2, the option '--units' gets applied incorrectly to
the column 'NNodes':

$ sacct -u user1234 -o jobid,nnodes,ncpus,reqmem,maxrss,elapsed -S 2017-07-01 --units=G
       JobID   NNodes      NCPUS     ReqMem     MaxRSS    Elapsed
------------ -------- ---------- ---------- ---------- ----------
1601832         0.00G         16        4Gc            11-01:00:30
1601832.bat+    0.00G          1        4Gc      0.01G 11-01:00:30
1601832.0       0.00G          9        4Gc      7.42G 11-01:00:28
1699682         0.00G         16        4Gc               16:52:49
1699682.0       0.00G          3        4Gc               16:52:48

Has this been fixed in more recent versions?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Rewarding good memory requirement estimation on shared nodes?

2017-07-03 Thread Loris Bennett

Hi,

When nodes are being shared, it is desirable for users to estimate
memory requirements as accurately as possible.  One way to do this would
be to have a cron job bump up the priority of the jobs of users who have
a high average value of MaxRSS/ReqMem.  A more elegant way would be to
add a component to the multifactor priority plugin which would do the
same thing.  Is there any way to do this, short of writing one's own
version?
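To make the idea concrete, the cron job would calculate something like the
following per user (a very rough sketch: it assumes MaxRSS is reported in K
and ReqMem per node in Mn, so real output needs more careful unit handling):

  sacct -u user123 -n -P -s CD -S 2017-06-01 -o MaxRSS,ReqMem |
    awk -F'|' '$1 ~ /K$/ && $2 ~ /Mn$/ {
                 used  = $1 + 0             # KB
                 asked = ($2 + 0) * 1024    # MB -> KB
                 n++; sum += used / asked
               }
               END { if (n) printf "mean MaxRSS/ReqMem = %.2f\n", sum / n }'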

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-03 Thread Loris Bennett

Hi Sourabh,

sourabh shinde <sourabhshinde2...@gmail.com> writes:

> Multifactor Priority Plugin for Small clusters 
>
> Hello Everyone,
>
> I am new to SLURM and trying to run it locally on my PC. I am using
> Multifactor plugin to assign priorities for the job. The problem is
> multi factor doesn’t work as needed on small clusters. I tried
> assigning weightage to the factors as per my need but the scheduler
> always schedule the job on FIFO basis.
>
> I am trying to find some alternative where making changes to the
> priority plugin code could make it work on small clusters.
>
> for e.g
>
> If I have 12 nodes on my cluster, and if 3 users A,B and C with QOS
> low, normal and high respectively submit their job for execution. I
> want that SLURM should assign not all nodes to the User A. Atleast 1
> node should be assigned to the users B and C which are having low and
> normal priority. how can I achieve this ?
>
> PS: Gang scheduling and preemption are not possible in my case. 
>
> Any help would be appreciated.
>
> Thanks in advance. 
>
> Regards,
> Sourabh Shinde

I don't think you can achieve what you want with Fairshare and
Multifactor Priority.  Fairshare looks at distributing resources fairly
between users over a *period* of time.  At any *point* in time it is
perfectly possible for all the resources to be allocated to one user.
It is only over time that the allocation of resources will average out
to correspond to how you have configured the shares.

If you only have a small amount of resources and a small number of
users, this may not work very well.  Have you looked at Gang scheduling
without premption?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Loris Bennett

Hi Ole,

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.
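Concretely, the preparation step looks roughly like this for us (paths and
database name are whatever your slurm.conf and slurmdbd.conf say):

  systemctl stop slurmctld slurmdbd
  mysqldump -u slurm -p slurm_acct_db > slurm_acct_db_$(date +%F).sql
  cp -a /var/spool/slurmctld /root/statesave_backup_$(date +%F)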

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?

Cheers,

Loris

Lachlan Musicman <data...@gmail.com> writes:

> Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database 
>
> We did it in place, worked as noted on the tin. It was less painful
> than I expected. TBH, your procedures are admirable, but you shouldn't
> worry - it's a relatively smooth process.
>
> cheers
> L.
>
> --
> "Mission Statement: To provide hope and inspiration for collective action, to 
> build collective power, to achieve collective transformation, rooted in grief 
> and rage but pointed towards vision and dreams."
>
> - Patrisse Cullors, Black Lives Matter founder
>
> On 26 June 2017 at 20:04, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
>  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step 
> seems to me to be the upgrade of the slurmdbd database, which may also take 
> tens of minutes.
>
>  I thought it's a good idea to test the slurmdbd database upgrade locally on 
> a drained compute node in order to verify both correctness and the time 
> required.
>
>  I've developed the dry run upgrade procedure documented in the Wiki page 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
>  Question 1: Would people who have real-world Slurm upgrade experience kindly 
> offer comments on this procedure?
>
>  My testing was actually successful, and the database conversion took less 
> than 5 minutes in our case.
>
>  A crucial step is starting the slurmdbd manually after the upgrade. But how 
> can we be sure that the database conversion has been 100% completed?
>
>  Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything 
> rolled up" indeed signifies that conversion is complete?
>
>  Thanks,
>  Ole
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Loris Bennett

Michael Jennings <m...@lanl.gov> writes:

> On Thursday, 22 June 2017, at 04:19:04 (-0600),
> Loris Bennett wrote:
>
>>   rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26-4.el6.src.rpm
>
> Remove the equals signs.  I have no problems building pdsh 2.29 via:
>
>   rpmbuild --rebuild --with slurm --without torque pdsh-2.29-1.el7.src.rpm
>
> for EL5, EL6, and EL7.

Thanks, Michael.  That did the trick.
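For the archives: with the Slurm module built, one can then address the
nodes of a running job directly, e.g. (the job ID is just an example;
check 'pdsh -h' for the options your build actually provides):

  pdsh -j 1795993 uptime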

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Loris Bennett

Hi Ole,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> You may want to throw in a uniq command in case the user runs multiple jobs on
> some nodes:
>
> # squeue -u user123 -h -o "%N" | tr '\n' , | xargs scontrol show 
> hostlistsorted
> b[135,135,135]
>
> This gives a better list:
>
> # squeue -u user123 -h -o "%N" | uniq | tr '\n' , | xargs scontrol show
> hostlistsorted
> b135
>
> BTW, if you enter a non-existent user, the output is an unexpected error 
> message
> and a long help info :-)
>
> /Ole

I have just realised that pdsh, which was what I wanted the consolidated
list for, has a Slurm module, which knows about Slurm jobs.  I followed
your instructions here:

  https://wiki.fysik.dtu.dk/niflheim/SLURM#pdsh-parallel-distributed-shell

with some modifications for EPEL6.  However, the 'rebuild' line

  rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26-4.el6.src.rpm

fails with

  --with=slurm: unknown option

The page https://github.com/grondo/pdsh implies it should be 

  rpmbuild --rebuild --with-slurm --without-torque pdsh-2.26-4.el6.src.rpm

but this also fails:

  --with-slurm: unknown option

Any ideas what I'm doing wrong?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Loris Bennett

Hi,

Kent Engström <k...@nsc.liu.se> writes:

> "Loris Bennett" <loris.benn...@fu-berlin.de> writes:
>> Hi,
>>
>> I can generate a list of node lists on which the jobs of a given user
>> are running with the following:
>>
>>   $ squeue -u user123 -h -o "%N"
>>   node[006-007,014,016,021,024]
>>   node[012,094]
>>   node[005,008-011,013,015,026,095,097-099]
>>
>> I would like to merge these node lists to obtain
>>
>>   node[005-016,021,024,026,094-095,097-099]
>>
>> I can do the following:
>>
>>   $ squeue -u user123 -h -o "%N" | xargs -I {} scontrol show hostname {} | 
>> sed ':a;N;$!ba;s/\n/,/g' | xargs scontrol show hostlistsorted
>>   node[005-016,021,024,026,094-095,097-099]
>>
>> Would it be worth adding an option to allow the delimiter in the output
>> of 'scontrol show hostname' to be changed from an newline to, say, a
>> comma?  That would permit easier manipulation of node lists without
>> one having to google the appropiate sed magic.
>
> Hi,
>
> slighly off topic, but if you are willing to install and use an external
> program that is not part of SLURM itself, I might perhaps be allowed to
> advertise the python-hostlist package?
>
> Your example would be:
>
> squeue -u user123 -h -o "%N" | hostlist -c -
>
> (read as "Collapse several hostlist into one, and take the input from
> stdin").
>
> You find it at:
>   https://pypi.python.org/pypi/python-hostlist
>   https://www.nsc.liu.se/~kent/python-hostlist/
>
> Best Regards,
> --
> Kent Engström, National Supercomputer Centre
> k...@nsc.liu.se, +46 13 28 

In fact, I had already spotted your ad from 2010:

  https://groups.google.com/forum/#!topic/slurm-devel/n6x2WgGmDls

but was wondering whether there might be any interest in having a more
tightly integrated solution.

I am not averse to installing an external program and I do rather like
Python, despite having done most of my programming in Perl.  However, my
experience of installing Python software for users is that the package
management is somewhat fragmented and brittle.  Nevertheless, I was able
to install your package (with only minor moaning from pip) and it works
fine.

Thanks,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Loris Bennett

Hi Jens,

Well golfed.  I hadn't realised that 'hostlistsorted' will take
multiple sorted lists and resort them.

Cheers,

Loris

Jens Dreger <jens.dre...@physik.fu-berlin.de> writes:

> I think
>
>   squeue -u user123 -h -o "%N" | tr '\n' , | xargs scontrol show 
> hostlistsorted
>
> should also do it... Slightly better to remember ;)
>
> On Thu, Jun 22, 2017 at 02:59:11AM -0600, Loris Bennett wrote:
>> 
>> Hi,
>> 
>> I can generate a list of node lists on which the jobs of a given user
>> are running with the following:
>> 
>>   $ squeue -u user123 -h -o "%N"
>>   node[006-007,014,016,021,024]
>>   node[012,094]
>>   node[005,008-011,013,015,026,095,097-099]
>> 
>> I would like to merge these node lists to obtain
>> 
>>   node[005-016,021,024,026,094-095,097-099]
>> 
>> I can do the following:
>> 
>>   $ squeue -u user123 -h -o "%N" | xargs -I {} scontrol show hostname {} | 
>> sed ':a;N;$!ba;s/\n/,/g' | xargs scontrol show hostlistsorted
>>   node[005-016,021,024,026,094-095,097-099]
>> 
>> Would it be worth adding an option to allow the delimiter in the output
>> of 'scontrol show hostname' to be changed from an newline to, say, a
>> comma?  That would permit easier manipulation of node lists without
>> one having to google the appropiate sed magic.
>> 

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Loris Bennett

Hi,

I can generate a list of node lists on which the jobs of a given user
are running with the following:

  $ squeue -u user123 -h -o "%N"
  node[006-007,014,016,021,024]
  node[012,094]
  node[005,008-011,013,015,026,095,097-099]

I would like to merge these node lists to obtain

  node[005-016,021,024,026,094-095,097-099]

I can do the following:

  $ squeue -u user123 -h -o "%N" | xargs -I {} scontrol show hostname {} | sed 
':a;N;$!ba;s/\n/,/g' | xargs scontrol show hostlistsorted
  node[005-016,021,024,026,094-095,097-099]

Would it be worth adding an option to allow the delimiter in the output
of 'scontrol show hostname' to be changed from an newline to, say, a
comma?  That would permit easier manipulation of node lists without
one having to google the appropiate sed magic.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: ExitCode 139

2017-06-21 Thread Loris Bennett

Hi Djibril,

Djibril Mboup <djibril.mb...@aims-senegal.org> writes:

> Re: [slurm-dev] Re: ExitCode 139 
>
> I see M Loris, but I don't know why I got this error. Does it mean I don't 
> have enough memory to execute my code? You can see my batch script below:
>
> #SBATCH --partition=x
> #SBATCH --account=x
> #SBATCH --nodes=2
> #SBATCH --ntasks=4 
> #SBATCH --cpus-per-task=20 
> #SBATCH --time=01:00:00
> #SBATCH --exclusive
>
> srun hostname -s| sort -u > mpd.hosts
>
> mpiexec.hydra -f mpd.hosts -perhost $nb_cpu -n $SLURM_NTASKS ./code -c 
> config.info

I can't tell how much memory your program needs just by looking at the
batch script.  It depends on what the program does and, possibly, what
parameters you pass to it in 'config.info'.

The error could be due to the program running out of memory, but it
could also be due to your program doing something wrong, such as trying
to write beyond the bounds of an array.

This is probably unrelated, but the value of --cpus-per-task is quite
high.  Do the nodes have 20 CPUs each?

Cheers,

Loris

> On 21 June 2017 at 05:45, Loris Bennett <loris.benn...@fu-berlin.de> wrote:
>
>  Hi Djibril,
>
>  Djibril Mboup <djibril.mb...@aims-senegal.org> writes:
>
>  > ExitCode 139
>  >
>  > Hello,
>  > Since yesterday, I have got an error after submitting a job. The exit code 
> 139:0 remind you something.
>  > Thanks
>
>  Try searching for "exit code 139" with your favourite search engine.
>  You will find that it indicates that your program experienced a
>  segmentation fault.
>
>  Cheers,
>
>  Loris
>
>  --
>  Dr. Loris Bennett (Mr.)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: ExitCode 139

2017-06-20 Thread Loris Bennett

Hi Djibril,

Djibril Mboup <djibril.mb...@aims-senegal.org> writes:

> ExitCode 139 
>
> Hello, 
> Since yesterday, I have got an error after submitting a job. Does the exit
> code 139:0 remind you of something?
> Thanks

Try searching for "exit code 139" with your favourite search engine.
You will find that it indicates that your program experienced a
segmentation fault (139 = 128 + 11, where 11 is the signal number of SIGSEGV).

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Loris Bennett

Hi Ole,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> On 06/20/2017 04:32 PM, Loris Bennett wrote:
>> We do our upgrades while full production is up and running.  We just stop
>> the Slurm daemons, dump the database and copy the statesave directory
>> just in case.  We then do the update, and finally restart the Slurm
>> daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
>> so, but that was due a rather brittle configuration provided by our
>> vendor (the statesave path contained the Slurm version), rather than
>> Slurm itself and was before we had acquired any Slurm expertise
>> ourselves.
>
> 1. When you refer to "daemons", do you mean slurmctld, slurmdbd as well as
> slurmd on all compute nodes?  AFAIK, the recommended procedure upgrading and
> restarting in this order: 1) slurmdbd, 2) slurmctld, 3) slurmd on nodes.

We don't stop slurmd on the nodes.  The nodes only get the new Slurm
version on the next reboot.  The documentation mentions this possibility
of this kind of rolling upgrade and we haven't had any problems with it.

> 2. When you mention statesave, I suppose this is what you refer to:
> # scontrol show config | grep -i statesave
> StateSaveLocation   = /var/spool/slurmctld

Yes, that's right.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Loris Bennett

Hi Nick,

We do our upgrades while full production is up and running.  We just stop
the Slurm daemons, dump the database and copy the statesave directory
just in case.  We then do the update, and finally restart the Slurm
daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
so, but that was due a rather brittle configuration provided by our
vendor (the statesave path contained the Slurm version), rather than
Slurm itself and was before we had acquired any Slurm expertise
ourselves.

Paul: How do you pause the jobs?  SIGSTOP all the user processes on the
cluster?

Cheers,

Loris


Paul Edmon <ped...@cfa.harvard.edu> writes:

> If you follow the guide on the Slurm website you shouldn't have many 
> problems. We've made it standard practice here to set all partitions to DOWN 
> and suspend all the jobs when we do upgrades. This has led to
> far greater stability. So we haven't lost any jobs in an upgrade. The only 
> weirdness we have seen is if jobs exit while the DB upgrade is going. 
> Sometimes it can leave residual jobs in the DB that weren't properly closed
> out. This is why we pause all the jobs as it makes it such that we don't end 
> up with jobs exiting before the DB is back. In 16.05+ you have the:
>
> sacctmgr show runawayjobs
>
> Feature which can clean up all those orphan jobs. So its not as much a 
> concern anymore.
>
> Beyond that we follow the guide at the bottom of this page:
>
> https://slurm.schedmd.com/quickstart_admin.html
>
> I haven't tried going two major versions at once though. The docs indicate 
> that it should work fine. We generally try to keep pace with current stable.
>
> Given that you only have 100,000 jobs your upgrade should probably go fairly 
> quick. I could imagine around 10-15 minutes. Our DB has several million jobs 
> and it takes about 30 min to an hour depending on what
> operations are bing done.
>
> -Paul Edmon-
>
> On 06/20/2017 09:37 AM, Nicholas McCollum wrote:
>
>  I'm about to update 15.08 to the latest SLURM in August and would appreciate 
> any notes you have on the process. 
>
>  I'm especially interested in maintaining the DB as well as associations. I'd 
> also like to keep the pending job list if possible.
>
>  I've only got around 100,000 jobs in the DB so far, since January. 
>
>  Thanks
>
>  Nick McCollum
>  Alabama Supercomputer Authority
>
>  On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote:
>
>  Yeah, that sounds about right. Changes between major versions can take 
>  quite a bit of time. In the past I've seen upgrades take 2-3 hours for 
>  the DB.
>
>  As for ways to speed it up. Putting the DB on newer hardware if you 
>  haven't already helps quite a bit (depends on architecture as to how 
>  much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell 
>  and saw a factor of 3-4 speed improvement). Upgrading to the latest 
>  version of MariaDB if you are on an old version of MySQL can get you 
>  about 30-40%.
>
>  Doing all of these whittled our DB upgrade times for major upgrades to 
>  about 30 min or so.
>
>  Beyond that I imagine some more specific DB optimization tricks could be 
>  done, but I'm not a DB admin so I won't venture to say.
>
>  -Paul Edmon-
>
>  On 06/20/2017 08:42 AM, Tim Fora wrote:
>  > Hi,
>  >
>  > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
>  > start. Logs show most of the time was spent on this step and other table
>  > changes:
>  >
>  > adding column admin_comment after account in table
>  >
>  > Does this sound right? Any ideas to help things speed up.
>  >
>  > Thanks,
>  > Tim
>  >
>  >
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Loris Bennett

Hi Tim,

Tim Fora <tf...@riseup.net> writes:

> Hi,
>
> Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
> start. Logs show most of the time was spent on this step and other table
> changes:
>
> adding column admin_comment after account in table
>
> Does this sound right? Any ideas to help things speed up.

It probably depends a great deal on how many entries you have in your
database and what sort of hardware you have.  We are up to around 1.6
million jobs and have never purged anything.  I seem to remember the
last update between major releases taking long enough to allow me to get
slightly uneasy, but not long enough for me to really worry, so I guess
it was probably around 10-15 minutes.  Our CPUs are around 6 years old,
but the DB is on an SSD.

HTH

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Can't get formatted sinfo to work...

2017-06-19 Thread Loris Bennett

Hi Mehmet,

"Belgin, Mehmet" <mehmet.bel...@oit.gatech.edu> writes:

> I’m troubleshooting an issue that causes NHC to fail to offline a bad
> node. The node offline script uses formatted “sinfo" to identify the
> node status, which returns blank for some reason. Interestingly, sinfo
> works without custom formatting.
>
> Could this be due to a bug in the current version (17.02.4)? Would
> someone mind trying the following commands in an older Slurm version
> to compare the output?
>
> [root@devel-vcomp1 nhc]# sinfo --version
> slurm 17.02.4
>
> [root@devel-vcomp1 nhc]# sinfo -o '%t %E' -hn `hostname`
>
> (NOTHING!)
>
> [root@devel-vcomp1 nhc]# sinfo -hn `hostname`
> test up infinite 0 n/a
> vtest* up infinite 0 n/a
>
> (OK)
>
> Thanks!
>
> -Mehmet
>

Seems to work as expected with our version:

[root@node003 ~]# sinfo --version 
slurm 16.05.10-2
[root@node003 ~]# sinfo -o '%t %E' -hn `hostname`
mix none
[root@node003 ~]# sinfo -hn `hostname`
test      up    3:00:00      0    n/a
main*     up 14-00:00:0      1    mix     node003
gpu       up 14-00:00:0      0    n/a

HTH,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Slurm accounting problem with GPFS

2017-06-09 Thread Loris Bennett

> Am 09.06.2017 um 12:02 schrieb Loris Bennett:
>> 
>> Hi Marcel,
>> 
>> Marcel Sommer <marcelsommer...@gmail.com> writes:
>> 
>>> Slurm accounting problem with GPFS 
>>>
>>> Hi, 
>>>
>>> we are running slurm 2.6.5 and we have a master and a backup
>>> controller configuration. We use the filetxt plugin and the accounting
>>> logfile is stored in a folder on a central filesystem (GPFS).
>>>
>>> The problem that we have is that the accounting stuck after a couple
>>> of days. When I restart the slurm daemon it works fine but a few days
>>> later the problem comes again.
>>>
>>> Have you any suggestions?
>>>
>>> Cheers,
>>> Marcel
>> 
>> The combination of GPFS, filetxt, and, in particular, such an old version
>> of Slurm is probably quite rare, so I suspect not many people will be
>> able to help you.  Unless, that is, it is a known problem, in which case
>> it has probably been fixed in a later version.  In addition, it now says
>> the following on the Slurm download page:
>> 
>>   Due to a security vulnerability (CVE-2016-10030), all versions of
>>   Slurm prior to 15.08.13 or 16.05.8 are no longer available.
>> 
>> So you need to do an update anyway.  And as the intermediate versions
>> are now no longer available, you basically just need to set up Slurm
>> again from scratch.
>> 
>> Sorry about that,
>> 
>> Loris
>> 
"marcelsommer...@gmail.com" <marcelsommer...@gmail.com> writes:

> Hi Loris,
>
> thank you for the quick reply. Unfortunately this old version comes from
> Ubuntu 14.04 what we have installed on the nodes.
>
> OK, the easier solution for us is to try to install an newer slurm
> version on this OS.
>
> ...or can a slurmdbd backend solve this problem?

Hard to say, but even if it does fix this problem, you will still be
using a very old version with a serious security problem (although if
you are not using a prolog script, you won't be affected).

If I were you, I would install a fairly current version.  You will get a
lot of new functionality and more people on the mailing list will be
able to help you if you do have any issues.  But, of course, it's your
call.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Slurm accounting problem with GPFS

2017-06-09 Thread Loris Bennett

Hi Marcel,

Marcel Sommer <marcelsommer...@gmail.com> writes:

> Slurm accounting problem with GPFS 
>
> Hi, 
>
> we are running slurm 2.6.5 and we have a master and a backup
> controller configuration. We use the filetxt plugin and the accounting
> logfile is stored in a folder on a central filesystem (GPFS).
>
> The problem that we have is that the accounting stuck after a couple
> of days. When I restart the slurm daemon it works fine but a few days
> later the problem comes again.
>
> Have you any suggestions?
>
> Cheers,
> Marcel

The combination of GPFS, filetxt, and, in particular, such an old version
of Slurm is probably quite rare, so I suspect not many people will be
able to help you.  Unless, that is, it is a known problem, in which case
it has probably been fixed in a later version.  In addition, it now says
the following on the Slurm download page:

  Due to a security vulnerability (CVE-2016-10030), all versions of
  Slurm prior to 15.08.13 or 16.05.8 are no longer available.

So you need to do an update anyway.  And as the intermediate versions
are now no longer available, you basically just need to set up Slurm
again from scratch.

Sorry about that,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: understanding of Purge in Slurmdb.conf

2017-06-07 Thread Loris Bennett

Hi Rohan,

Rohan Gadalkar <rohangadal...@gmail.com> writes:

> understanding of Purge in Slurmdb.conf 
>
> Hello Slurm Team,
>
> I am new-bee to the world of SLURM. I was going through the
> Slurmdb.conf page, where I came across the PurgeEventAfter etc. as
> mentioned below.
>
> Unable to understand the below things, as in each of the below most of
> the lines are copied.
>
> I would request you to share any kind of diagrammatic explanation
> which will clear the confusion in understanding this topic.
>
> Below is the link and topics which I want you to explain for me.
>
> https://slurm.schedmd.com/slurmdbd.conf.html
>
> PurgeEventAfter;PurgeJobAfter;PurgeResvAfter;PurgeStepAfter;
> PurgeSuspendAfter;PurgeTXNAfter;PurgeUsageAfter
>
> Looking forward to your KB, as it will help me and my colleagues to
> understand this.

You really need to be more specific about what you don't understand.
The documentation you refer to seems to me to be fairly clear.  As
described, the parameters just allow you to set various time periods
after which various types of entries in the database will be purged.
I'm not sure how a diagram would help in this case.
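If a concrete example helps, a slurmdbd.conf fragment might look
something like the following (the values are made up, not
recommendations, and the exact unit suffixes should be checked against
the man page):

  PurgeEventAfter=12months
  PurgeJobAfter=12months
  PurgeResvAfter=12months
  PurgeStepAfter=6months
  PurgeSuspendAfter=6months
  PurgeTXNAfter=12months
  PurgeUsageAfter=24months

Anything older than the given period is then removed (or archived, if
the corresponding Archive* options are set) during the rollup.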

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: srun - replacement for --x11?

2017-06-06 Thread Loris Bennett

Edward Walter <ewal...@cs.cmu.edu> writes:

> On 06/06/2017 05:29 AM, Loris Bennett wrote:
>>
>> Hi,
>>
>> We used to tell users that they could specify the '--x11' option
>> to run a graphical application interactively within a Slurm job.
>> With version 16.05.10-2 this option is no longer available.
>>
>> Is the canonical solution now to use the scripts given here:
>>
>>https://slurm.schedmd.com/faq.html#terminal
>>
>> (or one of the various modifications/forks)?
>>
> Doesn't that functionality come from a spank plugin?
> https://github.com/hautreux/slurm-spank-x11
>
> Hope that helps.
>
> -Ed

It may well do, but the last commit is from 11th December 2014.  Up to
now I thought '--x11' was the shiny new replacement for the SPANK
plugin.

I have tried

  https://github.com/jabl/sinteractive.git 

but it didn't really work for me:

  $ sinteractive
  Waiting for JOBID 1551288 to start
  No screen session found.
  No screen session found.
  No screen session found.
  No screen session found.
  There is no screen to be detached matching slurm1551288.
  Connection to node001 closed.

I guess I'll just have to try out some of the other forks, etc.

Cheers,

Loris


-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] srun - replacement for --x11?

2017-06-06 Thread Loris Bennett

Hi,

We used to tell users that they could specify the '--x11' option 
to run a graphical application interactively within a Slurm job.
With version 16.05.10-2 this option is no longer available.

Is the canonical solution now to use the scripts given here:

  https://slurm.schedmd.com/faq.html#terminal

(or one of the various modifications/forks)?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Wrong Python version used in batch MPI job

2017-06-02 Thread Loris Bennett

Hi,

My system Python is version 2.6.6.  Using RedHat Software Collections I
have successfully built a program using Python 3.5.1 and Intel's MPI.
When I run a job with

scl enable python35 bash
module load gpaw/test
gpaw -P 4 test

via Slurm, I get the following error:

File 
"/cm/shared/apps/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/bin/mpiexec",
 line 187
  except EOFError, e:
 ^
  SyntaxError: invalid syntax

This is because the mpiexec script is written for Python 2 but is being
interpreted by Python 3.

Has anyone had a similar issue and come up with a solution?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Multinode MATLAB jobs

2017-06-01 Thread Loris Bennett

Hi Benjamin,

Benjamin Redling <benjamin.ra...@uni-jena.de> writes:

> Hi,
>
> Am 31.05.2017 um 10:39 schrieb Loris Bennett:
>> Does any one know whether one can run multinode MATLAB jobs with Slurm
>> using only the Distributed Computing Toolbox?  Or do I need to be
>> running a Distributed Computing Server too?
>
> if you can get a hand on the overpriced and underwhelming DCS (at least
> up to the 2016b Linux variant the mdce service neither has startup
> scripts with LSB tags, nor systemd units; only the very first annoyance),
> the following might be a consolation:
> "
> Access to all eligible licensed toolboxes or blocksets with a single
> server license on the distributed computing resource
> "
> https://www.mathworks.com/products/distriben/features.html
>
>
> (We currently use DCS without Slurm integration and thous are bad
> citizens considering the license pool we have to share.
> But running DCS without scheduler integration is bad in many ways. e.g.
> proper security levels don't cooperate with plain LDAP, default security
> runs job as root [hello, inaccessible NFS shares] so it seems users
> either start single node parallel jobs apart from DCS or DCS-Slurm
> integration is mandatory and you get all the benefits -- license count,
> security level, multi-node)

Thanks for the information.  Currently we only have one user wanting to
run jobs on more cores than we have on individual nodes and I also need
to check how his code scales before shelling out for DCS.

We are also bad citizens in that we allow some interactive and batch
usage of MATLAB licenses from the same pool.  To stop jobs failing
because the licenses are all in use, jobs have to specify a reservation
containing the available licenses.  This is updated regularly by a cron
job which parses the output of the license manager.  It works, but it's
a bit of a nasty hack.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Multinode MATLAB jobs

2017-05-31 Thread Loris Bennett

John DeSantis <desan...@usf.edu> writes:

> Loris,
>
>>> Does any one know whether one can run multinode MATLAB jobs with Slurm
>
> I completely missed the _multinode_ part.  Feel free to ignore, and sorry to 
> all for the noise in
> the list!

No problem.  The bit about having separate cluster profiles was new to
me, so I still learned something :-)

Loris

> John DeSantis
>
> John DeSantis wrote:
>> 
>> Loris,
>> 
>>> Does any one know whether one can run multinode MATLAB jobs with Slurm 
>>> using only the 
>>> Distributed Computing Toolbox?  Or do I need to be running a Distributed 
>>> Computing Server
>>> too?
>> 
>> Our users are able to use only the Distributed Computing Toolbox by ensuring 
>> that they:
>> 
>> 1.)  Request a single node with the desired number of processors per parpool 
>> [0]; 2.)  Ensure
>> that a separate cluster profile is created with each job.
>> 
>> By taking the two steps above, users can submit multiple jobs without MATLAB 
>> crashing stating
>> that a pool is already open.
>> 
>> [0]  Nodes in our cluster depending on their age have between 12-24 
>> processors available.  If
>> a user wants a parpool of 24, they must request either a constraint or a 
>> combination of -N 1
>> and --ntasks-per-node=24, for example.
>> 
>> HTH, John DeSantis
>> 
>> Loris Bennett wrote:
>> 
>>> Hi,
>> 
>>> Does any one know whether one can run multinode MATLAB jobs with Slurm 
>>> using only the 
>>> Distributed Computing Toolbox?  Or do I need to be running a Distributed 
>>> Computing Server
>>> too?
>> 
>>> Cheers,
>> 
>>> Loris
>> 
>> 
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Multinode MATLAB jobs

2017-05-31 Thread Loris Bennett

Hi,

Does any one know whether one can run multinode MATLAB jobs with Slurm
using only the Distributed Computing Toolbox?  Or do I need to be
running a Distributed Computing Server too?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] RE: Slurm job priorities

2017-04-27 Thread Loris Bennett
 equals 2**63, i.e. 1 larger than the largest
signed 64-bit integer, which looks like some sort of overflow or type
mismatch.

Unless anyone else has any ideas, I would be tempted to say that your
database is borked and you need to start over again.

Sorry not be more helpful :-(

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] RE: Slurm job priorities

2017-04-27 Thread Loris Bennett

Hi David,

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hi Loris,
>
> Thank you for your reply. Below is the output from the sprio -n command -- 
> David
>
> [djb1@blue34 slurm]$ sprio -n
>   JOBID PRIORITY   AGEFAIRSHARE  JOBSIZEPARTITION  QOS
>
>   259920.0717171  nan0.0010979  1.000  
> 0.000 
>   259930.0717113  nan0.0010979  1.000  
> 0.000 
>   259940.0716807  nan0.0010979  1.000  
> 0.000 
>   259950.0716741  nan0.0010979  1.000  
> 0.000 
>   259960.0716667  nan0.0010979  1.000  
> 0.000 
>   259970.0716592  nan0.0010979  1.000  
> 0.000 
>   259990.0104456  nan0.0005946  1.000  
> 0.000 
>   260000.0102257  nan0.0005946  1.000  
> 0.000 
>   260010.0098041  nan0.0010979  1.000  
> 0.000 
>   260030.0095379  nan0.0005946  1.000  
> 0.000 
>   260040.0094436  nan0.0005946  1.000  
> 0.000 
>   260050.0094114  nan0.0005946  1.000  
> 0.000 
>   260060.0093742  nan0.0005946  1.000  
> 0.000 
>   260070.0091526  nan0.0005946  1.000  
> 0.000 
>   260080.0091154  nan0.0005946  1.000  
> 0.000 
>   260090.0090832  nan0.0005946  1.000  
> 0.000 
>   260100.0087988  nan0.0005946  1.000  
> 0.000 
>   260110.0087054  nan0.0005946  1.000  
> 0.000 
>   260120.0086119  nan0.0005946  1.000  
> 0.000 
>   260140.0054638  nan0.0005946  1.000  
> 0.000 
>   260160.0026513  nan0.0005946  1.000  
> 0.000 
>   259880.0717221  nan0.0010979  1.000  
> 0.000

What about

  sshare -la

?  That should show you something about how the fairshare values is
calculated from the raw shares and the CPU usage.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] RE: Slurm job priorities

2017-04-27 Thread Loris Bennett

Hi David,

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hi Loris,
>
> Thank you again for your comments. I thought that I understood the
> situation better, and I have followed your basic model re setting up
> shares. So, for example, I have users in the group "research" and
> added shares accordingly...see below. In other words all the users in
> the research group have an equal share of the pie. On the other hand I
> see that "sprio" is still reporting "nan" for the fairshare. Have I
> missed something fundamental here? I even waited some time before
> submitting another test job, however the situation was unchanged.
>
>  Some clues would really be appreciated, please.
>
> Best regards,
> David
>
> [djb1@blue34 slurm]$ sacctmgr list assoc tree format=account,user,fairshare
>  Account   User Share 
>  -- - 
> root1 
>  root  root 1 
>  gpuusers   2 
>   gpuusers djb1 1 
>   gpuusers  hpc 1 
>  research  15 
>   research  ab24g12 1 
>   research cica1d14 1 
>   research djb1 1 
>   research  dpm1u13 1 
>   research  gtj1y12 1 
>   research  hpc 1 
>   research  icw 1 
>   research  jag1g13 1 
>   research  jec1f12 1 
>   research  lmr1u16 1 
>   research   mb1a10 1 
>   research  mjp1m12 1 
>   research   ph1m12 1 
>   research  srw1g10 1 
>   research   tp1v09 1
>
> [djb1@blue34 slurm]$ sprio -l 
>   JOBID USER   PRIORITYAGE  FAIRSHAREJOBSIZE  
> PARTITIONQOSNICE TRES
>   25992  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25993  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25994  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25995  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25996  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25997  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25988  mjp1m12 -922337203 63nan  1   
> 1000  0   0 
>   25999 djb1 -922337203  2nan  1   
> 1000  0   0 
>   26000     djb1 -922337203  2nan  1   
> 1000  0   0
> 

What does

  sprio -n

show (this shows the normalised, i.e. unweighted, priority factors)?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] RE: Slurm job priorities

2017-04-26 Thread Loris Bennett

Hi David,

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hi Loris,
>
> Thank you for your reply. The output from "sprio -l" is:
>
>   JOBID USER   PRIORITYAGE  FAIRSHAREJOBSIZE  
> PARTITIONQOSNICE TRES
>   25988  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25992  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25993  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25994  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25995  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25996  mjp1m12 -922337203  2nan  1   
> 1000  0   0 
>   25997  mjp1m12 -922337203  2nan  1   
> 1000  0   0
>
> I've also attached a copy of our slurm.conf, if that helps. Any advice that 
> you could give us would be appreciated.

A value of 'nan' for 'FAIRSHARE' is not what you want.  I suspect you
haven't set up any shares.  What does the following produce?

  sacctmgr list assoc tree format=account,user,fairshare

For me this looks something like:

 Account   User Share 
 -- - 
root1 
 root  root 1 
 bcp  169 
  biology  15 
   group01  3 
group01   alice 1 
group01 bob 1 
group01   carol 1 
   group02  1 
group02dave 1 
   ...

For each user and account you need to set up the shares.  Check the
official 'sacctmgr' page:

  https://slurm.schedmd.com/sacctmgr.html

Ole Holm Nielsen also has some helpful information on the following
page:

  https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting

In general it can be a bit of a faff setting up and maintaining shares.
All our users have equal shares and only belong to one account, so when
we add a user, we just automatically increment all the shares up to the
top of the hierarchy and decrement correspondingly when the user is
deleted.
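
As a rough sketch of what that looks like with 'sacctmgr' (the account
names are taken from the example above; the commands are from memory,
so check them before use):

  # create the association for the new user with one share
  sacctmgr add user alice account=group01 fairshare=1
  # bump the shares of the account and its parents by one
  sacctmgr modify account group01 set fairshare=4
  sacctmgr modify account biology set fairshare=16
  sacctmgr modify account bcp set fairshare=170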

HTH

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Slurm job priorities

2017-04-26 Thread Loris Bennett

Hi David,

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hello,
>
> I guess that may a simple question for someone more experienced with slurm 
> scheduling than us. When jobs are queuing in our cluster we find that we get 
> a lot of these messages in our slurmctld.log
>
> error: Job 25766 priority exceeds 32 bits
>
> I cannot find any mention or discussion of this type of error in the mailing 
> list archives, and so I wondered if someone could please explain how to 
> prevent these errors. We have tried reducing the fair share
> component to no avail…. 
>
> PriorityWeightAge=1000
>
> PriorityWeightFairshare=10
>
> PriorityWeightJobSize=1000
>
> PriorityWeightPartition=1000
>
> PriorityWeightQOS=1 # don't use the qos factor
>
> Best regards,
>
> David

You would need to show us a little more information.  The weights are
just that - weights.  If you had, say, a partition with a very large
priority, then multiplying it by 1000 could push the total priority over
the size of a 32-bit integer.

What kinds of values does 'sprio -l' show?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Nodes in state 'down*' despite slurmd running

2017-04-05 Thread Loris Bennett

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> On 04/05/2017 03:59 PM, Loris Bennett wrote:
>
>> We are running 16.05.10-2 with power-saving.  However, we have noticed a
>> problem recently when nodes are woken up in order to start a job.  The
>> node will go from 'idle~' to, say, 'mixed#', but then the job will fail
>> and the node will be put in 'down*'.  We have turned up the log level to
>> 'debug' with the DebugFlag 'Power', but this hasn't produced anything
>> relevant.  The problem is, however, resolved if the node is rebooted.
>>
>> Thus, there seems to be some disturbance of the communication between
>> the slurmd on the woken node and the slurmctd on the administration
>> node.  Does anyone have any idea what might be going on?
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network?  If there are more than about 400 devices 
> in
> the network, you must tune the kernel ARP cache of the slurmctld server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

Thanks for the link, but we have fewer than 120 nodes, so we are a long
way from the 512-device limit.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Nodes in state 'down*' despite slurmd running

2017-04-05 Thread Loris Bennett

Hi,

We are running 16.05.10-2 with power-saving.  However, we have noticed a
problem recently when nodes are woken up in order to start a job.  The
node will go from 'idle~' to, say, 'mixed#', but then the job will fail
and the node will be put in 'down*'.  We have turned up the log level to
'debug' with the DebugFlag 'Power', but this hasn't produced anything
relevant.  The problem is, however, resolved if the node is rebooted.

Thus, there seems to be some disturbance of the communication between
the slurmd on the woken node and the slurmctld on the administration
node.  Does anyone have any idea what might be going on?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: MaxSubmitPU

2017-03-13 Thread Loris Bennett

Hi Danny,

Danny Marc Rotscher <danny.rotsc...@tu-dresden.de> writes:

> Hello,
>
> when I want to add the MaxSubmitPU parameter to one of my qos, it fails with 
> the following error output:
>
> sacctmgr modify qos where name=interactive set MaxSubmitPU=1
>  Unknown option: MaxSubmitPU=1
>  Use keyword 'where' to modify condition
>
> Does anybody have a solution for my problem?
>
> Kind reagrds,
> Danny

Looking at the man page, but not having tried it out, I would guess that
it should be

  sacctmgr modify qos where name=interactive set MaxSubmitJobsPerUser=1

The shortened form 'MaxSubmitPU' is probably just used for display.

HTH

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] error: chdir(/var/log): Permission denied

2017-03-03 Thread Loris Bennett

Hi,

I've updated to 16.05.9 and everything seems to be working fine.
However, when slurmctld is started, in the file

  /var/log/slurmctld

I get the error

  [2017-03-03T10:45:13.096] error: chdir(/var/log): Permission denied

As I say, everything seems to be working, so is this error an, er,
error?

Cheers,

Loris


-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Slurm version 17.02.0 is now available

2017-02-27 Thread Loris Bennett

Danny Auble <d...@schedmd.com> writes:

> After 9 months of development we are pleased to announce the availability of
> Slurm version 17.02.0.
>
> A brief description of what is contained in this release and other notes about
> it is contained below.  For a fuller description please consult the
> RELEASE_NOTES file available in the source.
>
> Thanks to all involved!
>
> Slurm downloads are available from https://schedmd.com/downloads.php.

This link currently (09:50 CET) just returns the following:

  [an error occurred while processing this directive] 

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Power outage causes wrong reports

2017-02-22 Thread Loris Bennett

Hi Lucas,

Lucas Vuotto <l.vuott...@gmail.com> writes:

> Hi all,
> sreport was showing that an user was using more CPU hours per week
> than available. After checking the output of sacct, we found that some
> jobs from an array didn't ended:
>
> $ sacct -j 69204 -o jobid%-14,state%6,start,elapsed,end
>
>  JobID  State   StartElapsed End
>
> -- -- --- -- ---
> 69204_[1-1000] FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
> 69204_1FAILED 2016-11-09T17:46:44 71-20:25:55Unknown
> 69204_2FAILED 2016-11-09T17:46:44 71-20:25:55Unknown
> [...]
> 69204_295  FAILED 2016-11-09T17:46:46 71-20:25:53Unknown
> 69204_296  FAILED 2016-11-09T17:46:46 71-20:25:53Unknown
> 69204_297  FAILED 2016-11-09T17:46:46   00:00:00 2016-11-09T17:46:46
> [...]
> 69204_999  FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
>
> It seems that somehow those jobs got stucked (~72 days after
> 2016-11-09 is today, 2017-01-20, and that's why the wrong reports).
> scancel says that 69204 is an invalid job id.
>
> Any idea on how to fix this? We're thinking about deleting the entries
> of those jobs in the DB. Is it safe to run "arbitrary" commands in the
> DB, bypassing slurmdbd?
>
> Thanks in advance.

The following might also be useful:

https://groups.google.com/d/msg/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ

The code heuristically decides how to deal with inconsistencies in the
database and produces an SQL script to fix them as well as a second
script to roll back the changes.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Permissible updates

2017-02-17 Thread Loris Bennett

Hi,

Some years ago, I had a problem with understanding when updates are
possible:

  https://groups.google.com/forum/#!topic/slurm-devel/CNu9iDbQl7U

As then, the documentation says

  Slurm permits upgrades of up to two major or minor updates
  (e.g. 14.03.x or 14.11.x to 16.08.x) without loss of jobs or other
  state information

I still read this as "two major or *two* minor updates", even though I
know that's not what's meant.  I think it would be clearer to write:

  Slurm permits upgrades between any two versions whose major release
  numbers differ by two or less (e.g. 14.11.x or 15.08.x to 16.05.x)
  without loss of jobs or other state information

I have updated Slurm a few times already and I am a native English
speaker, but I still stumble over the current wording.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Standard suspend/resume scripts?

2017-02-15 Thread Loris Bennett

Lachlan Musicman <data...@gmail.com> writes:

> Re: [slurm-dev] Standard suspend/resume scripts? 
>
> If you are looking to suspend and resume jobs, use scontrol:
>
> scontrol suspend 
> scontrol resume 
>
> https://slurm.schedmd.com/scontrol.html
>
> The docs you are pointing to look more like taking nodes offline in times of 
> low usage?

Yes, because that's what I'm interested in ;-)

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Standard suspend/resume scripts?

2017-02-15 Thread Loris Bennett

Hi,

I was looking around on the web for standard scripts to use for
SuspendProgram and ResumeProgram, but didn't find much other than the
following:

  https://slurm.schedmd.com/power_save.html

Would 'node_shutdown' need to do much more than

  ssh $host shutdown -P now

and 'node_start' more than something like

  ipmitool -H $host chassis power on

?
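
For concreteness, something like the following minimal sketch is what I
have in mind (slurmctld passes the affected nodes to the scripts as a
hostlist expression; ssh and IPMI access details are omitted):

  #!/bin/bash
  # SuspendProgram (node_shutdown): power the given nodes off
  for host in $(scontrol show hostnames "$1"); do
      ssh "$host" shutdown -P now
  done

  #!/bin/bash
  # ResumeProgram (node_start): power the given nodes back on
  for host in $(scontrol show hostnames "$1"); do
      ipmitool -H "$host" chassis power on
  done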

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: sacctmgr case insensitive

2017-02-09 Thread Loris Bennett

Hi Daniel,

Daniel Ruiz Molina <daniel.r...@caos.uab.es> writes:

> Hi,
>
> I'm adding user to accounts in accounting information. However, some users in 
> my
> system have capital letters and when I try to add them to their account,
> sacctmgr returns this message: "There is no uid for user 'MY_USER' Are you 
> sure
> you want to continue?".
> Then, if I click "y", user is added to its accounting but its name has been
> changed to all lower case (I could check with "sacctmgr list user" and 
> "sacctmgr
> list account"), so I suppose there is no relationship between real user (with
> capital letters) and the user "modified" in sacctmgr.
>
> How could I solve this (avoiding, of course, change user names in system)?
>
> Thanks.
>

As it says in the man page for 'sacctmgr':

  user   The login name. Only lowercase usernames are supported.

If you are importing the usernames from another system, you could filter
them in some way.  We import from a central university LDAP server to our
own LDAP server and can thus tweak the attributes or add attributes,
such as 'loginShell'.
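
If the import is scripted, lower-casing the names on the way in is
trivial, e.g. something like this entirely made-up sketch:

  # create local accounts with lower-case login names from an upstream list
  while read -r upstream_name; do
      useradd "$(echo "$upstream_name" | tr '[:upper:]' '[:lower:]')"
  done < upstream_users.txt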

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Setting a partition QOS, etc

2017-02-01 Thread Loris Bennett

Hi David,

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hello,
>
> This is hopefully a very simple set of questions for someone. I’m evaluating
> slurm with a view to replacing our existing torque/moab system, and I’ve been
> reading about defining partitions and QoSs. I like the idea of being able to 
> use
> a QoS to throttle user activity -- for example to set maxcpus/user, 
> maxjobs/user
> and maxnodes/user, etc, etc. Also I’m going to define a very simple set of
> partitions to reflect the different types of nodes in the cluster. For example
>
> Batch – normal compute nodes
>
> Highmem – high memory nodes
>
> Gpu – gpu nodes

We have a similar range of hardware, albeit with three different
categories of memory, but we decided against setting these up as
separate partitions.  The disadvantage is that small memory jobs can
potentially clog up the large memory nodes; the advantage is that small
memory jobs can use the large memory nodes if they would otherwise be
empty.

> So presumably it makes sense to associate the “normal” QOS with the batch 
> queue
> and define throttling limits as needs. Then define corresponding QoSs for the
> highmem and gpu partitions. In this respect do the QOS definitions override 
> any
> definitions on the PartitionName line? For example does QOS Maxwall override
> MaxTime?

The hierarchy of the limits is given here:

https://slurm.schedmd.com/resource_limits.html

However, unless you have specific needs, having limits defined on both
the partitions and QOS might be overkill.  If, as you say later, you
have a heterogeneous job mix, you probably also have a heterogeneous
user base, some of whom might find the setup confusing.  For that
reason, I would start with a fairly simple configuration and only add to
that as the need arises.

> Also I suspect I’ll need to define a test queue with a high level of 
> throttling
> to enable users to get a limited number of small test jobs through the system
> quickly. In this respect does it make sense for my batch and test partitions 
> to
> overlap either partially or completely? At any one time the test partition 
> will
> only take a few resources out of the pool of normal compute nodes?

We originally had a separate test partition, but have now moved to a
'short' QOS on the main batch partition which increases the priority for
a limited number of jobs with a short maximum run-time.  If you have
overlapping batch and test partitions, the batch jobs can clog the test
nodes, although you could have different priorities for each partition.

> Another issue is that we do have a large mix of small and large jobs. In our
> torque/moab cluster we make use of the XFACTOR component to make sure that 
> small
> jobs don’t get starved out of the system. I don’t think there is an analog of
> this parameter in slurm, and so I need to understand how to enable smaller 
> jobs
> to compete with the larger jobs and not get starved out. Using slurm I
> understand that the backfill mechanism and priority flags like
> PriorityFavorSmall=NO and SMALL_RELATIVE_TO_TIME can help the situation. What
> are your thoughts?

We also have a very heterogeneous job mix, but don't have any problem
with small jobs starving.  On the contrary, as we share nodes, small
jobs with moderate memory requirements have an advantage, as there are
always a few cores available somewhere in the cluster, even when it is
quite full.  For this reason we favour large jobs slightly.

> Your advice on the above points would be appreciated, please.
>
> Best regards,
>
> David

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Daytime Interactive jobs

2017-01-29 Thread Loris Bennett

"Vicker, Darby (JSC-EG311)" <darby.vicke...@nasa.gov> writes:

[snip (54 lines)]

> In the end, I just took the debug nodes out of the normal partition.  In other
> words, we have debug nodes 24/7.  This was the simplest thing to do to avoid 
> the
> redefinition of partitions via cron as Gary suggested.  Our cluster has grown
> quite a bit since we first set up this debug standing reservation so having
> dedicated debug nodes isn't as big of a deal for us now.  But if there is an
> elegant way to accomplish the same setup under slurm, I would appreciate 
> knowing
> how to do that.

[snip (27 lines)]

We used to have a dedicated partition with a couple of test/debug nodes.
Now, however, we have moved to a single partition and have defined a QOS
for short run-times which has a much larger priority weight than the
standard QOS.  This allows users to, say, run tests of large MPI jobs.
The total number of jobs a user can have in the test/debug QOS is
limited.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] RE: A little bit help from my slurm-friends

2017-01-16 Thread Loris Bennett

David WALTER <david.wal...@ens.fr> writes:

> Dear Loris,
>
> Thanks for your response !
>
> I'm going to look on this features in slurm.conf.  I only configured
> the CPUs, Sockets per node. Do you have any example or link to
> explain me how it's working and what can I use ?

It's not very complicated.  A feature is just a label, so if you had
some nodes with Intel processors and some with AMD, you could attach the
features, e.g.

NodeName=node[001,002] Procs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=42000 State=unknown Feature=intel
NodeName=node[003,004] Procs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=42000 State=unknown Feature=amd

Users then just request the required CPU type in their batch scripts as
a constraint, e.g:

#SBATCH --constraint="intel"

> My goal is to respond to people needs and launch their jobs as fast as
> possible without losing time when one partition is idle whereas the
> others are fully loaded.

The easiest way to avoid the problem you describe is to just have one
partition.  If you have multiple partitions, the users have to
understand what the differences are so that they can choose sensibly.

> That's why I thought the fair share factor was the best solution

Fairshare won't really help you with the problem that one partition
might be full while another is empty.  It will just affect the ordering
of jobs in the full partition, although the weight of the partition term
in the priority expression can affect the relative attractiveness of the
partitions.

In general, however, I would suggest you start with a simple set-up.
You can always add to it later to address specific issues as they arise.
For instance, you could start with one partition and two QOS: one for
normal jobs and one for test jobs.  The latter could have a higher
priority, but only a short maximum run-time and possibly a low maximum
number of jobs per user.
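
As a sketch of what I mean (the names and limits are invented), the QOS
side would be something like:

  sacctmgr add qos test priority=1000 maxwall=00:30:00 maxsubmitjobsperuser=5
  sacctmgr modify user where name=david set qos+=test

and users would then submit their test jobs with

  sbatch --qos=test job.sh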

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: A little bit help from my slurm-friends

2017-01-16 Thread Loris Bennett

Hello David,

David WALTER <david.wal...@ens.fr> writes:

> Hello everyone,
>
> I need some advice or some good practices as I’m a new SLURM’s administrator… 
> in
> fact a new cluster manager !
>
> Everything is OK, jobs running well etc… But now I would like to configure
> priority on jobs to improve the efficiency of my cluster. I see I have to
> activate “Multifactor Priority plugin” to get rid of the FIFO's default 
> behavior
> of SLURM.
>
> So there are 6 factors and the fair share one is interesting me but do you 
> some
> advices ? I’m managing a small cluster (I think), 40 nodes, with 4 different
> generations (and different hardware) and I would like to optimize it. For now 
> I
> set 4 partitions, 1 per generation that may be not the best solution ?

An alternative would be to have just one partition and to distinguish
the machines via 'features' defined in slurm.conf.  It depends a bit
on how different the machines are and how interested in these
differences the users are.

> Do you think I can just use the “job size” and “partition” and maybe the “age”
> factors ? Maybe you need more information ?

I would have thought that in general you want to use 'fairshare' as
well, but that obviously depends on what you are trying to achieve.

> In any case thanks for your help
>
> David

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: where to find completed job execution command

2017-01-06 Thread Loris Bennett

Sean McGrath <smcg...@tchpc.tcd.ie> writes:

> Hi,
>
> On Thu, Jan 05, 2017 at 02:29:11PM -0800, Prasad, Bhanu wrote:
>
>> Hi,
>> 
>> 
>> Is there a convenient command like `scontrol show job id` to check more info 
>> of jobs that are completed
>
> Not to my knowledge.
>
>> 
>> or any command to check the sbatch command run in that particular job
>
> How we do this is with the slurmctld epilog script:
>
>   EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld
>
> Which does the following:
>
>   /usr/bin/scontrol show job=$SLURM_JOB_ID > 
> $recordsdir/$SLURM_JOBID.record
>
> The `scontrol show jobid=` record is saved to the file system for future
> reference if it is needed.

It might be worth using the option '--oneliner' to print out the record
in a single line.  You could then parse it more easily for, say,
inserting the data into a table in a database.
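
For example (a sketch along the lines of the epilog above):

  /usr/bin/scontrol --oneliner show job=$SLURM_JOB_ID >> $recordsdir/job_records.txt

Each job then ends up as a single line of 'key=value' pairs, which is
easy to split and feed into a database table.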

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Loris Bennett

Loris Bennett <loris.benn...@fu-berlin.de> writes:

> Hi,
>
> Ulf Markwardt <ulf.markwa...@tu-dresden.de> writes:
>
>> Dear all,
>>
>> we are using CR_Core_Memory, granularity of our jobs is cores, so:
>> shared nodes. And all is well, jobs get killed once they use too much
>> memory, cgroups are in place.
>>
>> But.
>> A user wants to have a node explicitely, not caring about number of CPU
>> cores and amount of RAM in that specific node (ranging e.g. from 12
>> cores to 24, and from 32 to 256 GB), but he wants to use ALL resources.
>>
>> At the moment, I see no way to tell this Slurm. - OK, I can ask for 24
>> cores and 64 GB in a node, but then I do not get the chance to run on 12
>> cores/32 GB.
>>
>> Is there already a parameter in Slurm to handle this?
>>
>> Thanks,
>> Ulf
>
> Wouldn't the sbatch option
>
>   --exclusive
>
> help?

D'oh. This obviously isn't what you want.  I somehow overlooked the
point about using all the resources available ("ALL" just wasn't in caps
enough ;-) for me).

However, on a system with shared nodes, I would have thought that if the
jobs can run on only 12 cores, throughput would be generally increased
by always specifying that rather than anything larger.  That way you
reduce wait times for entire nodes with more cores and you usually get
better scaling with, say, two parallel 12-core jobs than with one
24-core job.  Obviously in your specific case, this may not be true.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Unrestricted use of a node

2016-12-05 Thread Loris Bennett

Hi,

Ulf Markwardt <ulf.markwa...@tu-dresden.de> writes:

> Dear all,
>
> we are using CR_Core_Memory, granularity of our jobs is cores, so:
> shared nodes. And all is well, jobs get killed once they use too much
> memory, cgroups are in place.
>
> But.
> A user wants to have a node explicitely, not caring about number of CPU
> cores and amount of RAM in that specific node (ranging e.g. from 12
> cores to 24, and from 32 to 256 GB), but he wants to use ALL resources.
>
> At the moment, I see no way to tell this Slurm. - OK, I can ask for 24
> cores and 64 GB in a node, but then I do not get the chance to run on 12
> cores/32 GB.
>
> Is there already a parameter in Slurm to handle this?
>
> Thanks,
> Ulf

Wouldn't the sbatch option

  --exclusive

help?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Slurm license management question

2016-12-05 Thread Loris Bennett
slurm_licence_available_string = ",".join(licence_strings_available)
scontrol_string = scontrol + \
  ' update reservationname=licenses_' + vendor + \
  ' licenses=' + slurm_licence_available_string

if args.initialise:
    print(slurm_licence_total_string)
    continue

if args.dryrun:
    print(scontrol_string)
    continue

# Actually update the reservation
os.system(scontrol_string)

# Strings used for testing
#
#string = 'Users of MATLAB_Distrib_Comp_Engine:  (Total of 16 licenses issued;  Total of 0 licenses in use)'
#string = 'Users of Wavelet_Toolbox:  (Error: 2 licenses, unsupported by licensed server)'

---


-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-27 Thread Loris Bennett

Tuo Chen Peng <tp...@nvidia.com> writes:

> I thought ‘scontrol update’ command is for letting slurmctld to pick up any
> change in slurm.conf.
>
> But after reading the manual again, it seems this command is instead to change
> the setting at runtime, instead of reading any change from slurm.conf.
>
> So is restarting slurmctld the only way to let it pick up changes in 
> slurm.conf?

No.  You can also do

  scontrol reconfigure

This does not restart slurmctld.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Slurm license management question

2016-10-27 Thread Loris Bennett

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hello,
>
> Looking at the Slurm documentation I see that it is possible to handle basic
> license management (this is the link http://slurm.schedmd.com/licenses.html). 
> In
> other words software licenses can be treated as a resource, however things
> appear to be fairly rudimentary at the moment – at least that’s my impression.
> We are used to doing license management in moab, and if we don’t have that
> properly implemented is it not the end of the world, however not ideal.
>
> One situation that we would like to be able to deal with is a FlexLM 3 server
> redundancy situation. So, for example, our Comsol licenses are served out in
> this fashion. Is this something that slurm can deal with, and, if so, how can 
> it
> be done? Any advice including slurm’s short comings and/or future plans in 
> this
> respect would be useful, please.
>
> Best regards,
>
> David

We have licenses, such as Intel compiler licenses, which can be used
both interactively outside the queuing system and within Slurm jobs.

We use a script which parses the output of the FlexLM manager and
modifies a reservation in which the licenses are defined.  This is run
as a cron job once a minute.  It's a bit of a kludge and obviously won't
work well if there is a lot of contention for licenses.

I can post the code if anyone is interested.
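
In outline it does little more than the following (a sketch, not the
actual script; the feature name and license server address are invented
and the awk field positions are based on the 'Users of ...' lines that
lmstat prints):

  # run from cron once a minute
  free=$(lmutil lmstat -a -c 1718@licserver | \
 awk '/Users of comsol:/ {print $6 - $11}')
  scontrol update reservationname=licenses_comsol licenses=comsol:$free

The real script loops over several features, copes with error lines
from the license server and can print the scontrol command instead of
running it.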

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] scontrol: update multiple jobs?

2016-09-27 Thread Loris Bennett

Hi,

The update jobs section of the manpage for scontrol 15.08.8 says

  JobId=
 Identify the job(s) to be updated.  The job_list may be a comma
 separated list of job IDs.

However, trying this, I get the following error:

  $ scontrol update jobid=1135541,1135542 timelimit=+1:00:00
  scontrol: error: Invalid job ID 1135541,1135542

Is this a documentation error?  Does the syntax work for more recent
versions?

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] Re: Jobs which started and completed within an interval

2016-07-25 Thread Loris Bennett

"Loris Bennett" <loris.benn...@fu-berlin.de>
writes:

> Hi,
>
> Is it possible to find jobs which both started and completed in a given
> interval?
>
> I am investigating an incident, during which an abnormally high load
> occurred on one of our storage servers.  To this end I would like to
know whether the beginning and end of any jobs correspond to the
> beginning and end of the high-load period.
>
> I can do something like
>
>   sacct -S 2016-07-13T22:20 -E 2016-07-14T06:20 -s RUNNING -X | grep COMPLETED
>
> to get jobs which were running in the period and subsequently completed,
> but this includes jobs which were running both before and after the
> period in question.

As this specific question didn't elicit any responses, I would be
interested in answers to these more general ones:

  Do you try to relate events within your system to specific, possibly
  misbehaving jobs?  If so, how?  If not, why not?
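
For what it's worth, the closest I have got for my original question is
to post-filter the sacct output so that only jobs whose start *and* end
both lie within the window survive, e.g. (a sketch):

  sacct -S 2016-07-13T22:20 -E 2016-07-14T06:20 -X -n -P \
-o jobid,start,end,state,nodelist | \
awk -F'|' '$2 >= "2016-07-13T22:20" && $3 <= "2016-07-14T06:20"'

The string comparison works because the timestamps are in ISO format.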

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Loris Bennett

Husen R <hus...@gmail.com> writes:

> Re: [slurm-dev] Re: number of processes in slurm job 
>
> Hi,
>
> Thanks for your reply !
>
> I use this sbatch script 
>
> #!/bin/bash
> #SBATCH -J mm6kn2_03
> #SBATCH -o 6kn203-%j.out
> #SBATCH -A necis
> #SBATCH -N 3
> #SBATCH -n 16
> #SBATCH --time=05:30:00
>
> mpirun ./mm.o 6000

You need to tell 'mpirun' how many processes to start.  If you do not,
probably all cores available will be used.  So it looks like you have 6
cores per node and thus 'mpirun' starts 18 processes.  You should write
something like

  mpirun -np ${SLURM_NTASKS} ./mm.o 6000

Cheers,

Loris

> regards,
>
> Husen
>
> On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
> <loris.benn...@fu-berlin.de> wrote:
>
> Husen R <hus...@gmail.com> writes:
> 
> > number of processes in slurm job
> 
> 
> >
> > Hi all,
> >
> > I tried to run a job on 3 nodes (N=3) with 16 number of processes
> > (n=16) but slurm automatically changes that n value to 18 (n=18).
> >
> > I also tried to use other combination of n values that are not equally
> > devided by N but Slurm automatically changes those n values to values
> > that are equally devided by N.
> >
> > How to change this behavior ?
> > I need to use a specific value of n for experimental purpose.
> >
> > Thank you in advance.
> >
> > Regards,
>     >
> > Husen
> 
> 
> You need to give more details about what you did. How did you set the
> number of processes?
> 
> Cheers,
> 
> Loris
> 
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email
> loris.benn...@fu-berlin.de
> 
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Loris Bennett

Husen R <hus...@gmail.com> writes:

> number of processes in slurm job 
>
> Hi all,
>
> I tried to run a job on 3 nodes (N=3) with 16 number of processes
> (n=16) but slurm automatically changes that n value to 18 (n=18).
>
> I also tried to use other combination of n values that are not equally
> devided by N but Slurm automatically changes those n values to values
> that are equally devided by N.
>
> How to change this behavior ?
> I need to use a specific value of n for experimental purpose.
>
> Thank you in advance.
>
> Regards,
>
> Husen

You need to give more details about what you did.  How did you set the
number of processes?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Output of 'sinfo -Nel' not aggregated

2016-07-01 Thread Loris Bennett

Hi Chris,

Christopher Samuel <sam...@unimelb.edu.au>
writes:

> On 30/06/16 17:37, Loris Bennett wrote:
>
>> With version slurm 15.08.8, the node-oriented output of 'sinfo' is not
>> longer aggregated.  Instead I get a line for each node, even if the data
>> for multiple nodes are the same,
>
> I think it's a deliberate change, and the it's the manual page
> that is out of step. Looks like it changed around 15.08.4.
>
> $ git describe eafec3c0c0cb977361a5b10388d5469136e1ef38
> slurm-15-08-3-1-94-geafec3c
>
> commit eafec3c0c0cb977361a5b10388d5469136e1ef38
> Author: Morris Jette <je...@schedmd.com>
> Date:   Mon Nov 23 15:48:15 2015 -0800
>
> sinfo: Print each node one separate line with -N option
>
> diff --git a/src/sinfo/sinfo.c b/src/sinfo/sinfo.c
> index 2613629..78f2857 100644
> --- a/src/sinfo/sinfo.c
> +++ b/src/sinfo/sinfo.c
> @@ -736,6 +736,9 @@ static bool _match_node_data(sinfo_data_t *sinfo_ptr, 
> node_info_t *node_ptr)
>  {
> uint32_t tmp = 0;
>
> +   if (params.node_flag)
> +   return false;
> +
> if (params.match_flags.hostnames_flag &&
> (hostlist_find(sinfo_ptr->hostnames,
>node_ptr->node_hostname) == -1))

Thanks for looking into the issue.  I'm sure there are good reasons to
want one line per node, but equally I thought the aggregated view was
quite useful even though I've only got just over 100 nodes.  Surely
those with many thousands of nodes would like the option of having a
more compact view.  Or do they obtain similar information in a
completely different way?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Output of 'sinfo -Nel' not aggregated

2016-06-30 Thread Loris Bennett

Hi,

With Slurm version 15.08.8, the node-oriented output of 'sinfo' is no
longer aggregated.  Instead I get a line for each node, even if the data
for multiple nodes are the same, e.g.

$ sinfo -Nel
Thu Jun 30 09:28:43 2016
NODELIST   NODES PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT  FEATURES REASON
gpu01          1       gpu    down   12 2:6:1  18000        0      1    (null) Not responding
node001        1      test   idle~   12 2:6:1  42000        0      1   ram48gb none
node002        1      test   idle~   12 2:6:1  42000        0      1   ram48gb none
node003        1     main*   mixed   12 2:6:1  18000        0      1   ram24gb none
node004        1     main*   mixed   12 2:6:1  18000        0      1   ram24gb none
node005        1     main*   mixed   12 2:6:1  18000        0      1   ram24gb none
node006        1     main*   mixed   12 2:6:1  18000        0      1   ram24gb none

As I remember and as the man page indicates, this should be

NODELIST      NODES PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT  FEATURES REASON
gpu01             1       gpu    down   12 2:6:1  18000        0      1    (null) Not responding
node[001-002]     1      test   idle~   12 2:6:1  42000        0      1   ram48gb none
node[003-006]     1     main*   mixed   12 2:6:1  18000        0      1   ram24gb none

Is this a bug and, if so, has it already been fixed?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: License manager and interactively used licenses

2016-06-28 Thread Loris Bennett

"Loris Bennett" <loris.benn...@fu-berlin.de>
writes:

> "Loris Bennett" <loris.benn...@fu-berlin.de>
> writes:
>
>> Hi Roshan,
>>
>> Yes, you're right - this will work for us.  So the update tweaks the
>> number of licences available and presumably extends the reservation by
>> another 30 sec, so that you have essentially an infinite reservation
>> holding, at any given time, the currently available number of
>> licenses. Clever.
>>
>> Thanks again,
>>
>> Loris
>
> [snip (103 lines)]
>
> I have run into a problem setting up the initial reservation.
>
> How do I set it up just for licenses such that any user with any account
> can use it?  It seems that either 'Users' or 'Accounts' must be
> specified.

Never mind, I figured it out.  Not specifying 'Users' and specifying
'Accounts=root' works.
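
For the archives, the initial reservation then looks something like
this (the license name, count and duration are just examples; I believe
the LICENSE_ONLY flag, or ANY_NODES in newer releases, is needed so
that no nodes have to be specified):

  scontrol create reservation reservationname=licenses_matlab \
accounts=root licenses=MATLAB_Distrib_Comp_Engine:16 \
starttime=now duration=5 flags=license_only

after which the cron job keeps updating the license count and end time.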

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Timeout before resource becomes available

2016-06-13 Thread Loris Bennett

Hi,

One of our users was carrying out some tests and running some very short
jobs with a TimeLimit of 60s.  However, because one of the nodes had to
be booted, which takes a couple of minutes, the jobs were terminated
with TIMEOUT as the state.

I am aware that we can set BatchStartTimeout to a larger value, but
wouldn't it make more sense if the run-time for the job only started to
accumulate, once the slurmd on the node became available?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: How to get rid of "zombie" jobs?

2016-06-06 Thread Loris Bennett

Hello Steffen,

Steffen Grunewald
<steffen.grunew...@aei.mpg.de> writes:

> Hello all,
>
> I've got a rather newly setup cluster, which at the moment is completely idle
> ("squeue" doesn't return anything.)
>
> From the testing phases, a couple of now unused accounts and associations are
> left, which I'd like to get rid of:
>
> [root@login ~]# sacctmgr show assoc
>ClusterAccount   User  Partition Share GrpJobs   GrpTRES 
> GrpSubmit GrpWall   GrpTRESMins MaxJobs   MaxTRES MaxTRESPerNode 
> MaxSubmit MaxWall   MaxTRESMins  QOS   Def QOS 
> GrpTRESRunMin 
> -- -- -- -- - --- - 
> - --- - --- - -- 
> - --- -  - 
> - 
> [...]
>clusterdefault   1 
>   
>  normal   
>clusterdefaulttom1 
>   
>  normal   
> [...]
> [root@login ~]# sacctmgr delete user name=tom account=default
>  Error with request: Job(s) active, cancel job(s) before remove
>   JobID = 15498  C = clusterA = defaultU = tom  
>   JobID = 15500  C = clusterA = defaultU = tom  
>   JobID = 15501  C = clusterA = defaultU = tom  
>   JobID = 15502  C = clusterA = defaultU = tom  
>   JobID = 15503  C = clusterA = defaultU = tom  
>   JobID = 15504  C = clusterA = defaultU = tom  
>   JobID = 15505  C = clusterA = defaultU = tom  
>   JobID = 15506  C = clusterA = defaultU = tom  
>   JobID = 15508  C = clusterA = defaultU = tom  
>   JobID = 15509  C = clusterA = defaultU = tom  
> [root@login ~]# scontrol show jobid -dd 15500
> slurm_load_jobs error: Invalid job id specified
> [root@login ~]# sacct -j 15500
>JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode 
>  -- -- -- -- --  
> 15500intel-test  partitiondefault 48RUNNING  0:0 
>
>
> Is there a "gold standard" way to repair this?

I don't think there is a "gold standard" for this.  You probably just
have to go into the database and fix it yourself.

A while ago I posted some code to fix anomalous jobs.  It was intended
to make the data plausible (e.g. by adding a missing completion date for
a job with status "RUNNING" which no longer exists), and not for
deleting jobs completely, but it might help:

https://groups.google.com/forum/#!msg/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ
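
If you do end up poking the database directly, have a look at what is
actually in the job table first (and take a backup).  Hypothetically,
and assuming the default schema naming, where the job table is called
<clustername>_job_table and running jobs have a numeric state of 1 and
a time_end of 0, that would be something like:

  mysql slurm_acct_db -e 'SELECT id_job, state, time_start, time_end
    FROM cluster_job_table WHERE state = 1 AND time_end = 0'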

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Incorrect handling of non-ASCII characters

2016-06-03 Thread Loris Bennett

Hi Gary,

Gary Brown <gbr...@adaptivecomputing.com>
writes:

> Re: [slurm-dev] Re: Incorrect handling of non-ASCII characters 
>
> Another option is to use the alternative spelling of Schroedinger, which is
> perfectly acceptable German.


Personally I don't think it is acceptable - and I think that here in
Germany in most contexts it would be considered strange to replace
umlauts.  It might be OK in the area of HPC, but only because
expectations of user-friendliness are quite low.

In my view we are decades beyond the point where restricting characters
to those available in the 7-bit ASCII set is acceptable.  Just imagine
Italian had become the dominant language in the USA instead of English -
Slurm might think your name is "Gari Brovvn".

In fact, I would prefer incorrect justification with umlauts to correct
justification without umlauts.


[snip (57 lines)]

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Incorrect handling of non-ASCII characters

2016-06-01 Thread Loris Bennett

Hi,

With Slurm 15.08.8, sreport does not handle non-ASCII characters in the
'Proper Name' column properly:

Top 3 Users 2016-05-31T00:00:00 - 2016-05-31T23:59:59 (86400 secs)
Use reported in TRES Minutes

  Cluster Login Proper Name Account Used   Energy 
- - --- ---   
  sorobanalbertEinstein physics   2304000 
  soroban erwinSchrödinger physics   2005690 
  sorobanwerner  Heisenberg physics   1396800 


The presence of the umlaut in 'Schrödinger' causes the name to be
justified incorrectly.  In addition, all the lines from the line with
the column names to the final line of data have an additional space at
the end of the line.

The terminal space is not much of a problem, but it would be nice if
the justification problem could be fixed.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: More tasks than allocated CPUs

2016-05-24 Thread Loris Bennett

Hi Yaron,

Yaron Weitz <yar...@mail.huji.ac.il> writes:

> slurm-dev
>
> Hi,
>
> I'm new to Slurm. I have been working with it for the past 7
> months.  We sometimes have a situation where a job generates more
> tasks than the number of CPUs allocated to it.
> I don't know whether the cause is the code of the running job or
> something to do with the use or configuration of Slurm.  We have a
> cluster of Ubuntu 14.04 servers and slurm-llnl version 2.6.5-1 from
> the Ubuntu repos.
>
> Thanks,
> Yaron

On our system this is usually a result of user error.  Particularly if
people don't use the environment variable ${SLURM_NTASKS} in their batch
scripts, they may end up requesting a number of cores, but passing a
different number to their MPI launcher for the number of processes to
start.
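
As a sketch, the idea is something like this (the task count and the
program name are just placeholders):

  #!/bin/bash
  #SBATCH --ntasks=48

  # use the number of tasks Slurm actually allocated, rather than
  # hard-coding a second number which can get out of step with it
  mpirun -np ${SLURM_NTASKS} ./my_mpi_program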

However, your version of Slurm is quite old, so it is conceivable that
you are being bitten by a probably long-fixed bug.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] BadConstraints - node list not recalculated

2016-05-24 Thread Loris Bennett

Hi,

The 'Reason' field for a pending job has changed from 'Priority' to
'BadConstraints'.  This seems to be because the status of one of the
nodes in the node list reported by 'scontrol show job' has changed to
'draining'.  The job itself just specifies the number of tasks required,
not specific nodes.

Shouldn't the scheduler just be able to replace the draining node with
another node in the projected node list?  This is happening with version
15.08.8.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Loris Bennett

Benjamin Redling
<benjamin.ra...@uni-jena.de> writes:

> On 05/17/2016 10:02, Loris Bennett wrote:
>> 
>> Benjamin Redling
>> <benjamin.ra...@uni-jena.de> writes:
>> 
>>> On 2016-05-13 05:58, Husen R wrote:
>>>> Does slurm provide a feature to get the command that is being
>>>> executed/will be executed by running/pending jobs?
>>>
>>> scontrol show --detail job <jobid>
>>> or
>>> scontrol show -d job <jobid>
>>>
>>> Benjamin
>> 
>> Which version does this? 15.08.8 just seems to show the 'Command' entry,
>> which is the file containing the actual command.
>
> An older one. I see. I made the mistake before (squeue -n ...):
> I assumed slurm commands/parameters don't change (all over the board).
>
> (Will I ever be able to depend on _any_ script I write, or any parameter
> I thought I knew?
> What other surprises will await me after an upgrade? And where are these
> major changes documented?
> Who thinks changing parameter semantics is a good idea?)
>
> This is not something where Slurm shines.

I haven't really been bitten by such changes.  My main gripe with the
Slurm tools is the inconsistency of the interfaces, e.g. output columns:

  squeue -o " %.18i"
  sacct -o jobid%18

or selection according to nodes

  squeue -w node001
  sacct -N node001

This is obviously not a real problem, but it is a daily annoyance.  So
in that sense, I do think that changing the parameter semantics would be
a good idea, but only once and only if the options become harmonised
across all the tools!

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Loris Bennett

Benjamin Redling
<benjamin.ra...@uni-jena.de> writes:

> On 2016-05-13 05:58, Husen R wrote:
>> Does slurm provide a feature to get the command that is being
>> executed/will be executed by running/pending jobs?
>
> scontrol show --detail job <jobid>
> or
> scontrol show -d job <jobid>
>
> Benjamin

Which version does this? 15.08.8 just seems to show the 'Command' entry,
which is the file containing the actual command.
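
One can at least pull out the script file via that entry and look at
it, e.g. something along these lines (the job ID is made up):

  # print the path of the batch script for the job ...
  scontrol show job 12345 | grep Command
  # ... or display its contents directly
  cat "$(scontrol show job 12345 | awk -F= '/Command/ {print $2}')"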

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: General question - Where can I find Slurm docs in German?

2016-04-20 Thread Loris Bennett

Hi Brian,

Brian Gilmer <bfgil...@gmail.com> writes:

> General question - Where can I find Slurm docs in German? 

I don't think there is anything official, but I provide information in
both English and German for the system I am involved in running:

https://www.zedat.fu-berlin.de/HPC/SorobanQueueingSystem

The documentation is not very extensive, although it is extended
occasionally, and it is somewhat specific to our site, but it may be of
help to you.  Any mistakes in both the English and the German versions
are probably mine.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] squeue shows job running on node in state 'idle~'

2016-04-08 Thread Loris Bennett

Hi,

I have a job shown as running by 'squeue':

$ squeue -w node086
 JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
   1234567  main   abcdef user1234  R 10-09:32:34  1 node086

However with 'sinfo' I can see that the node has been powered off:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test up3:00:00  2  idle~ node[001-002]
main*up 14-00:00:0  1  idle~ node086
...

This is the second time I have seen this phenomenon since updating to
version 15.08.8 a month ago.

Is this a bug, or can this just happen if a job crashes in an odd
enough way?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Fair share priority stopped working

2016-04-04 Thread Loris Bennett

Hi Nirmal,

Nirmal Seenu <n19...@gmail.com> writes:

> Fair share priority stopped working 
>
> Hi,
>
> I just noticed that the fair share priority stopped working in the last few 
> days
> and would appreciate any help in debugging this problem. I am running Slurm
> version 14.11.11 on Centos 7.2. 
>
> I am not sure when it stopped working but the only thing that I changed was
> PriorityDecayHalfLife=00:10:00 and PriorityUsageResetPeriod=WEEKLY. The
> following is the current values that I have set -- the initial value when fair
> share was working fine:
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=00:01:00

I would think that this value for PriorityDecayHalfLife is much too
short.  The CPU-time usage will decay very rapidly, so the contribution
to the priority will be similar for heavy users and for those who don't
consume much CPU-time.  I would guess you want a value more like a
single-digit number of days.
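
In other words, something more along these lines (the value is purely
illustrative):

  PriorityDecayHalfLife=7-0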

Cheers,

Loris

> PriorityUsageResetPeriod=NONE
> PriorityWeightFairshare=1
> PriorityWeightAge=100
> PriorityWeightPartition=1
> PriorityWeightJobSize=1
> PriorityMaxAge=7-0
>
> Everything seems to be fine on the database side:
>
> [root@tcs-bcm-1 ~]# sacctmgr list assoc tree
> format=cluster,account,user,fairshare
> Cluster     Account   User   Share
> ----------  --------  -----  -----
> slurm_clu+  root                 1
> slurm_clu+  root      root       1
> slurm_clu+  dev                 50
> slurm_clu+  dev       c          1
> slurm_clu+  r                    1
> slurm_clu+  r         a2         1
> slurm_clu+  r         a1         1
> slurm_clu+  r         b          1
> slurm_clu+  r         d          1
> slurm_clu+  r         e          1
> slurm_clu+  r         j2         1
> slurm_clu+  r         j1         1
> slurm_clu+  r         m4         1
> slurm_clu+  r         m3         1
> slurm_clu+  r         m2         1
> slurm_clu+  r         m1         1
> slurm_clu+  r         r          1
> slurm_clu+  r         s          1
> slurm_clu+  r         t          1
>
> [root@tcs-bcm-1 ~]# sprio -l | head
>   JOBID  USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
> 1378456  j1         385   10          0      276        100    0     0
> 1378457  j1         385   10          0      276        100    0     0
> 1378458  j1         385   10          0      276        100    0     0
>
> Relevant log entry when I restarted both slurmdbd and slurm:
> /var/log/slurmctld:
> [2016-03-22T17:47:13.533] Running as primary controller
> [2016-03-22T17:47:13.533] Registering slurmctld at port 6817 with slurmdbd.
> [2016-03-22T17:47:17.817]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
>
> /var/log/slurmdbd:
> [2016-03-22T17:46:53.733] Accounting storage MYSQL plugin loaded
> [2016-03-22T17:46:53.735] error: chdir(/var/log): Permission denied
> [2016-03-22T17:46:53.735] chdir to /var/tmp
> [2016-03-22T17:46:53.744] slurmdbd version 14.11.11 started
> [2016-03-22T17:46:57.010] DBD_JOB_START: cluster not registered
> [2016-03-22T17:47:01.910] DBD_STEP_START: cluster not registered
>
> Thanks in advance for your help!
> Nirmal
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: What cluster provisioning system do you use?

2016-03-15 Thread Loris Bennett

Hi Bjørn-Helge,

Bjørn-Helge Mevik <b.h.me...@usit.uio.no>
writes:

> I apologize for the slightly off-topic subject, but I could not think of
> a better forum to ask.  If you know of a more proper place to ask this,
> I'd be happy to know about it.
>
> We are currently in the design phase for a new cluster that is going to
> be set up next year.  We have so far used Rocks (on top of CentOS) for
> cluster provisioning.  However, Rocks doesn't support CentOS >= 7, and it
> doesn't look like it will in the near future.  Also for other reasons,
> we are looking for alternatives to Rocks.
>
> So, what are you using for cluster provisioning?
>
> - Rocks?
> - A different provisioning tool?
> - A locally developed solution?

We currently use Bright Cluster Manager, but are looking to move away
from this due to cost, lack of an update path from our current set-up,
and the fact that the integration with Slurm locked us to version 2.2.7
for a long time until we decided to do without the integration and
installed an up-to-date version.

I am currently setting up a test cluster and shall be looking at

- Warewulf
- DRBL
- maybe xCat

I would also be interested in other options.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Exporting environment variables by default?

2016-03-11 Thread Loris Bennett

Hi,

I'm using 15.08.8.  Am I correct in thinking that environment variables
which are to be evaluated by a job must be passed via sbatch's option
'--export' and that it is not possible to define variables centrally
within the Slurm configuration?
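
In other words, it seems one has to do something like the following for
each job (the variable name is just an example):

  sbatch --export=ALL,MY_VAR=some_value job.sh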

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: User education tools for fair share

2016-03-01 Thread Loris Bennett

Hi Chris,

Christopher Samuel <sam...@unimelb.edu.au>
writes:

> Hi folks,
>
> We've just migrated to fairshare and one of the things we've been
> puzzling over is how to show users what their fairshare status is.
>
> With quotas it was pretty easy, we had a bar-graph showing how far
> through the quarter they were, and another bar-graph per project that
> showed the percentage of quota burnt so far this quarter.
>
> After 6 years of running like that it's hurting our heads to think
> differently about how to display it.
>
> It's also complicated as we are using Fair Tree (thanks Ryan et. al!)
> and so we think we should show users their priorities back up the tree.
>
> I'm even wondering if we should not worry about showing them that and
> instead just educate them about the priority of queued jobs instead.
>
> How do other sites handle this?
>
> All the best,
> Chris

We use fairshare without Fair Tree and with all users having the same
number of shares.  Occasionally we have users complaining about the
system being unfair, particularly when other users are able to profit
from backfill.  The problem is that users often just look at the number
of jobs someone is able to run, regardless of the resources being used.

To help the user understand their current fairshare/priority status, I
usually point them to 'sprio', generally in the following incantation:

sprio -l | sort -nk3

to get the jobs sorted by priority.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Fw:Slurm question for help

2016-02-25 Thread Loris Bennett

温圣召 <wenshengz...@yeah.net> writes:

> Fw:Slurm question for help 
>
> Dear Sir/Madam:
>
> I'm using slurm to build a small cluster. My munge, slurmctld, slurmdbd
> and slurmd all run as root.
> I submit jobs with srun using the --uid= option.
>
> r...@yq01-sys-hic-k4007.yq01.baidu.com matrixMulCUBLAS]# srun
> --comment=wsz_111 --account="testAccount" -N1 --chdir=/home/wenshengzhao/
> --uid=wenshengzhao ./testbatch
> --- this works
> ==
> wenshengz...@yq01-sys-hic-k4007.yq01.baidu.com matrixMulCUBLAS]$ srun
> --comment=wsz_111 --account="testAccount" -N1 --chdir=/home/wenshengzhao/
> --uid=root ./testbatch
> -- this does not work; the error is: srun: error: Unable to allocate
> resources: Invalid user id
> ==
> t...@yq01-sys-hic-k4007.yq01.baidu.com root]$ srun --comment=wsz_111
> --account="testAccount" -N2 --chdir=/home/wenshengzhao/ --uid=wenshengzhao
> ./testbatch
> -- this does not work; the error is: srun: error: Unable to allocate
> resources: Invalid user id
>
> How can I solve this problem?
>
> I am looking forward to your reply

Have you added the user 'wenshengzhao' to the accounting information?

If not, have a look at the "Database Configuration" section on the
following page

http://slurm.schedmd.com/accounting.html
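
Roughly speaking, adding the user boils down to something like this (a
sketch only - adjust the account name to your setup, and create the
account first if it does not exist yet):

  sacctmgr add account testAccount
  sacctmgr add user wenshengzhao account=testAccount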

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] squeue: Collapsing running array jobs?

2016-02-25 Thread Loris Bennett

Hi,

I'm using Slurm 15.08.4 and in the man page for 'squeue' it says

  -r, --array
 Display one job array element per line.  Without this
 option, the display will be optimized for use with job
 arrays (pending job array elements will be combined on one
 line of output with the array index values printed using a
 regular expression).

Is there any way of having *running* job array elements collapsed to a
single line per job?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Accounting

2016-02-16 Thread Loris Bennett

Hi Jeff,

Jeff White <jeff.wh...@wsu.edu> writes:

> I'm working on getting accounting set up on a new SLURM instance. The
> cluster is working, slurmdbd is running, database is configured, sacct
> spits out some job info, all appears to be working.  Good, I built a
> thing and it seems to work.  Now the hard part: what do I do with it?
>
> * What exactly is an "account" in SLURM speak?  We have well-defined
> groups already and I don't want my users to need to specify an account
> or anything of the such with their jobs.  What do I need to do (if
> anything) to have accounting use purely users and groups and no
> manually-defined "accounts"?

My understanding is that it is a collection of resource restrictions.
If you have well-defined groups, then an account will correspond to a
group.  The account model is, however, more general, because, say, one
person could run jobs in various projects which all have different
CPU-time budgets and/or priorities.

However, I also just have research groups and they correspond 1-to-1
with my accounts.  The accounts are arranged in a hierarchy (via the
parent organisation property) which corresponds to the organigram of the
university institutes and departments.

If you are using fairshare, you then need to set the shares per entity
in the organigram.  As all our users are created equal, this means
adding a user to a group, incrementing the shares of the group,
incrementing the shares of the institute, and incrementing the shares of
the department.  When a user leaves the group, this obviously all has to
be done in reverse.  Because this is a bit of a chore and quite error
prone, we use a wrapper around sacctmgr to automate this which is
integrated into our user-lifecycle-management mechanism.
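
Stripped of the error checking, what the wrapper does boils down to
something like this (the names and share values are made up):

  # add the new user with one share under the group's account
  sacctmgr add user alice account=group_a fairshare=1

  # then bump the shares one level at a time up the hierarchy
  sacctmgr modify account name=group_a set fairshare=5
  sacctmgr modify account name=institute_x set fairshare=12
  sacctmgr modify account name=department_y set fairshare=40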

> * The whole JobComp explanation in the documentation isn't clear to
> me.  What does accounting to slurmdbd /not/ provide that setting
> JobComp to log elsewhere would?  Why can't slurmdbd be used for
> everything?

It can.

> Here's some parts of the config, let me know if you want more:
>
> # grep AccountingStorage /etc/slurm/slurm.conf
> #AccountingStorageEnforce=0
> AccountingStorageHost=slurm-p1n01.mgmt.kamiak.example.edu
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> #AccountingStorageUser=
>
> # grep JobCompType /etc/slurm/slurm.conf
> #JobCompType=jobcomp/slurmdbd

If you are using

AccountingStorageType=accounting_storage/slurmdbd

my understanding is that you don't need to set JobComp, as this provides
only a subset of the data you get from accounting storage.

HTH

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Backfill parameters

2016-02-16 Thread Loris Bennett

Hi Ulf,

Ulf Markwardt <ulf.markwa...@tu-dresden.de>
writes:

> Dear all,
>
> I have a problem with a large reservation in a few hours, ~1700
> long-running jobs waiting to start afterwards and my short job (srun -t
> 1 hostname) with priority of 1 that would fill any gap...
>
>
> "sdiag" always shows a value of about 100 as "Last depth cycle" for
> backfilling. Does that mean that it only looks at the first 100 jobs?
> I thought, bf_continue should take care of this, so that the next
> backfilling test starts where the last has finished.
>
> At the moment we have 15.08.6 running with:
> SchedulerParameters=bf_interval=30,bf_max_job_test=2000,bf_window=7200,default_queue_depth=5000,bf_continue,sched_interval=120,defer
> (Some values might be too high for production, but I was desperate to get
> my job running...)

Is your bf_window at least as large as the timelimit on the partition in
question?  If not, see the info about bf_window on the slurm.conf
manpage.
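
For example, if the longest time limit on the partition were 14 days,
then bf_window, which is specified in minutes, would need to be at
least 14*24*60 = 20160, i.e. something like

  SchedulerParameters=bf_interval=30,bf_max_job_test=2000,bf_window=20160,default_queue_depth=5000,bf_continue,sched_interval=120,defer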

> Can anybody give me a hint on how to change this so that my low priority
> job gets scheduled?
>
> Thanks a lot,
> Ulf
>
> PS. As soon as I give this job a Nice=-200 it starts, but that is not
> the way I want it :-)

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Question mark in MaxRSS

2016-01-29 Thread Loris Bennett

Hi,

With version 15.08.4, 'sacct' gives me values of MaxRSS which contain
'16?': 

$ sacct -o jobid,maxrss,state -S 2015-01-29T09:00 -E 2015-01-29T10:00 -s CD
   JobID MaxRSS  State 
 -- -- 
612354  16?  COMPLETED 
612354.batch  13722480K  COMPLETED 
613334  16?  COMPLETED 
613334.batch179580K  COMPLETED 
613337  16?  COMPLETED 
613337.batch  8776K  COMPLETED 
613337.0   3772344K CANCELLED+ 

This also applies to jobs run under older versions of Slurm.  As far as
I recall the fields used to be empty, as they are when the option '-X'
is given:

$ sacct -o jobid,maxrss,state -S 2015-01-29T09:00 -E 2015-01-29T10:00 -s CD -X
   JobID MaxRSS  State 
 -- -- 
612354   COMPLETED 
613334   COMPLETED 
613337   COMPLETED 

Is this a bug in 'sacct' or do I have a local issue?

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: sreport/sacct: discrepancy between utilization and CPUTime

2016-01-22 Thread Loris Bennett

Hi Carlos,

Carlos Fenoy <mini...@gmail.com> writes:

> Re: [slurm-dev] sreport/sacct: discrepancy between utilization and CPUTime 
>
> Hi Loris,
>
> Can you check when the job actually started and ended? It may be that the
> job spans 2 days, and that is the reason that sreport is reporting less
> time.

Yes, you are correct.  If I choose a start time before the jobs started,
I get the same result with sreport as I do with sacct:

$ sreport cluster UserUtilizationByAccount user=bsp start=2016-01-14T21:00:00 
end=2016-01-15T04:30:00 -t hours

Cluster/User/Account Utilization 2016-01-14T21:00:00 - 2016-01-15T04:59:59 
(28800 secs)
Use reported in TRES Hours

  Cluster Login Proper Name Account   Used Energy 
- - --- --- -- -- 
  soroban   bspBeispiel   agexample  6  0 

Thanks for the hint,

Loris

> Regards,
> Carlos
>
> On Fri, Jan 22, 2016 at 9:31 AM, Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> Hi,
> 
> Using version 15.08.4 I am looking at the value 'Used' from sreport and
> comparing this with the corresponding 'CPUTime' from sacct:
> 
> $ sreport cluster UserUtilizationByAccount user=bsp start=2016-01-15 -t
> hours
> 
> 
>
> Cluster/User/Account Utilization 2016-01-15T00:00:00 - 2016-01-21T23:59:59
> (604800 secs)
> Use reported in TRES Seconds
> 
> 
>
> Cluster Login Proper Name Account Used Energy
> - - --- --- -- --
> soroban bsp Beispiel agexample 4 0
> 
> $ sacct -S 2016-01-15 -u bsp -o jobid,cputime,state
> JobID CPUTime State
>  -- --
> 954088 05:49:45 COMPLETED
> 954088.batch 05:49:45 COMPLETED
> 
> Rounding aside, why is the 'Used' value given by sreport lower than
> 'CPUTime' given by sacct?
> 
> Regards
> 
>     Loris
> 
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de

[slurm-dev] sreport/sacct: discrepancy between utilization and CPUTime

2016-01-22 Thread Loris Bennett

Hi,

Using version 15.08.4 I am looking at the value 'Used' from sreport and
comparing this with the corresponding 'CPUTime' from sacct:

$ sreport cluster UserUtilizationByAccount user=bsp start=2016-01-15 -t hours

Cluster/User/Account Utilization 2016-01-15T00:00:00 - 2016-01-21T23:59:59 
(604800 secs)
Use reported in TRES Seconds

  Cluster Login Proper Name Account   Used Energy 
- - --- --- -- -- 
  soroban   bspBeispiel   agexample  4  0 

$ sacct -S 2016-01-15 -u bsp -o jobid,cputime,state
   JobIDCPUTime  State 
 -- -- 
954088 05:49:45  COMPLETED 
954088.batch   05:49:45  COMPLETED 

Rounding aside, why is the 'Used' value given by sreport lower than
'CPUTime' given by sacct?

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


  1   2   >