Re: [slurm-users] Mixing GPU Types on Same Node

2023-04-03 Thread Yair Yarom
Hi,

With regard to 2: if you're using AccountingStorageTRES, I think you can
specify each gres/gpu:<type> to be monitored in addition to the generic
gres/gpu, and then set "GrpTRES=gres/gpu=0" for all accounts so they won't
be able to use the untyped gres/gpu, only gres/gpu:<type>.
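
A minimal sketch of that idea (untested, as noted below; the type names are
guesses based on your hardware list and must match the Type= values in your
gres.conf):

# slurm.conf -- track the typed GPU TRES in addition to the generic one
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:rtx3090,gres/gpu:gv100,gres/gpu:mi100,gres/gpu:mi200

# per account, forbid the untyped gres/gpu so only typed requests get through
sacctmgr modify account where name=somelab set GrpTRES=gres/gpu=0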

We haven't tried this, but it's been on our todo list for a while now. So
I'd like to know if it works :)


On Wed, 29 Mar 2023 at 21:31,  wrote:

> Hello,
>
>
>
> Apologies if this is in the docs but I couldn’t find it anywhere.
>
>
>
> I’ve been using Slurm to run a small 7-node cluster in a research lab for
> a couple of years now (I’m a PhD student). A couple of our nodes have
> heterogeneous GPU models. One in particular has quite a few: 2x NVIDIA
> A100s, 1x NVIDIA 3090, 2x NVIDIA GV100 w/ NVLink, 1x AMD MI100, 2x AMD
> MI200. This makes things a bit challenging but I need to work with what I
> have.
>
>
>
>1. I’ve only been able to set this up previously on Slurm 20.02 by
>“ignoring” the AMDs and just specifying the NVIDIA GPUs. That worked when
>we had one or two people using the AMD GPUs and they could coordinate
>between themselves. But now, we have more people interested. I’m upgrading
>Slurm to 23.02 in hopes that might fix some of the challenges, but
>should this be possible? Ideally I would like to have AutoDetect=nvml
>and AutoDetect=rsmi both on. If it’s not I’ll shuffle GPUs around to
>make this node NVIDIA-only.
>    2. I want everyone to allocate GPUs with --gpus=<type>:<count> instead
>    of --gpus=<count>, so they don’t “block” a nice GPU like an A100 when
>    they really wanted any old GPU on the machine, like a GV100 or 3090. Can I
>    force people to specify a GPU type and not just a count? This is especially
>    important if I’m mixing AMDs and NVIDIAs on the same node. If not, can I
>    specify the “order” in which I want GPUs to be scheduled if they don’t
>    specify a type (so they get handed out from least-powerful to most-powerful
>    if people don’t care)?
>
>
>
> Any help and/or advice here is much appreciated. Slurm has been amazing
> for our lab (albeit challenging to setup at first) and I want to get
> everything dialed before I graduate :D .
>
>
>
> Thanks,
>
> -Collin
>


-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Regarding Multi-Cluster Accounting Information

2023-03-16 Thread Yair Yarom
Hi,

I'm not sure what you mean - which type of accounting, and between which
alternatives? You need to create each account in every cluster you want it in,
and you need to add the users to the account in each relevant cluster
separately.

We use different clusters as they operate differently with regard to
limits and pricing. We use a single database because otherwise it would be
very difficult to manage (we have a local framework to manage users within
the clusters).
For reporting we have various scripts that collect all the data and we
analyze it outside of slurm, though for general reporting sreport is quite
handy.


On Wed, 15 Mar 2023 at 14:55, Shaghuf Rahman  wrote:

> Hi Yair,
>
> Thank you for clarification.
>
> Could you please tell me which way is better for accounting related
> reports.
>
> Thanks & Regards,
> Shaghuf
> On Wed, 15 Mar 2023 at 15:08, Yair Yarom  wrote:
>
>> Hi,
>>
>> We have several clusters on the same database. There are some entities
>> which are per cluster and some which are per database.
>> accounts - per cluster (you can have the same account name with a different
>> account hierarchy, and different limits per cluster)
>> association - per cluster
>> qos - per database (we prefix our qos name with the cluster name to
>> distinguish).
>> partitions/nodes - per cluster (not really in the database though)
>> users - I'm not sure. On one hand an admin user is an admin for all
>> clusters. On the other hand, a user can have a different default account
>> per cluster.
>>
>> HTH,
>>
>>
>> On Tue, 14 Mar 2023 at 13:26, Shaghuf Rahman  wrote:
>>
>>> Hi,
>>>
>>> I tried adding the 2 individual account in cluster A and ClusterB
>>> respectively
>>> and 1 account which is added to both the cluster
>>>
>>> # sacctmgr show user cluster=alpha
>>>   User   Def Acct Admin
>>> -- -- -
>>>user1 alpha_grp None
>>>user2 test  None
>>>user3 beta_grp  None
>>>   root   root Administ+
>>> # sacctmgr show user cluster=beta
>>>   User   Def Acct Admin
>>> -- -- -
>>>user1 beta_grp  None
>>>user2 test  None
>>>user3 beta_grp  None
>>>   root   root Administ+
>>>
>>> So my question is does the account should be available on both the
>>> cluster or should it be unique accounts on both the cluster.
>>>
>>> Regards,
>>> Shaghuf
>>>
>>> On Tue, Mar 14, 2023 at 11:46 AM Shaghuf Rahman 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to slurm. I am setting up a multi cluster environment. I  have
>>>> 1 small doubt with respect to the user accounting. My setup will look like
>>>> below:
>>>> Cluster name A: *Alpha* (Slurmctld)
>>>> Cluster name B: *Beta* (Slurmctld)
>>>> Both controllers are pointing to the same database server.
>>>> My slurm accounting name is *hpc_acc_db*.
>>>> I have added unique users as:
>>>> user1 alpha_grp (belongs to alpha cluster).
>>>> user1 beta_grp {belongs to beta c.luster).
>>>>
>>>> My question is if this accounting should be unique in both the clusters
>>>> or it should be 2 different entries mentioned above.
>>>> do we need to add user as
>>>> sacctmgr add user user1 account=alpha_grp cluster=Alpha,Beta  or
>>>> it should be different like:
>>>> sacctmgr add user user1 account=alpha_grp cluster=Alpha
>>>> sacctmgr add user1 account=beta_grp cluster=Beta
>>>>
>>>> Please let me know in case of any additional information.
>>>>
>>>> Regards,
>>>> Shaghuf Rahman
>>>>
>>>
>>
>> --
>>
>>   /|   |
>>   \/   | Yair Yarom | System Group (DevOps)
>>   []   | The Rachel and Selim Benin School
>>   [] /\| of Computer Science and Engineering
>>   []//\\/  | The Hebrew University of Jerusalem
>>   [//  \\  | T +972-2-5494522 | F +972-2-5494522
>>   //\  | ir...@cs.huji.ac.il
>>  //|
>>
>>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Regarding Multi-Cluster Accounting Information

2023-03-15 Thread Yair Yarom
Hi,

We have several clusters on the same database. There are some entities
which are per cluster and some which are per database.
accounts - per cluster (you can have the same account name with a different
account hierarchy, and different limits per cluster)
association - per cluster
qos - per database (we prefix our qos name with the cluster name to
distinguish).
partitions/nodes - per cluster (not really in the database though)
users - I'm not sure. On one hand an admin user is an admin for all
clusters. On the other hand, a user can have a different default account
per cluster.
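
As a concrete (untested here) sacctmgr sketch, using the names from this
thread - the same user gets a different default account on each cluster:

sacctmgr add account alpha_grp cluster=Alpha
sacctmgr add account beta_grp cluster=Beta
sacctmgr add user user1 account=alpha_grp cluster=Alpha defaultaccount=alpha_grp
sacctmgr add user user1 account=beta_grp cluster=Beta defaultaccount=beta_grp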

HTH,


On Tue, 14 Mar 2023 at 13:26, Shaghuf Rahman  wrote:

> Hi,
>
> I tried adding the 2 individual account in cluster A and ClusterB
> respectively
> and 1 account which is added to both the cluster
>
> # sacctmgr show user cluster=alpha
>   User   Def Acct Admin
> -- -- -
>user1 alpha_grp None
>user2 test  None
>user3 beta_grp  None
>   root   root Administ+
> # sacctmgr show user cluster=beta
>   User   Def Acct Admin
> -- -- -
>user1 beta_grp  None
>user2 test  None
>user3 beta_grp  None
>   root   root Administ+
>
> So my question is does the account should be available on both the
> cluster or should it be unique accounts on both the cluster.
>
> Regards,
> Shaghuf
>
> On Tue, Mar 14, 2023 at 11:46 AM Shaghuf Rahman  wrote:
>
>> Hi,
>>
>> I am new to slurm. I am setting up a multi cluster environment. I  have 1
>> small doubt with respect to the user accounting. My setup will look like
>> below:
>> Cluster name A: *Alpha* (Slurmctld)
>> Cluster name B: *Beta* (Slurmctld)
>> Both controllers are pointing to the same database server.
>> My slurm accounting name is *hpc_acc_db*.
>> I have added unique users as:
>> user1 alpha_grp (belongs to alpha cluster).
>> user1 beta_grp {belongs to beta c.luster).
>>
>> My question is if this accounting should be unique in both the clusters
>> or it should be 2 different entries mentioned above.
>> do we need to add user as
>> sacctmgr add user user1 account=alpha_grp cluster=Alpha,Beta  or
>> it should be different like:
>> sacctmgr add user user1 account=alpha_grp cluster=Alpha
>> sacctmgr add user1 account=beta_grp cluster=Beta
>>
>> Please let me know in case of any additional information.
>>
>> Regards,
>> Shaghuf Rahman
>>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] NVIDIA MIG question

2022-11-17 Thread Yair Yarom
Can you run more than 7 single-GPU jobs on the same node?
It could be that you've hit another limit (e.g. memory or
CPU), or some other limit (in the account, partition, or QOS).

On our setup we limit jobs to 1 GPU per job (via a partition QOS);
however, we can use up all the MIGs with single-GPU jobs.
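
Roughly, the partition QOS part looks like this (a sketch; the QOS, partition,
and node names are placeholders):

sacctmgr add qos gpu1 set MaxTRESPerJob=gres/gpu=1
# slurm.conf
PartitionName=mig Nodes=mig-node-[01-02] QOS=gpu1 Default=NO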


On Wed, 16 Nov 2022 at 23:48, Groner, Rob  wrote:

> That does help, thanks for the extra info.
>
> If I have two separate GPU cards in the node, and I setup 7 MIGs on each
> card, for a total of 14 MIG "gpus" in the node...then, SHOULD I be able to
> salloc requesting, say 10 GPUs (7 from 1 card, 3 from the other)?  Because
> I can't.
>
> I can request up to 7 just fine.  When I request more than that, it adds
> in other nodes to try to give me that, even though there are theoretically
> 14 on the one node.  When I ask for 8, it gives me 7 from t-gc-1202 and
> then 1 from t-gc-1201.  When I ask for 10, then it fails because it can't
> give me 10 without using 2 cards in one node.
>
>
> [rug262@testsch ~ ]# sinfo -o "%20N  %10c  %10m  %25f  %50G "
> NODELIST    CPUS  MEMORY  AVAIL_FEATURES  GRES
> t-gc-1201   48    358400  3gc20gb         gpu:nvidia_a100_3g.20gb:4(S:0)
> t-gc-1202   48    358400  1gc5gb          gpu:nvidia_a100_1g.5gb:14(S:0)
>
>
> [rug262@testsch (RC) ~] salloc --gpus=10 --account=1gc5gb
> --partition=sla-prio
> salloc: Job allocation 5015 has been revoked.
> salloc: error: Job submit/allocate failed: Requested node configuration is
> not available
>
>
> Rob
>
> --
> *From:* slurm-users  on behalf of
> Yair Yarom 
> *Sent:* Wednesday, November 16, 2022 3:48 AM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] NVIDIA MIG question
>
> You don't often get email from ir...@cs.huji.ac.il. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
> Hi,
>
> From what we observed, Slurm sees the MIGs each as a distinct gres/gpu. So
> you can have 14 jobs each using a different MIG.
> However (unless something has changed in the past year), due to nvidia
> limitations, a single process can't access more than one MIG simultaneously
> (this is unrelated to Slurm). So while you can have a user request a Slurm
> job with 2 gpus (MIGs), they'll have to run two distinct processes within
> that job in order to utilize those two MIGs.
>
> HTH,
>
>
> On Tue, 15 Nov 2022 at 23:42, Laurence  wrote:
>
> Hi Rob,
>
>
> Yes, those questions make sense. From what I understand, MIG should
> essentially split the GPU so that they behave as separate cards. Hence two
> different users should be able to use two different MIG instances at the
> same time and also a single job could use all 14 instances. The result you
> observed suggests that MIG is a feature of the driver i.e lspci shows one
> device but nvidia-smi shows 7 devices.
>
>
> I haven't played around with this myself in slurm but would be interested
> to know the answers.
>
>
> Laurence
>
>
> On 15/11/2022 17:46, Groner, Rob wrote:
>
> We have successfully used the nvidia-smi tool to take the 2 A100's in a
> node and split them into multiple GPU devices.  In one case, we split the 2
> GPUS into 7 MIG devices each, so 14 in that node total, and in the other
> case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
>
> From our limited testing so far, and from the "sinfo" output, it appears
> that slurm might be considering all of the MIG devices on the node to be in
> the same socket (even though the MIG devices come from two separate
> graphics cards in the node).  The sinfo output says (S:0) after the 14
> devices are shown, indicating they're in socket 0.  That seems to be
> preventing 2 different users from using MIG devices at the same time.  Am I
> wrong that having 14 MIG gres devices show up in slurm should mean that, in
> theory, 14 different users could use one at the same time?
>
> Even IF that doesn't work... if I have 14 devices spread across 2 physical
> GPU cards, can one user utilize all 14 for a single job?  I would hope that
> slurm would treat each of the MIG devices as its own separate card, which
> would mean 14 different jobs could run at the same time using their own
> particular MIG, right?
>
> Do those questions make sense to anyone?  
>
> Rob
>
>
>
>
> --
>
>   /|   |
>   \/   | Yair Yarom | System Group (DevOps)
>   []   | The Rachel and Selim Benin School
>   [] /\| of Computer Science and Engineering
>   []//\\/  | The Hebrew University of Jerusalem
>   [//  \\  | T +972-2-5494522 | F +972-2-5494522
>   //\  | ir...@cs.huji.ac.il
>  //|
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] NVIDIA MIG question

2022-11-16 Thread Yair Yarom
Hi,

From what we observed, Slurm sees the MIGs each as a distinct gres/gpu. So
you can have 14 jobs each using a different MIG.
However (unless something has changed in the past year), due to nvidia
limitations, a single process can't access more than one MIG simultaneously
(this is unrelated to Slurm). So while you can have a user request a Slurm
job with 2 gpus (MIGs), they'll have to run two distinct processes within
that job in order to utilize those two MIGs.
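
For example, a batch script along these lines can drive two MIGs from a single
allocation (just a sketch; the application is a placeholder, and it assumes
Slurm exposes the allocated MIGs through CUDA_VISIBLE_DEVICES):

#!/bin/bash
#SBATCH --gres=gpu:2
# split the comma-separated list of allocated MIG devices
IFS=',' read -ra DEVS <<< "$CUDA_VISIBLE_DEVICES"
# one process per MIG, each seeing only its own device
CUDA_VISIBLE_DEVICES=${DEVS[0]} ./my_app input1 &
CUDA_VISIBLE_DEVICES=${DEVS[1]} ./my_app input2 &
wait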

HTH,


On Tue, 15 Nov 2022 at 23:42, Laurence  wrote:

> Hi Rob,
>
>
> Yes, those questions make sense. From what I understand, MIG should
> essentially split the GPU so that they behave as separate cards. Hence two
> different users should be able to use two different MIG instances at the
> same time and also a single job could use all 14 instances. The result you
> observed suggests that MIG is a feature of the driver i.e lspci shows one
> device but nvidia-smi shows 7 devices.
>
>
> I haven't played around with this myself in slurm but would be interested
> to know the answers.
>
>
> Laurence
>
>
> On 15/11/2022 17:46, Groner, Rob wrote:
>
> We have successfully used the nvidia-smi tool to take the 2 A100's in a
> node and split them into multiple GPU devices.  In one case, we split the 2
> GPUS into 7 MIG devices each, so 14 in that node total, and in the other
> case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
>
> From our limited testing so far, and from the "sinfo" output, it appears
> that slurm might be considering all of the MIG devices on the node to be in
> the same socket (even though the MIG devices come from two separate
> graphics cards in the node).  The sinfo output says (S:0) after the 14
> devices are shown, indicating they're in socket 0.  That seems to be
> preventing 2 different users from using MIG devices at the same time.  Am I
> wrong that having 14 MIG gres devices show up in slurm should mean that, in
> theory, 14 different users could use one at the same time?
>
> Even IF that doesn't work... if I have 14 devices spread across 2 physical
> GPU cards, can one user utilize all 14 for a single job?  I would hope that
> slurm would treat each of the MIG devices as its own separate card, which
> would mean 14 different jobs could run at the same time using their own
> particular MIG, right?
>
> Do those questions make sense to anyone?  
>
> Rob
>
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] gres.conf and select/cons_res plugin

2022-09-14 Thread Yair Yarom
Hi,

I don't remember the exact details, but we started with cons_res a while
back, and at one of the upgrades we moved to cons_tres which was newer and
supported more options (I don't remember which, but I think gpus were
generally supported with cons_res). I don't think we lost any features when
we switched.

Indeed you need to look at your version's documentation, e.g. cons_tres
doesn't appear in:
https://slurm.schedmd.com/archive/slurm-17.11.0/cons_res.html
but appears in
https://slurm.schedmd.com/cons_res.html

And from the latter:
The Consumable Trackable Resources (*cons_tres*) plugin provides all the
same functionality provided by the Consumable Resources (*cons_res*)
plugin. It also includes additional functionality specifically related to
GPUs.
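
In slurm.conf the switch itself is small (a sketch; keep whatever
SelectTypeParameters you already use with cons_res):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu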



On Wed, 14 Sept 2022 at 11:45, Ole Holm Nielsen 
wrote:

> Please note that the on-line Slurm documentation refers to version 22.05
> (the latest version)!  For your outdated version 17.x you will have to
> find the old documentation.
>
> Of course, upgrading to 22.05 is very strongly recommended!  Please note
> that you must upgrade by no more than 2 major releases at a time!!  See
> some notes in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> /Ole
>
> On 9/13/22 23:04, Patrick Goetz wrote:
> > I think reading the documentation is making me more confused; maybe this
> > has to do with version changes.  My current slurm cluster is using
> version
> > 17.x
> >
> > Looking at the man page for gres.conf
> > (https://slurm.schedmd.com/gres.conf.html)  I see this:
> >
> > NOTE: Slurm support for gres/[mps|shard] requires the use of the
> > select/cons_tres plugin.
> >
> > On my current (inherited) Slurm cluster we have:
> >
> >SelectType=select/cons_res
> >
> > but users are primarily using GPU resources, so I know Gres is working.
> > Why then is select/cons_tres required?
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-25 Thread Yair Yarom
I hope UsePAM won't get deprecated. I can understand the dangers, and
indeed using it for limits seems weird nowadays, but it's a nice hook to
have and we use it for other purposes: pam_setquota for a per-user /tmp
quota; setting up the per-user /run/user/<uid> directory (usually systemd sets
this up, but systemd doesn't play nicely with Slurm); fixing some cgroup mess
we have in our system; and calling pam_loginuid.
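
As a rough illustration only (not our actual stack, the file name depends on
how UsePAM is wired on your system, and the pam_exec script is a made-up local
helper), such a PAM stack could look something like:

# e.g. /etc/pam.d/slurm -- hypothetical example
account   required   pam_unix.so
session   required   pam_limits.so
session   required   pam_loginuid.so
# plus pam_setquota.so for the per-user /tmp quota (options omitted)
session   optional   pam_exec.so /usr/local/sbin/slurm-session-setup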

For a different solution - maybe calling keyctl in a TaskProlog can solve
this issue.



On Thu, 25 Aug 2022 at 12:37, Ole Holm Nielsen 
wrote:

> On 8/25/22 11:15, Matthias Leopold wrote:
> > Thanks for the hint. I wasn't aware of UsePAM. At first it looks
> tempting,
> > but then I read some bug reports and saw that it's an "alternative way
> of
> > enforcing resource limits" and is considered an "older deprecated
> > functionality".
> >
> > https://bugs.schedmd.com/show_bug.cgi?id=4098
>
> Warning: Do NOT configure UsePAM=1 in slurm.conf (this advice can be found
> on the net).  See
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags
>
> /Ole
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-24 Thread Yair Yarom
Hi,

I think you should look at pam_keyinit and add it to the Slurm PAM stack (the
one used with the UsePAM configuration).
We currently don't do this, but it's on the todo list to check it out...
(so I'm not sure if it will work, or if it's the right way to do this).
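
If it helps, the line itself would presumably be something like (untested, and
the PAM stack file name depends on your UsePAM setup):

# e.g. in /etc/pam.d/slurm
session    optional    pam_keyinit.so force revoke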


On Tue, 23 Aug 2022 at 16:36, Matthias Leopold <
matthias.leop...@meduniwien.ac.at> wrote:

> Hi,
>
> I want to access the kernel "user" keyrings inside a Slurm job on a
> Ubuntu 20.04 node. I'm not an expert on keyrings (yet), I just
> discovered that inside a Slurm job a keyring for "user: invocation_id"
> is used, which seems to be shared across all users of the executing
> Slurm node (other users can access/destroy my keys).
>
> The structure in a session run from Slurm looks like this (when using
> cifscreds):
>
> Session Keyring
>
>   989278347 --alswrv  0 0  keyring: _ses
>
>   446567140 s-rv  0 0   \_ user: invocation_id
>
>   638050420 sw-v  35816 10513   \_ logon: cifs:d:itsc-test2
>
>
> The structure in a SSH session looks like this (when using cifscreds):
>
> Session Keyring
>
>   932177825 --alswrv   1000  1000  keyring: _ses
>
>   826996940 --alswrv   1000 65534   \_ keyring: _uid.1000
>
> 1006610690 sw-v   1000  1000   \_ logon: cifs:d:itsc-test2
>
>
> I researched about this invocation_id and found a section on
> "KeyringMode=" in systemd.exec man page, but that didn't really help me.
>
> Can you explain to me how it would be possible to get "private" keyrings
> inside a Slurm job on the executing node?
>
> thx
> Matthias
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] WTERMSIG 15

2021-12-01 Thread Yair Yarom
I guess they won't be killed, but having them there could cause other
issues, i.e. any limit that systemd places on the slurmd service will be
applied to the jobs as well, and probably cumulatively.
Do you use cgroups for Slurm resource management (the TaskPlugin)? If so,
it means this is not working properly.
We have a lot of customization here, so I can't be sure exactly what change
you need. We have the default KillMode (control-group), and
Delegate=true.
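
For reference, a systemd drop-in along these lines matches what we described
(a sketch, not our exact unit file):

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
Delegate=yes
# KillMode is left at its default (control-group) here; the alternative
# discussed below is KillMode=process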



On Tue, Nov 30, 2021 at 2:00 PM LEROY Christine 208562 <
christine.ler...@cea.fr> wrote:

> Hi,
>
>
>
> Thanks for your feedback.
>
> It seems we are in the 1st case, but then looking deeper: for SL7 node we
> didn’t encounter the problem thanks to this service configuration (*).
>
> So the solution seems to configure KillMode=process as mention there (**):
> we will still have jobs listed when doing a 'systemctl status
> slurmd.service', but they won’t be killed; is that right?
>
>
>
> Thanks in advance,
>
> Christine
>
> (**)
>
> https://slurm.schedmd.com/programmer_guide.html
>
> (*)
>
> grep -i killmode /lib/systemd/system/slurmd.service
>
> KillMode=process
>
>
>
> Instead of (for ubuntu nodes)
>
> KillMode=control-group
>
>
>
> *De :* slurm-users  *De la part de*
> Yair Yarom
> *Envoyé :* mardi 30 novembre 2021 08:50
> *À :* Slurm User Community List 
> *Objet :* Re: [slurm-users] WTERMSIG 15
>
>
>
> Hi,
>
>
>
> There were two cases where this happened to us as well:
>
> 1. The systemd slurmd.service wasn't configured properly, and so the jobs
> ran under the slurmd.slice. So by restarting slurmd, systemd will send a
> signal to all processes. You can check if this is the case with 'systemctl
> status slurmd.service' - the jobs shouldn't be listed there.
>
> 2. When changing the partitions, as jobs here are sent to most partitions
> by default, removing partitions or nodes from partitions might cause the
> jobs in the relevant partitions to be killed.
>
>
>
> HTH,
>
>
>
>
>
> On Mon, Nov 29, 2021 at 6:46 PM LEROY Christine 208562 <
> christine.ler...@cea.fr> wrote:
>
> Hello all,
>
>
>
> I did some modification in my slurm.conf and I’ve restarted the slurmctld
> on the master and then the slurmd on the nodes.
>
> During this process I’ve lost some jobs (*), curiously all these jobs were
> on ubuntu nodes .
>
> These jobs were ok with the consumed resources (**).
>
>
>
> Any Idea what could be the problem ?
>
> Thanks in advance
>
> Best regards,
>
> Christine Leroy
>
>
>
>
>
> (*)
>
> [2021-11-29T14:17:09.205] error: Node xxx appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
>
> [2021-11-29T14:17:10.162]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>
> [2021-11-29T14:17:42.223] _job_complete: JobId=4546 WTERMSIG 15
>
> [2021-11-29T14:17:42.223] _job_complete: JobId=4546 done
>
> [2021-11-29T14:17:42.224] _job_complete: JobId=4666 WTERMSIG 15
>
> [2021-11-29T14:17:42.224] _job_complete: JobId=4666 done
>
> [2021-11-29T14:17:42.236] _job_complete: JobId=4665 WTERMSIG 15
>
> [2021-11-29T14:17:42.236] _job_complete: JobId=4665 done
>
> [2021-11-29T14:17:46.072] _job_complete: JobId=4533 WTERMSIG 15
>
> [2021-11-29T14:17:46.072] _job_complete: JobId=4533 done
>
> [2021-11-29T14:17:59.005] _job_complete: JobId=4664 WTERMSIG 15
>
> [2021-11-29T14:17:59.005] _job_complete: JobId=4664 done
>
> [2021-11-29T14:17:59.006] _job_complete: JobId=4663 WTERMSIG 15
>
> [2021-11-29T14:17:59.007] _job_complete: JobId=4663 done
>
> [2021-11-29T14:17:59.021] _job_complete: JobId=4539 WTERMSIG 15
>
> [2021-11-29T14:17:59.021] _job_complete: JobId=4539 done
>
>
>
>
>
> (**)
>
> # sacct --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS |
> grep -f /tmp/job15
>
> 4533  xterm1   16Gn 2021-11-24T16:31:32 FAILED
> 4-21:46:14
>
> 4533.batchbatch1   16Gn 2021-11-24T16:31:32  CANCELLED
> 4-21:46:14   8893664K
>
> 4533.extern  extern1   16Gn 2021-11-24T16:31:32  COMPLETED
> 4-21:46:11  0
>
> 4539  xterm   16   16Gn 2021-11-24T16:34:25 FAILED
> 78-11:37:04
>
> 4539.batchbatch   16   16Gn 2021-11-24T16:34:25  CANCELLED
> 78-11:37:04  23781384K
>
> 4539.e

Re: [slurm-users] WTERMSIG 15

2021-11-29 Thread Yair Yarom
18Gn 2021-11-26T17:22:12  COMPLETED
> 2-20:55:27  0
>
> 4711  xterm43Gn 2021-11-29T14:47:09
> FAILED   00:20:08
>
> 4711.batchbatch43Gn 2021-11-29T14:47:09
> CANCELLED   00:20:08 37208K
>
> 4711.extern  extern43Gn 2021-11-29T14:47:09
> COMPLETED   00:20:00  0
>
> 4714  deckbuild   10   30Gn 2021-11-29T14:51:46
> FAILED   00:05:20
>
> 4714.batchbatch   10   30Gn 2021-11-29T14:51:46
> CANCELLED   00:05:20  4036K
>
> 4714.extern  extern   10   30Gn 2021-11-29T14:51:46
> COMPLETED   00:05:10  0
>


-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Slurm Multi-cluster implementation

2021-11-01 Thread Yair Yarom
A CPU limit using ulimit is pretty straightforward with pam_limits and
/etc/security/limits.conf. On some of the login nodes we have a CPU-time limit
of 10 minutes, so heavy processes will fail.
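
For example (a sketch; the group name is a placeholder, and the cpu item in
limits.conf is in minutes):

# /etc/security/limits.conf on the login node
@users    hard    cpu    10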

Memory was a bit more complicated (i.e. not pretty). We wanted a user not to
be able to use more than e.g. 1G for all of their processes combined.
Using systemd we added the file
/etc/systemd/system/user-.slice.d/20-memory.conf which contains:
[Slice]
MemoryLimit=1024M
MemoryAccounting=true

But we also wanted to restrict swap usage and we're still on cgroupv1, so
systemd didn't help there. The ugly part comes with a pam_exec to a script
that updates the memsw limit of the cgroup for the above slice. The script
does more things, but the swap section is more or less:

if [ "x$PAM_TYPE" = 'xopen_session' ]; then
_id=`id -u $PAM_USER`
if [ -z "$_id" ]; then
exit 1
fi
if [[ -e
/sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
]]; then
swap=$((1126 * 1024 * 1024))
echo $swap >
/sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
fi
fi


On Sun, Oct 31, 2021 at 6:36 PM Brian Andrus  wrote:

> That is interesting to me.
>
> How do you use ulimit and systemd to limit user usage on the login nodes?
> This sounds like something very useful.
>
> Brian Andrus
> On 10/31/2021 1:08 AM, Yair Yarom wrote:
>
> Hi,
>
> If it helps, this is our setup:
> 6 clusters (actually a bit more)
> 1 mysql + slurmdbd on the same host
> 6 primary slurmctld on 3 hosts (need to make sure each has a distinct
> SlurmctldPort)
> 6 secondary slurmctld on an arbitrary node on the clusters themselves.
> 1 login node per cluster (this is a very small VM, and the users are
> limited both in CPU time (with ulimit) and memory (with systemd))
> The slurm.conf files are shared over NFS to everyone as
> /path/to/nfs/<cluster name>/slurm.conf, with a symlink in /etc to the
> relevant cluster's file on each node.
>
> The -M generally works, we can submit/query jobs from a login node of one
> cluster to another. But there's a caveat to notice when upgrading. slurmdbd
> must be upgraded first, but usually we have a not so small gap between
> upgrading the different clusters. This causes the -M to stop working
> because binaries of one version won't work on the other (I don't remember
> in which direction).
> We solved this by using an lmod module per cluster, which sets both the
> SLURM_CONF environment variable and the PATH to the correct Slurm binaries
> (which we install in /usr/local/slurm/<version>/ so that they co-exist). So
> when the -M option won't work, users can use:
> module load slurm/clusterA
> squeue
> module load slurm/clusterB
> squeue
>
> BR,
>
>
>
>
>
>
>
> On Thu, Oct 28, 2021 at 7:39 PM navin srivastava 
> wrote:
>
>> Thank you Tina.
>> It will really help
>>
>> Regards
>> Navin
>>
>> On Thu, Oct 28, 2021, 22:01 Tina Friedrich 
>> wrote:
>>
>>> Hello,
>>>
>>> I have the database on a separate server (it runs the database and the
>>> database only). The login nodes run nothing SLURM related, they simply
>>> have the binaries installed & a SLURM config.
>>>
>>> I've never looked into having multiple databases & using
>>> AccountingStorageExternalHost (in fact I'd forgotten you could do that),
>>> so I can't comment on that (maybe someone else can); I think that works,
>>> yes, but as I said never tested that (didn't see much point in running
>>> multiple databases if one would do the job).
>>>
>>> I actually have specific login nodes for both of my clusters, to make it
>>> easier for users (especially those with not much experience using the
>>> HPC environment); so I have one login node connecting to cluster 1 and
>>> one connecting to cluster 1.
>>>
>>> I think the relevant bits of slurm.conf Relevant config entries (if I'm
>>> not mistaken) on the login nodes are probably:
>>>
>>> The differences in the slurm config files (that haven't got to do with
>>> topology & nodes & scheduler tuning) are
>>>
>>> ClusterName=cluster1
>>> ControlMachine=cluster1-slurm
>>> ControlAddr=/IP_OF_SLURM_CONTROLLER/
>>>
>>> ClusterName=cluster2
>>> ControlMachine=cluster2-slurm
>>> ControlAddr=/IP_OF_SLURM_CONTROLLER/
>>>
>>> (where IP_OF_SLURM_CONTROLLER is the IP address of host cluster1-slurm,
>>> same for cluster2)
>>>
>>> And then the have common entries for the AccountingStorageHost:
>>>
>>> AccountingStorageHost=slurm-db-prod
>>> AccountingStorageBa

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-31 Thread Yair Yarom
se server (running slurmdbd), and then a
>> > SLURM controller for each cluster (running slurmctld) using that one
>> > central database, the '-M' option should work.
>> >
>> > Tina
>> >
>> > On 28/10/2021 10:54, navin srivastava wrote:
>> >  > Hi ,
>> >  >
>> >  > I am looking for a stepwise guide to setup multi cluster
>> > implementation.
>> >  > We wanted to set up 3 clusters and one Login Node to run the job
>> > using
>> >      > -M cluster option.
>> >  > can anybody have such a setup and can share some insight into
>> how it
>> >  > works and it is really a stable solution.
>> >  >
>> >  >
>> >  > Regards
>> >  > Navin.
>> >
>> > --
>> > Tina Friedrich, Advanced Research Computing Snr HPC Systems
>> > Administrator
>> >
>> > Research Computing and Support Services
>> > IT Services, University of Oxford
>> > http://www.arc.ox.ac.uk <http://www.arc.ox.ac.uk>
>> > http://www.it.ox.ac.uk <http://www.it.ox.ac.uk>
>> >
>>
>> --
>> Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
>>
>> Research Computing and Support Services
>> IT Services, University of Oxford
>> http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
>>
>>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Long term archiving

2021-06-29 Thread Yair Yarom
Thanks Paul,

How many nodes/users do you have? Have you tried upgrading the database
between slurm versions? If so, how long did it take?

On Mon, Jun 28, 2021 at 5:53 PM Paul Edmon  wrote:

> We keep 6 months in our active database and then we archive and purge
> anything older than that.  The archive data itself is available for
> reimport and historical investigation.  We've done this when importing
> historical data into XDMod.
>
> -Paul Edmon-
> On 6/28/2021 10:43 AM, Yair Yarom wrote:
>
> Hi list,
>
> I was wondering if you could share your long term archiving practices.
>
> We currently purge and archive the jobs after 31 days, and keep the usage
> data without purging. This gives us a reasonable history, and a downtime of
> "only" a few hours on database upgrade. We currently don't load the
> archives into a secondary db.
>
> We now have a use-case which might require us to save job information for
> more than that, and we're considering how to do that.
>
> Thanks in advance,
>
>
> --
>
>   /|   |
>   \/   | Yair Yarom | System Group (DevOps)
>   []   | The Rachel and Selim Benin School
>   [] /\| of Computer Science and Engineering
>   []//\\/  | The Hebrew University of Jerusalem
>   [//  \\  | T +972-2-5494522 | F +972-2-5494522
>   //\  | ir...@cs.huji.ac.il
>  //|
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


[slurm-users] Long term archiving

2021-06-28 Thread Yair Yarom
Hi list,

I was wondering if you could share your long term archiving practices.

We currently purge and archive the jobs after 31 days, and keep the usage
data without purging. This gives us a reasonable history, and a downtime of
"only" a few hours on database upgrade. We currently don't load the
archives into a secondary db.

We now have a use-case which might require us to save job information for
more than that, and we're considering how to do that.

Thanks in advance,


-- 

  /|   |
  \/       | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] slurmd running on IBM Power9 systems

2021-06-27 Thread Yair Yarom
Hi,

If it helps, we have Slurm 19.05 running on POWER8 and we have been ignoring
these messages for quite a while now.
I'm not sure what impact they have on the scheduler or the jobs, but we
generally don't play with the frequency anyway.


On Wed, Jun 23, 2021 at 7:16 PM Karl Lovink  wrote:

> Hello,
>
> I have compiled the version 20.11.7 for a IBM Power9 system running
> Ubuntu 18.04. I have slurmd running but in the slurmd.log a predominant
> error pops up. I did alreay some research but I cannot find a solution.
>
> The error is:
> [2021-06-23T18:02:01.550] error: all available frequencies not scanned
>
> Rest of the log file:
> [2021-06-23T17:33:39.021] slurmd version 20.11.7 started
> [2021-06-23T17:33:39.021] slurmd started on Wed, 23 Jun 2021 17:33:39 +0200
> [2021-06-23T17:33:39.022] CPUs=128 Boards=1 Sockets=2 Cores=16 Threads=4
> Memory=261562 TmpDisk=899802 Uptime=6263 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
> [2021-06-23T18:02:01.550] error: all available frequencies not scanned
> [2021-06-23T18:02:01.550] error: all available frequencies not scanned
> [2021-06-23T18:02:01.550] error: all available frequencies not scanned
>
>
> Any idea how I can fix this problem?
>
> Regards,
> Karl
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Job flexibility with cons_tres

2021-02-09 Thread Yair Yarom
Hi,

We have a similar configuration, very heterogeneous cluster and cons_tres.
Users need to specify the CPU/memory/GPU/time, and it will schedule their
job somewhere. Indeed there's currently no guarantee that you won't be left
with a node with unusable GPUs because no CPUs or memory are available.

We use one partition with 100% of the nodes and a time limit of 2 days, and
a second partition with ~90% of the nodes and a limit of 7 days. This gives
shorter jobs a chance to run without waiting just for long jobs.

We also use weights for the nodes, such that smaller nodes (resource-wise)
will be selected first. This prevents smaller jobs from filling up the
larger nodes (unless previous smaller nodes are occupied).
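
A sketch of how that looks in slurm.conf (node names, sizes, and limits are
made up; nodes with a lower Weight are picked first):

NodeName=small-[01-10] CPUs=8  RealMemory=64000  Gres=gpu:1 Weight=10
NodeName=big-[01-04]   CPUs=64 RealMemory=512000 Gres=gpu:4 Weight=100
PartitionName=short Nodes=small-[01-10],big-[01-04] Default=YES MaxTime=2-00:00:00
PartitionName=long  Nodes=small-[01-09],big-[01-03] MaxTime=7-00:00:00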

HTH,
Yair.



On Mon, Feb 8, 2021 at 1:41 PM Ansgar Esztermann-Kirchner <
aesz...@mpibpc.mpg.de> wrote:

> Hello List,
>
> we're running a heterogeneous cluster (just x86_64, but a lot of
> different node types from 8 to 64 HW threads, 1 to 4 GPUs).
> Our processing power (for our main application, at least) is
> exclusively provided by the GPUs, so cons_tres looks quite promising:
> depending on the size of the job, request an appropriate number of
> GPUs. Of course, you have to request some CPUs as well -- ideally,
> evenly distributed among the GPUs (e.g. 10 per GPU on a 20-core, 2-GPU
> node; 16 on a 64-core, 4-GPU node).
> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a given job
> could run on any large enough node.
>
> Is there anyone with a similar setup? Any config options I've missed,
> or do you have a work-around?
>
> Thanks,
>
> A.
>
> --
> Ansgar Esztermann
> Sysadmin Dep. Theoretical and Computational Biophysics
> http://www.mpibpc.mpg.de/grubmueller/esztermann
>


-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Using "Environment Modules" in a SLURM script

2021-01-24 Thread Yair Yarom
We also use the Lmod system. We found that besides the user's shell, it
also depends on how you install it, i.e. it needs to be active for non-login
shells and not just for login shells (bashrc vs. profile). Also, for e.g.
/bin/sh, it might not read any init file at all.

As we might have different modules between different nodes, and between the
nodes and the submission machine, we actually don't want the modules to
pass across from the submission node to the cluster. As such, we're using
spank and TaskProlog plugins here to reset the modules on execution. Users
can run 'module load' in their script, but can also use '--module
<modules>' as an srun/sbatch parameter, which can be useful for scripts that
can't load the modules themselves.
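
As a very rough illustration of the TaskProlog side only (our real setup does
more, and the paths here are assumptions):

#!/bin/bash
# TaskProlog: lines printed as "export NAME=value" / "unset NAME" are applied
# to the task's environment. Drop whatever module state came from the
# submission host and point the task at the node's own module tree.
echo "unset LOADEDMODULES"
echo "unset _LMFILES_"
echo "export MODULEPATH=/usr/local/modulefiles"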


On Fri, Jan 22, 2021 at 6:37 PM Peter Kjellström  wrote:

> On our slurm clusters the module system (Lmod) works without extra init
> in job scripts due to the environment-forwarding in slurm. "module" in
> the submitting context (in bash) on the login node is an "exported"
> function and as such makes it across.
>
> /Peter
>
> On Fri, 22 Jan 2021 10:41:06 +
> Gestió Servidors  wrote:
>
> > Hello,
> >
> > I use "Environment Modules" (http://modules.sourceforge.net/) in my
> > SLURM cluster. In my scripts I do need to add an explicit
> > "source /soft/modules-3.2.10/Modules/3.2.10/init/bash". However, in
> > several examples I have read about SLURM scripts, nobody comments
> > that. So, have I forgotten a parameter in SLURM to "capture"
> > environment variables into the script or is it a problem due to my
> > distribution (CentOS-7)???
> >
> > Thanks.
>
>
>

-- 

  /|   |
  \/   | Yair Yarom | System Group (DevOps)
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

2020-12-02 Thread Yair Yarom
Hi,

We also noticed this. We eventually set HealthCheckInterval to its maximum
value (65535), and created a systemd timer which runs the scripts outside of
Slurm, with proper intervals and randomized delays.
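
For illustration, a unit pair along these lines (names, paths, and intervals
are placeholders, not our actual units):

# /etc/systemd/system/node-health.service
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/node-health-check

# /etc/systemd/system/node-health.timer
[Timer]
OnCalendar=*:0/10
RandomizedDelaySec=300
[Install]
WantedBy=timers.target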

Yair.

On Wed, Dec 2, 2020 at 9:03 AM  wrote:

> Hello,
>
>
>
> Our slurm cluster managed about 600+ nodes and I tested to set
> HealthCheckNodeState=CYCLE in slurm.conf. According to conf manual, setting
> this to CYCLE shall cause slurm to “cycle through running on all compute
> nodes through the course of the HealthCheckInterval”. So I set
> “HealthCheckInterval = 600”, and expected the health check time point can
> be evenly distributed across the 600 seconds period.
>
> But the test result showed that the earliest checked node is at about
> 14:19:35, while the latest checked node is at about 14:20:39. A round of
> the health checks only distributed across 60+ seconds? And the previous
> checking round distributed from 14:08:10 to 14:09:26, it seems the
> HealthCheckInterval only control the time interval between two rounds, not
> the time range distributed by one round checkings.
>
> So did I mistake the description in conf’s manual? And is there any method
> can control the health check frequency in one round between different nodes?
>
>
>
> Thanks.
>


Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-23 Thread Yair Yarom
On Fri, Nov 20, 2020 at 12:11 AM Sebastian T Smith  wrote:

> Hi,
>
> We're setting GrpTRESMins on the account association and have NoDecay
> QOS's for different user classes.  All user associations with a
> GrpTRESMins-limited account are assigned a NoDecay QOS.  I'm not sure if
> it's a better approach... but it's an option.
>

If I follow correctly, your GrpTRESMins usage on the accounts will still
get decayed. From tests I ran here when running with a NoDecay QOS, the
GrpTRESMins of the account still gets decayed, while the GrpTRESMins of the
QOS doesn't.
So do you also have a GrpTRESMins on the QOS itself? And if so, why do you
need both on the QOS and on the account? or am I missing something?

Thanks,
Yair.


[slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-17 Thread Yair Yarom
Hi all,

We have around 50 accounts, each with its own GrpTRES limits. We want to add
another set of accounts (probably another 50) with a different priority, which
will have GrpTRESMins, such that users could "buy" TRES*minutes at a higher
priority.

For that we require that the GrpTRESMins won't get decayed. We do want the
normal multifactor priority mechanism to work with decaying usage, and we
don't want to reset the usage of GrpTRESMins periodically.

Currently the only solution I found is to create a new QOS with NoDecay for
each such new account. As we also have multiple clusters on the same
database, this also requires a new QOS for each account for each cluster
(as QOS appears to be shared among clusters).
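
Per account and cluster that means something like (a sketch; the names and the
limit are made up):

sacctmgr add qos clusterA-lab1 set Flags=NoDecay GrpTRESMins=gres/gpu=50000
sacctmgr modify account where name=lab1 cluster=clusterA set QOS=clusterA-lab1 DefaultQOS=clusterA-lab1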

Is there a downside of adding many QOS? (besides the management headache).
Has anyone else done something similar and have some insights? or another
solution?

Thanks in advance,
Yair.


[slurm-users] Backfill fails to start jobs (when preemptable QOS is involved)

2020-11-15 Thread Yair Yarom
Hi list,

We have GrpTRES limits on all accounts which causes a lot of higher
priority jobs to stay in the queue due to limits. As such we rely heavily
on the backfill scheduler. We also have a special lower priority
preemptable QOS with no limits.

We've noticed that when the cluster is loaded, submitting a non-preemptable
but not highest-priority job will cause the backfill algorithm to fail to
start the job when it needs to kill preemptable jobs. The preemptable jobs
are killed, but the job doesn't start.

From the logs, for job 3617065:
[2020-11-15T13:36:01.928] backfill test for JobId=3617065 Prio=680634
Partition=short
[2020-11-15T13:36:12.947] _preempt_jobs: preempted JobId=3616258 had to be
killed
[2020-11-15T13:36:12.953] _preempt_jobs: preempted JobId=3616259 had to be
killed
[2020-11-15T13:36:12.960] _preempt_jobs: preempted JobId=3616255 had to be
killed
[2020-11-15T13:36:12.966] _preempt_jobs: preempted JobId=3616256 had to be
killed
[2020-11-15T13:36:12.972] _preempt_jobs: preempted JobId=3616257 had to be
killed
[2020-11-15T13:36:12.973] backfill: planned start of JobId=3617065 failed:
Requested nodes are busy
[2020-11-15T13:36:12.973] JobId=3617065 to start at 2020-11-15T13:36:01,
end at 2020-11-15T15:36:00 on nodes dumfries-002 in partition short

Looking at job 3616258 which was preempted on time:
$ sacct -j 3616258 -ojobid,end,state
   JobID End  State
 --- --
3616258  2020-11-15T13:36:12  PREEMPTED
3616258.bat+ 2020-11-15T13:36:50  CANCELLED
3616258.ext+ 2020-11-15T13:36:13  COMPLETED

The job was preempted at 13:36:12, but the batch script was finished only
at 13:36:50. By then the backfill already gave up. The job will start in
one of the subsequence backfill cycles, but in some cases this can take
more than 30 minutes.

Is this intentional? i.e. that the backfill will preempt jobs on the first
cycle, and run the "real" job on the second (or later) cycle?
Has anyone else encountered this?

Our slurm is 19.05.1, with KillWait=30 (we want to keep this above 0),
CompleteWait=0, and the SchedulerFlags (which was changed numerous times in
the past weeks) currently includes:
batch_sched_delay=5
bf_busy_nodes
bf_continue
bf_interval=90
bf_max_job_test=2500
bf_max_job_user_part=30
bf_max_time=270
bf_min_prio_reserve=100
bf_window=30300
bf_yield_interval=500
default_queue_depth=2000
defer
kill_invalid_depend
max_rpc_cnt=150
preempt_strict_order
sched_interval=120
sched_min_interval=100

Thanks in advance,
Yair.


Re: [slurm-users] [External] How to detect Job submission by srun / interactive jobs

2020-05-19 Thread Yair Yarom
Hi,

We have here a job_submit_limit_interactive plugin that limits interactive
jobs and can force a partition for such jobs. It also limits the number of
concurrent interactive jobs per user by using the license system. It's
written in C, so compilation is required. It can be found in:
https://github.com/irush-cs/slurm-plugins

Note that, depending on what exactly you're trying to avoid, users can
somewhat easily circumvent this by e.g. submitting a Jupyter notebook via
sbatch.
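
(For completeness, since the original question mentioned job_submit.lua: a
simpler alternative to the plugin above is a lua sketch like the one below,
which treats submissions without a batch script as interactive. The field
names should be checked against your Slurm version, and the partition name is
a placeholder.)

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- srun/salloc submissions carry no batch script
   if job_desc.script == nil or job_desc.script == '' then
      job_desc.partition = "interactive"
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end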

Best regards,
Yair.


On Mon, May 18, 2020 at 6:25 PM Florian Zillner  wrote:

> Hi Stephan,
>
> From the slurm.conf docs:
> ---
> BatchFlag
> Jobs submitted using the sbatch command have BatchFlag set to 1. Jobs
> submitted using other commands have BatchFlag set to 0.
> ---
> You can look that up e.g. with scontrol show job . I haven't
> checked though how to access that via lua. If you know, let me know, I'd be
> interested as well.
>
> Example:
> # scontrol show job 128922
> JobId=128922 JobName=sleep
>...
>JobState=RUNNING Reason=None Dependency=(null)
>Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>RunTime=00:00:54 TimeLimit=00:30:00 TimeMin=N/A
>
> Cheers,
> Florian
>
> -Original Message-
> From: slurm-users  On Behalf Of
> Stephan Roth
> Sent: Montag, 18. Mai 2020 16:04
> To: slurm-users@lists.schedmd.com
> Subject: [External] [slurm-users] How to detect Job submission by srun /
> interactive jobs
>
> Dear all,
>
> Does anybody know of a way to detect whether a job is submitted with
> srun, preferrably in job_submit.lua?
>
> The goal is to allow interactive jobs only on specific partitions.
>
> Any recommendation or best practice on how to handle interactive jobs is
> welcome.
>
> Thank you,
> Stephan
>
>

-- 

  /|   |
  \/   | Yair Yarom | Senior DevOps Architect
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Job with srun is still RUNNING after node reboot

2020-04-01 Thread Yair Yarom
I've checked it now, it isn't listed as a runaway job.

On Tue, Mar 31, 2020 at 5:24 PM David Rhey  wrote:

> Hi, Yair,
>
> Out of curiosity have you checked to see if this is a runaway job?
>
> David
>
> On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom  wrote:
>
>> Hi,
>>
>> We have an issue where running srun (with --pty zsh), and rebooting the
>> node (from a different shell), the srun reports:
>> srun: error: eio_message_socket_accept:
>> slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received
>> and hangs.
>>
>> After the node boots, the slurm claims that job is still RUNNING, and
>> srun is still alive (but not responsive).
>>
>> I've tried it with various configurations (select/linear,
>> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
>> task/cgroup), with the same results. We're using 19.05.1.
>> Running with sbatch causes the job to be in the more appropriate
>> NODE_FAIL state instead.
>>
>> Anyone else encountered this? or know how to make the job state not
>> RUNNING after it's clearly not running?
>>
>> Thanks in advance,
>> Yair.
>>
>>
>
> --
> David Rhey
> ---
> Advanced Research Computing - Technology Services
> University of Michigan
>


[slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread Yair Yarom
Hi,

We have an issue where, when running srun (with --pty zsh) and rebooting the
node (from a different shell), srun reports:
srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
Zero Bytes were transmitted or received
and hangs.

After the node boots, Slurm claims that the job is still RUNNING, and srun
is still alive (but not responsive).

I've tried it with various configurations (select/linear, select/cons_tres,
jobacct_gather/linux, jobacct_gather/cgroup, task/none, task/cgroup), with
the same results. We're using 19.05.1.
Running with sbatch causes the job to be in the more appropriate NODE_FAIL
state instead.

Anyone else encountered this? or know how to make the job state not RUNNING
after it's clearly not running?

Thanks in advance,
Yair.


Re: [slurm-users] Slurm Perl API use and examples

2020-03-24 Thread Yair Yarom
I also haven't gotten along with the Perl API shipped with Slurm. I got it to
work, but there were things missing.
Currently I have some wrapper functions for most of the Slurm commands, and a
general parsing function for Slurm's common output formats (scontrol,
sacctmgr, etc.).
It's not on CPAN, but you can see it in the cshuji::Slurm module in:
https://github.com/irush-cs/slurm-scripts/

I haven't checked it yet, but now with the slurm rest API, I think getting
the information should be simpler.

HTH,
Yair.


On Mon, Mar 23, 2020 at 10:27 PM Thomas M. Payerle  wrote:

> I was never able to figure out how to use the Perl API shipped with Slurm,
> but instead have written some wrappers around some of the Slurm commands
> for Perl.  My wrappers for the sacctmgr and share commands are available at
> CPAN:
> https://metacpan.org/release/Slurm-Sacctmgr
> https://metacpan.org/release/Slurm-Sshare
> (I have similar wrappers for a few other commands, but have not polished
> enough for CPAN release, but am willing to share if you contact me).
>
> On Mon, Mar 23, 2020 at 3:49 PM Burian, John <
> john.bur...@nationwidechildrens.org> wrote:
>
>> I have some questions about the Slurm Perl API
>> - Is it still actively supported? I see it's still in the source in Git.
>> - Does anyone use it? If so, do you have a pointer to some example code?
>>
>> My immediate question is, for methods that take a data structure as an
>> input argument, how does one define that data structure? In Perl, it's just
>> a hash, am I supposed to populate the keys of the hash by reading the
>> matching C structure in slurm.h? Or do I only need to populate the keys
>> that I care to provide a value for, and Slurm assigns defaults to the other
>> keys/fields? Thanks,
>>
>> --
>> John Burian
>> Senior Systems Programmer, Technical Lead
>> Institutional High Performance Computing
>> Abigail Wexner Research Institute, Nationwide Children’s Hospital
>>
>>
>>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroadspaye...@umd.edu
> 5825 University Research Park   (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
>


Re: [slurm-users] Hybrid compiling options

2020-03-01 Thread Yair Yarom
Hi,

We also have hybrid cluster(s).
We use the same nfsroot for all nodes, so technically everything is
installed everywhere. And we compile slurm once with everything needed.

Users can run "module load cuda" and/or "module load nvidia" to have access
to nvcc and nvidia's libraries (cuda and nvidia are manually installed here
as well), so they can compile gpu code, but it won't run on nodes with no
nvidia hardware.

InfiniBand is the same, though there we don't have hybrid clusters, i.e. one
cluster has IB, and one doesn't. But they all run the same binaries.

HTH,
Yair.



On Sat, Feb 29, 2020 at 5:24 PM  wrote:

> There are GPU plugins that won't be built unless you build on a node that
> has the Nvidia drivers installed.
>
> -Original Message-
> From: slurm-users  On Behalf Of
> Brian Andrus
> Sent: Friday, February 28, 2020 7:36 PM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] Hybrid compiling options
>
> All,
>
> Wanted to reach out for input on how folks compile slurm when you have a
> hybrid cluster.
>
> Scenario:
>
> you have 4 node types:
>
> A) CPU only
> B) GPU Only
> C) CPU+IB
> D) GPU+IB
>
> So, you can compile slurm with/without IB support and/or with/without GPU
> support.
> Including either option creates a dependency when packaging (RPM based).
>
> So, do you compile different versions for the different node types or
> install the dependent packages on nodes that have no user (nvidia in
> particular here)?
>
> Generally, I have always added the superfluous packages, but wondered what
> the thoughts on that are.
>
> Brian Andrus
>
>
>
>
>


Re: [slurm-users] Problem with squeue reporting of GPUs in use

2020-02-25 Thread Yair Yarom
Hi,

I've also encountered this issue of the deprecated %b. I'm currently
parsing the output of "scontrol show jobs -dd" to see what was requested
(and which exact GPUs were allocated).

Hope this helps,
Yair.

On Mon, Feb 24, 2020 at 11:56 PM Venable, Richard (NIH/NHLBI) [E] <
venab...@nhlbi.nih.gov> wrote:

> I’m seeing a problem with GPU usage reporting via squeue in the 19.05.3
> release.
>
>
>
> I’ve been using a custom script to track GPUs in use, and had been relying
> on the ‘%b’ field of squeue -o formatting (which now seems to be
> undocumented) to capture usage requested via --gres option of sbatch.
> Unfortunately, besides apparently being deprecated, ‘%b’ does not report
> usage requested via the new --gpus option.
>
>
>
> I’ve tried several squeue -O option fields, but only ‘tres-alloc’ seems to
> consistently report GPU usage, independent of which sbatch option was used
> for the request.  The ‘tres-per-node’ field only reports usage requested by
> --gres, while ‘tres-per-job’ only reports usage requested by the  --gpus
> option.  Also, the -O formatting doesn’t put a single space between fields,
> a problem for longer job names or usernames, and messes up the field
> parsing of the output when two fields are run together.
>
>
>
> Our users like to know which partition has the most free GPUs, and right
> now my script is broken wrt. usage via the --gpus option.
>
>
>
> If there is no other option, I can probably parse the ‘tres-alloc’ field
> (it has more info than I need), but I’m looking for alternatives, or any
> information that might indicate the ‘tres-*’ fields are more consistent in
> the newer (.4 or .5) SLURM releases.
>
>
>
>
>
> BTW, sreport does a bad job of reporting GPU usage as well, in that the
> GRES/GPU total % for root in the account listing on a given cluster is
> always less than the % allocated in the utilization listing, sometime by a
> substantial amount.  The CPU usage is almost always the same in both
> sreport listings.
>
>
>
>
>
> --
>
> *Rick Venable*
>
> NIH/NHLBI/DIR/BBC
>
> Lab. of Membrane Biophysics MSC 5690
>
> Bldg. 12A Room 3053L
>
> Bethesda, MD  20892-5690   U.S.A.
>
>
>
>
>


Re: [slurm-users] good practices

2019-11-25 Thread Yair Yarom
Hi,

I'm not sure what the queue time limit of 10 hours means exactly. If jobs
can't wait in the queue for more than 10 hours, then it seems very small for
8-hour jobs.
Generally, a few options:
a. The --dependency option (either afterok or singleton)
b. The --array option of sbatch with limit of 1 job at a time (instead of
the for loop): sbatch --array=1-20%1
c. At the end of the script of each job, call the sbatch line of the next
job (this is probably the only option if indeed I understood the queue time
limit correctly).
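
For example, option (b) as a minimal sbatch script (the job step itself is
a placeholder):

    #!/bin/bash
    #SBATCH --time=08:10:00
    #SBATCH --array=1-20%1      # 20 tasks, at most one running at any time
    # each array task starts only once the previous one has freed the %1 slot
    srun ./my_step "${SLURM_ARRAY_TASK_ID}"

(If a failed step should stop the whole chain, the --dependency=afterok
variant from option (a) is the one to use.)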

And indeed, srun should probably be reserved for strictly interactive jobs.

Regards,
Yair.

On Mon, Nov 25, 2019 at 11:21 AM Nigella Sanders 
wrote:

>
> Hi all,
>
> I guess this is a simple matter but I still find it confusing.
>
> I have to run 20 jobs on our supercomputer.
> Each job takes about 8 hours and every one need the previous one to be
> completed.
> The queue time limit for jobs is 10 hours.
>
> So my first approach is serially launching them in a loop using srun:
>
>
> #!/bin/bash
> for i in {1..20};do
>
> srun  --time 08:10:00  [options]
>
> done
>
> However, SLURM literature keeps saying that 'srun' should only be used for
> short command-line tests, so some sysadmins would consider this a bad
> practice (see this).
>
> My second approach switched to sbatch:
>
> #!/bin/bash
> for i in {1..20};do
> sbatch  --time 08:10:00 [options]
>
> [polling the queue to see if the job is done]
> done
>
> But since sbatch returns the prompt, I had to add code to check for job
> termination. Polling makes use of the sleep command and is prone to race
> conditions, so sysadmins don't like it either.
>
> I guess there must be a --wait option in some recent versions of SLURM (see
> this). It's not yet available in our system though.
>
> Is there any prefererable/canonical/friendly way to do this?
> Any thoughts would be really appreciated,
>
> Regards,
> Nigella.
>
>
>


Re: [slurm-users] Environment modules

2019-11-24 Thread Yair Yarom
We also use lmod here. Very useful when different versions are needed or
for any software installations outside the distribution.

However, our environment is heterogenous, and the software modules might
have different versions/paths on different nodes. This creates an issue
when users run 'module load something' on the submission node, and then run
srun/sbatch and the wrong module is loaded (or just the wrong PATH is kept).
To solve this (in an overly complicated manner..) we have a taskprolog
plugin and a spank plugin that: a. resets the modules at submission, and b.
let the user add "--module " to the srun/sbatch so that the
appropriate module will be loaded on the nodes.

On Sat, Nov 23, 2019 at 11:55 AM William Brown 
wrote:

> Agreed, I have just been setting up Lmod on a national compute cluster
> where I am a non-privileged user, and on an internal cluster where I have
> full rights.  It works very well, and Lmod can read the Tcl module files
> also.  The most recent version has some extra features specifically for
> Slurm.  And I use EasyBuild, which saves hundreds of hours of effort.  I do quite
> often have to hand create simple module files for software with no
> EasyConfig but I can just copy the structure from module files created by
> EasyBuild so it has never been a great problem.
>
> The best bit of modules is being able to offer multiple conflicting
> versions of software like Java, Perl, R etc.
>
> William
>
> On Sat, 23 Nov 2019 at 03:57, Chris Samuel  wrote:
>
>> On 22/11/19 9:37 am, Mariano.Maluf wrote:
>>
>> > The cluster is operational but I need to install and configure
>> > environment modules.
>>
>> If you use Easybuild to install your HPC software then it can take care
>> of the modules too for you.  I'd also echo the recommendation from
>> others to use Lmod.
>>
>> Website: https://easybuilders.github.io/easybuild/
>> Documentation: https://easybuild.readthedocs.io/
>>
>> All the best,
>> Chris
>> --
>>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>>

-- 

  /|   |
  \/   | Yair Yarom | Senior DevOps Architect
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Yair Yarom
Hi,

Check the UnkillableStepProgram and UnkillableStepTimeout options in
slurm.conf.
We use it to drain the stuck nodes and mail us - as here, usually stuck
processes will require a reboot. As the drained strigger will never get
triggered, we also set a finished trigger for the next RUNNING job. That
trigger will either send us mail if there are only stuck processes, or
strigger --fini the next RUNNING job.
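
For reference, the relevant slurm.conf bits and the script look roughly
like this (the path, timeout and mail recipient are our choices, not
defaults):

    # slurm.conf
    UnkillableStepProgram=/usr/local/sbin/unkillable-step.sh
    UnkillableStepTimeout=120

    # /usr/local/sbin/unkillable-step.sh (runs on the node with the stuck step)
    #!/bin/bash
    scontrol update nodename="$(hostname -s)" state=drain reason="unkillable step"
    echo "unkillable step on $(hostname -s)" | mail -s "slurm: node drained" root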

Yair.


On Tue, May 28, 2019 at 7:58 PM mercan  wrote:

> Hi;
>
> If you did not use the epilog script, you can set the epilog script to
> clean up all residues from the finished jobs:
>
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
>
> Ahmet M.
>
>
> On 28.05.2019 19:03, Matthew BETTINGER wrote:
> > We use triggers for the obvious alerts, but is there a way to make a
> trigger for nodes stuck in the CG (completing) state?  Some user jobs, mostly
> Julia notebooks, can get hung in the completing state if the user kills the
> running job or cancels it with Ctrl-C.  When this happens we can have many,
> many nodes stuck in CG.  Slurm 17.02.6.  Thanks!
> >
>
>

-- 

  /|   |
  \/   | Yair Yarom | Senior DevOps Architect
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] Setting up a separate timeout for interactive jobs

2018-09-20 Thread Yair Yarom
Hi,

We also have multiple partitions, but in addition we use a job submit
plugin to distinguish between srun/salloc and sbatch submissions. This
plugin forces a specific partition for interactive jobs (and the timelimit
with it) and using the license system it limits the number of simultaneous
interactive jobs per user.

Originally I wrote it because users were running bash for the maximum
allowed time (for no good reason). However, nowadays users are running e.g.
jupyter in sbatch, and the plugin doesn't catch these.

If you want, the plugin's source is in:
https://github.com/irush-cs/slurm-plugins

Yair.

On Wed, Sep 19, 2018 at 8:09 PM, Renfro, Michael  wrote:

> I don’t. If they want to submit a job running ‘bash’ at the same priority
> as a regular batch job shell script, that’s on them. If and when we go to
> an accounting model based off reserved resources and time, it’ll handle
> itself.
>
>
> On Sep 19, 2018, at 11:54 AM, Siddharth Dalmia 
> wrote:
>
> Thanks for your response Mike. I have a follow-up question for this
> approach. How do you restrict someone to start an interactive session on
> the "batch" partition?
>
>
>
>
> On Wed, Sep 19, 2018 at 12:50 PM Renfro, Michael 
> wrote:
>
>> We have multiple partitions using the same nodes. The interactive
>> partition is high priority and limited on time and resources. The batch
>> partition is low priority and has looser time and resource restrictions.
>>
>> And we have a shell function that calls srun --partition=interactive --pty
>> $SHELL to make it easier to submit interactive jobs.
>>
>> --
>> Mike Renfro  / HPC Systems Administrator, Information Technology Services
>> 931 372-3601 / Tennessee Technological University
>>
>> On Sep 19, 2018, at 10:51 AM, Siddharth Dalmia 
>> wrote:
>>
>> Hi all,
>>
>> Is it possible to have a separate timeout for interactive jobs? Or can
>> someone help me come up with a hack to do this?
>>
>> Thanks
>> Sid
>>
>>
>>


Re: [slurm-users] Transparently assign different walltime limit to a group of nodes ?

2018-08-13 Thread Yair Yarom
Hi,

We have a short partition to give a reasonable waiting time for shorter
jobs. We use the job_submit/all_partitions plugin so if a user doesn't
specify a partition, it will add all the partitions.
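
In slurm.conf this boils down to something like (node lists and limits are
invented for the example):

    JobSubmitPlugins=all_partitions
    PartitionName=short Nodes=n[01-10] MaxTime=04:00:00    State=UP
    PartitionName=long  Nodes=n[01-08] MaxTime=14-00:00:00 State=UP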

The downside of the plugin is that if a job is too long for the short
partition (or the job can't run on a partition for some other reason), the
user will get, for example, a "PartitionTimeLimit" or "AccountNotAllowed"
reason instead of "Priority" (though the job will still run eventually).
If that's an issue, writing the above lua plugin might be the way to go.

Regards,
Yair.


On Mon, Aug 13, 2018 at 4:46 PM, Shenglong Wang  wrote:

> Please try to use SLURM Lua plugin, setup two partitions, one for n06-n10
> and one for all nodes, inside SLURM Lua plugs, you can assign jobs to
> different partitions based on requested wall time.
>
> Best,
> Shenglong
>
> On Aug 13, 2018, at 9:44 AM, Cyrus Proctor 
> wrote:
>
> Hi Jens,
>
> Check out https://slurm.schedmd.com/reservations.html specifically the "
> Reservations Floating Through Time" section. In your case, set a walltime
> of 14 days for your partition that contains n[01-10]. Then, create a
> floating reservation on node n[06-10] for n + 1 day where "n" is always
> evaluated as now.
>
> If you wish to allow the user more control, then specify a "Feature" in
> slurm.conf for you nodes. Something like:
> NodeName=n[01-05] Sockets=1 CoresPerSocket=48 ThreadsPerCore=2
> State=UNKNOWN Feature=long
> NodeName=n[06-10] Sockets=1 CoresPerSocket=48 ThreadsPerCore=2
> State=UNKNOWN Feature=short
>
> The feature is an arbitrary string that the admin sets. Then a user could
> specify in their submission as something like:
> sbatch --constraint="long|short" batch.slurm
>
> Best,
> Cyrus
>
> On 08/13/2018 08:28 AM, Loris Bennett wrote:
>
> Hi Jens,
>
> Jens Dreger  
>  writes:
>
>
> Hi everyone!
>
> Is it possible to transparently assign different walltime limits
> to nodes without forcing users to specify partitions when submitting
> jobs?
>
> Example: let's say I have 10 nodes. Nodes n01-n05 should be available
> for jobs with a walltime up to 14 days, while n06-n10 should only
> be used for jobs with a walltime limit less then 1 day. Then as long
> as nodes n06-n10 have free resources, jobs with walltime <1day should
> be scheduled to these nodes. If n06-n10 are full, jobs with walltime
> <1day should start on n01-n05. Users should not have to specify
> partitions.
>
> Would this even be possible to do with just one partition much
> like nodes with different memory size using weights to fill nodes
> with less memoery first?
>
> Background of this question is that it would be helpfull to be able
> to lower the walltime for a rack of nodes, e.g. when adding this rack
> to an existing cluster in order to be able to easily shut down just
> this rack after one day in case of instabilities. Much like adding
> N nodes to a cluster without changing anything else and have only
> jobs with walltime <1day on thiese nodes in the beginning.
>
> If you just want to reduce the allowed wall-time for a given rack, can't
> you just use a maintenance reservation for the appropriate set of nodes?
>
> Cheers,
>
> Loris
>
>
>
>
>


Re: [slurm-users] SLURM PAM support?

2018-06-18 Thread Yair Yarom
Hi,

We encountered this issue some time ago (see:
https://www.mail-archive.com/slurm-dev@schedmd.com/msg06628.html). You
need to add pam_systemd to the slurm pam file, but pam_systemd will
try to take over Slurm's cgroups. Our current solution is to add
pam_systemd to the slurm pam file, and in addition to save/restore the
slurm cgroup locations. It's not pretty, but for now it works...

If you don't constrain the devices (i.e. don't have GPUs), you
probably can do without the pam_exec script and use the pam_systemd
normally.
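
The PAM side of it is small; the service-file name and module order are
distribution-dependent, but the essential line is just:

    # in the PAM service file used by slurmd (Debian here)
    session    optional    pam_systemd.so

plus, in our case, a pam_exec hook for the cgroup save/restore (that's the
script linked below).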

We're using debian, but the basics should be the same. I've placed the
script in github, if you want to try it:
https://github.com/irush-cs/slurm-scripts

Yair.


On Mon, Jun 18, 2018 at 3:33 PM, John Hearns  wrote:
> Your problem is that you are listening to Lennart Poettering...
> I cannot answer your question directly. However I am doing work at the
> moment with PAM and sssd.
> Have a look at the directory which contains the unit files. Go to
> /lib/systemd/system
> See that nice file named -.slice   Yes, that file is absolutely needed, it
> is not line noise.
> Now try to grep on the files in that directory, since you might want to
> create a new systemd unit file based on an existing one.
>
> Yes, a regexp guru will point out that this is trivial. But to me creating
> files that look like -.slice is putting your head in the lion's mouth.
>
>
>
>
>
> On 18 June 2018 at 14:15, Maik Schmidt  wrote:
>>
>> Hi,
>>
>> we're currently in the process of migrating from RHEL6 to 7, which also
>> brings us the benefit of having systemd. However, we are observing problems
>> with user applications that use e.g. XDG_RUNTIME_DIR, because SLURM
>> apparently does not really run the user application through the PAM stack.
>> The consequence is that SLURM jobs inherit the XDG_* environment variables
>> from the login nodes (where sshd properly sets it up), but on the compute
>> nodes, /run/user/$uid does not exist, leading to errors whenever a user
>> application tries to access it.
>>
>> We have tried setting UsePam=1, but that did not help.
>>
>> I have found the following issue on the systemd project regarding exactly
>> this problem: https://github.com/systemd/systemd/issues/3355
>>
>> There, Lennart Poettering argues that it should be the responsibility of
>> the scheduler software (i.e. SLURM) to run user code only within a proper
>> PAM session.
>>
>> My question: does SLURM support this? If yes, how?
>>
>> If not, what are best practices to circumvent this problem on
>> RHEL7/systemd installations? Surely other clusters must have already had the
>> same issue...
>>
>> Thanks in advance.
>>
>> --
>> Maik Schmidt
>> HPC Services
>>
>> Technische Universität Dresden
>> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
>> Willers-Bau A116
>> D-01062 Dresden
>> Telefon: +49 351 463-32836
>>
>>
>



Re: [slurm-users] run bash script in spank plugin

2018-06-05 Thread Yair Yarom
I'm also in favor of epilog scripts, though it really depends on what
you are eventually trying to achieve.

Also, I'm not sure I understand what you meant by the slurm job
sleeping for 6 seconds and rebooting. You did want it to reboot, no?
The 4 "missing" seconds might be the time difference between the spank
and the job starting times.

If you still require the spank plugin and it doesn't work, I can
suggest trying to remove the '&' from the exec, and the wait() from
the end, though I'm not sure it'll matter. You could monitor the node
directly and see if the spank plugin script is inside a slurm
controlled cgroup. In which case, you'd have to move it out of it.


On Mon, Jun 4, 2018 at 6:37 PM, Brian Andrus  wrote:
> Seems like there are better approaches.
>
> In this situation, I would use an epilogue script and give sudo access to
> the script. Check out https://slurm.schedmd.com/prolog_epilog.html
>
> That would likely be much easier and fit into the methodology slurm uses.
>
> Brian Andrus
> Firstspot, Inc.
>
>
> On 6/4/2018 8:11 AM, Tueur Volvo wrote:
>
> I would like to run a bash script or binary executable as root (even if the
> user who started the job doesn't have root rights) at the end of a job if I
> put an option in my spank plugin
>
> 2018-06-04 16:36 GMT+02:00 John Hearns :
>>
>> That kinnddd  of...  defeats...  the purpose  of a job
>> scheduler.
>> I am very sure that you know why you need this and you have a good reason
>> for doing it.  Over to others on the list, sorry.
>>
>> On 4 June 2018 at 16:15, Tueur Volvo  wrote:
>>>
>>> no I don't have dependency treated.
>>>
>>> during the job, I would like to run a program on the machine running the
>>> job
>>> but I'd like the program to keep running even after the job ends.
>>>
>>> 2018-06-04 15:30 GMT+02:00 John Hearns :
>>>>
>>>> Tueur what are you trying to achieve here?  The example you give is
>>>> touch /tmp/newfile.txt'
>>>> I think you are trying to send a signal to another process. Could this
>>>> be 'Hey - the job has finished and there is a new file for you to process'
>>>> If that is so, there may be better ways to do this. If you have a
>>>> post-processing step, then you can submit a job which depends on the main
>>>> job.
>>>> https://hpc.nih.gov/docs/job_dependencies.html
>>>>
>>>> On 4 June 2018 at 15:20, Tueur Volvo  wrote:
>>>>>
>>>>> thanks for your answer, I tried some solutions but they don't work
>>>>>
>>>>> I tried to add setsid and setpgrp to isolate my new process, but the slurm
>>>>> job sleeps 6 seconds and then reboots my machine (I tested with the reboot
>>>>> command, but it could be any other bash command, it's just an example)
>>>>>
>>>>> pid_t cpid; //process id's and process groups
>>>>>
>>>>> cpid = fork();
>>>>>
>>>>> if( cpid == 0 ){
>>>>> setsid();
>>>>> setpgrp();
>>>>> execl("/bin/sh", "sh", "-c", "sleep 10; reboot1&", NULL);
>>>>>
>>>>> }
>>>>> wait(NULL);
>>>>>
>>>>>
>>>>> maybe i have a error in my code ?
>>>>>
>>>>> 2018-05-31 9:37 GMT+02:00 Yair Yarom :
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm not sure how slurm/spank handles child processes but this might be
>>>>>> intentional. So there might be some issues if this were to work.
>>>>>>
>>>>>> You can try instead of calling system(), to use fork() + exec(). If
>>>>>> that still doesn't work, try calling setsid() before the exec(). I can
>>>>>> think of situations where your process might still get killed, e.g. if
>>>>>> slurm (or even systemd) kills all subprocesses of the "job", by
>>>>>> looking at the cgroup. If that's the case, you'll need to move it to
>>>>>> another cgroup in addition/instead of setsid().
>>>>>>
>>>>>> Yair.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 30, 2018 at 5:16 PM, Tueur Volvo 
>>>>>> wrote:
>>>>>> > Hello i have question, how run in background bash script in spank
>>>>>> > plug

Re: [slurm-users] run bash script in spank plugin

2018-05-31 Thread Yair Yarom
Hi,

I'm not sure how slurm/spank handles child processes but this might be
intentional. So there might be some issues if this were to work.

You can try instead of calling system(), to use fork() + exec(). If
that still doesn't work, try calling setsid() before the exec(). I can
think of situations where your process might still get killed, e.g. if
slurm (or even systemd) kills all subprocesses of the "job", by
looking at the cgroup. If that's the case, you'll need to move it to
another cgroup in addition/instead of setsid().
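
As a shell-level illustration of the same idea (inside a spank plugin this
would be fork() + setsid() + exec() in C): the setsid(1) utility detaches
the helper from the step's session, though it still sits in the job's
cgroup unless you also move it out:

    setsid sh -c 'sleep 10; touch /tmp/newfile.txt' </dev/null >/dev/null 2>&1 &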

Yair.



On Wed, May 30, 2018 at 5:16 PM, Tueur Volvo  wrote:
> Hello, I have a question: how do I run a bash script in the background in a spank plugin?
>
> in my spank plugin in function : slurm_spank_task_init_privileged
>
> i want to run this script :
>
> system("nohup bash -c 'sleep 10 ; touch /tmp/newfile.txt' &");
>
> I want to run this bash script in an independent process; I don't want to wait 10
> seconds in my slurm plugin
>
> i have this code :
> int slurm_spank_task_init_privileged (spank_t sp, int ac, char **av) {
>
> system("nohup bash -c 'sleep 10 ; touch /tmp/newfile.txt' &");
>
> return 0;
>
> }
>
> actually it does not work; when slurm finishes running my job, it kills my nohup
> command
>
> if I add sleep(12) to my C code, my bash script works
>
>
> int slurm_spank_task_init_privileged (spank_t sp, int ac, char **av) {
>
> system("nohup bash -c 'sleep 10 ; touch /tmp/newfile.txt' &");
>
> sleep(12);
>
> return 0;
>
> }
>
> but I don't want to wait, I want to run my bash script in an independent
> process
>
> thanks in advance for your help
>
>
>
>



Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Yair Yarom
Hi,

This is what we did, not sure those are the best solutions :)

## Queue stuffing

We have set PriorityWeightAge several magnitudes lower than
PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the age
factor of older jobs. As I see it, fairshare is far more important than age.
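
Concretely, something along these lines (the numbers are only an
illustration, not a recommendation):

    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000
    PriorityWeightAge=100
    PriorityMaxAge=7-0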

Besides the MaxJobs that was suggested, we are considering setting up
maximum allowed TRES resources, and not number of jobs. Otherwise a
user can have a single job that takes the entire cluster, and inside
split it up the way he wants to. As mentioned earlier, it will create
an issue where jobs are pending and there are idle resources, but for
that we have a special preempt-able "requeue" account/qos which users
can use but the jobs there will be killed when "real" jobs arrive.

## Interactive job availability

We have two partitions: short and long. They are indeed fixed where
the short is on 100% of the cluster and the long is about 50%-80% of
the cluster (depending on the cluster).



Re: [slurm-users] Limit job_submit.lua script for only srun

2018-04-26 Thread Yair Yarom
Hi,

We are also limiting "interactive" jobs through a plugin. What I've
found is that in the job_descriptor the following holds:
for salloc: argc = 0, script = NULL
for srun: argc > 0, script = NULL
for sbatch: argc = 0, script != NULL

You can look at our plugin in
https://github.com/irush-cs/slurm-plugins for reference (though it's
in c).

I'll just add that users can overcome these limitations in various ways...

Regards,
Yair.




On Wed, Apr 25, 2018 at 4:26 PM, sysadmin.caos  wrote:
> Hello,
>
> I have written my own job_submit.lua script for limiting "srun" executions
> to one processor, one task and one node. If I test it with "srun", all works
> fine. However, if I now try to run an sbatch job with "-N 12" or "-n 2",
> job_submit.lua is also checked and my job is rejected because I'm
> requesting more than one task and more than one node. So, is it possible
> for the lua script to be active only when the user runs "srun" and not "sbatch"? I
> have been reading "typedef struct job_descriptor" in slurm/slurm.h, but
> there is no field that records the command run by the user on the command line.
>
> Thanks.
>



[slurm-users] Should I join the federation?

2018-02-12 Thread Yair Yarom

Hi all,

I was wondering if any of you can share your insights regarding
federations. What unexpected caveats have you encountered?

We have here about 15 "small" clusters (due to political and
technical reasons), and most users have access to more than one
cluster. Federation seems like a good solution instead of users running
between clusters searching for available resources (we'll probably have
2-4 federations...).

I would also want to have a single submission node, but then users will
still need to select a cluster (we have an lmod module to select a
cluster by setting PATH and SLURM_CONF). The solution I've come up with is to
create a dummy cluster with a lot of drained resources. But this seems
like a not-so-good solution and might confuse users with always-pending
jobs, and will not work with array jobs.

Also, is there a way to set things up such that by default jobs will be submitted
to the current cluster instead of the federation (i.e. -M  by
default)? I guess this can be done by a plugin (can it? or does it run
after the sibling submissions?), but I was wondering if there's already
a solution.

Last question :), are there any issues with plugins? i.e. we have
different plugins for different clusters, if they change some of the job
parameters, should I be worried about plugins from the origin
cluster or from the sibling cluster? Will the job have several plugins
from several clusters activated on it?

Thanks in advance for any advice,
Yair.



[slurm-users] GrpTRES value changes on upgrade from 17.02.1 to 17.11.2

2018-01-28 Thread Yair Yarom

Hi,

We have a license, limited using the GrpTRES of an association (this is
a "license/interactive" for https://github.com/irush-cs/slurm-plugins/).

On upgrade to 17.11.2, I've noticed that all our "license/interactive"
GrpTRES were changed to "billing". Judging by the current tres_table
and our pre-upgrade mysql dump, it appears the type/name was changed but
not all (or any of) the associations using it, so they were still using
TRES id 5, i.e. "billing".

I'm not sure there's currently anything to do (I've simply updated all
our associations), but maybe other upgraders should take notice.
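
For anyone hitting the same thing, the per-association fix is a one-liner
per account (the account, cluster and limit value here are made up):

    sacctmgr modify account where name=mylab cluster=mycluster \
        set GrpTRES=license/interactive=2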

BR,
Yair.

Current sql:
mysql> select * from tres_table;
+---+-+--++-+
| creation_time | deleted | id   | type   | name|
+---+-+--++-+
|1452081576 |   0 |1 | cpu| |
|1452081576 |   0 |2 | mem| |
|1452081576 |   0 |3 | energy | |
|1452081576 |   0 |4 | node   | |
|1517148948 |   0 |5 | billing| |
|1517148948 |   1 | 1000 | dynamic_offset | |
|1478703602 |   0 | 1001 | license| interactive |
|1481124460 |   0 | 1002 | gres   | gpu |
+---+-+--++-+
8 rows in set (0.00 sec)

From the dump:

LOCK TABLES `tres_table` WRITE;
/*!40000 ALTER TABLE `tres_table` DISABLE KEYS */;
INSERT INTO `tres_table` VALUES 
(1452081576,0,1,'cpu',''),(1452081576,0,2,'mem',''),(1452081576,0,3,'energy',''),(1452081576,0,4,'node',''),(1478703602,0,5,'license','interactive'),(1481124460,0,6,'gres','gpu');
/*!40000 ALTER TABLE `tres_table` ENABLE KEYS */;
UNLOCK TABLES;



Re: [slurm-users] Mixed x86 and ARM cluster

2018-01-07 Thread Yair Yarom

Hi,

We have here a Linux x86 submission node for power8 compute nodes,
where the slurmctld and slurmdbd are running on an altogether different
freebsd x86 machine. So yes, it should work :)

Just make sure all the daemons are the same version, and take note of
where the monitoring and maintenance scripts and the plugins are
running (i.e. with slurmd or slurmctld), so that they will be available
and compatible with the proper architecture.

HTH,
Yair.

On Sun, Jan 07 2018, Steve Caruso  wrote:

> Can slurm run on an x86 server and submit and manage jobs on ARM-based compute
> nodes?
>
> TIA,
> Steve
>
> Sent from Yahoo Mail for iPhone



Re: [slurm-users] lmod and slurm

2017-12-20 Thread Yair Yarom

Thank you all for your advises and insights.

I understand that a fair portion of my time is spent on helping the
users. However, in cases where the error repeats and I need to re-explain
it to a different user each time - I tend to believe there's something
wrong with the system configuration. And it's more fun writing plugins
than explaining the same point over and over again ;)

This specific issue is a very subtle point in the documentation; new
users won't pay attention to it (or understand it), and not-so-new users
won't read it again. So the documentation isn't really helpful.

As such I do want the system to force proper usage as much as possible,
and I prefer the system not working for them instead of seemingly
working but somewhat flawed.

For future reference (if anyone else wants to overly complicate his
system), the plugin I'm currently testing is in
https://github.com/irush-cs/slurm-plugins/ - spank_lmod and
TaskProlog-lmod

Thanks again,
Yair.

On Tue, Dec 19 2017, Gerry Creager - NOAA Affiliate <gerry.crea...@noaa.gov> 
wrote:

> I have to echo Loris' comments. My users tend to experiment, and a fair 
> portion
> of my time is spent helping them correct errors they've inflicted upon
> themselves. I tend to provide guides for configuring and running our more 
> usual
> applications, and then when they fail, I review the guidance with them in my
> office. 
>
> Some of my bigger nightmares begin with one of my truly talented users trying
> something because the procedure he's trying is "just like" what he did on
> another, very different system. Followed closely with "Well it SHOULD work 
> this
> way". We then spend some quality time going over how things really work, and 
> he
> goes away a bit happier, and wiser.
>
> Plan to work with your users and be prepared to train them on nuance. 
>
> Gerry
>
> On Tue, Dec 19, 2017 at 9:33 AM, Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> Yair Yarom <ir...@cs.huji.ac.il> writes:
> 
> > There are two issues:
> >
> > 1. For the manually loaded modules by users, we can (and are)
> > instructing them to load the modules within their sbatch scripts. The
> > problem is that not all users read the documentation properly, so in
> > the tensorflow example, they use the cpu version of tensorflow
> > (available on the submission node) instead of the gpu version
> > (available on the execution node). Their program works, but slowly,
> > and some of them simply accept it without knowing there's a problem.
> 
> To me, this is just what users do. They make mistakes, not just with
> loading modules, their programs run badly, so I have to tell them what
> they are doing wrong and point them to the documentation. You obviously
> need some sort of monitoring to help you spot the poorly configured jobs.
> 
> > 2. We have modules which we want to be loaded by default, without
> > telling users to load them. These are mostly for programs used by all
> > users and for some settings we want to be set by default (and may be
> > different per host). Letting users call 'module purge' or
> > "--export=NONE" will unload the default modules as well.
> 
> I'm not sure how you want to prevent users from doing 'module purge' at
> a point which will upset the environment you are trying to set up for 
> them.
> 
> > So I basically want to force modules to be unloaded for all jobs - to
> > solve issue 1, while allowing modules to be loaded "automatically" by
> > the system or user - for issue 2.
> 
> There may well be a technical solution to your problem such that
> everything works as it should without the users having to know what is
> going on. However, my approach would be to use a submit plugin to
> reject some badly configured jobs and/or set defaults such that badly
> configured jobs fail quickly. In my experience, if users' jobs fail
> straight away, they mainly learn to do the right thing fairly fast and
> without getting frustrated, provided they get enough support. However,
> your users may be different, so YMMV.
> 
> Cheers,
> 
> Loris
>     
>     
> 
> 
> > Thanks,
> > Yair.
> >
> >
> > On Tue, Dec 19 2017, Jeffrey Frey <f...@udel.edu> wrote:
> >
> >> Don't propagate the submission environment:
> >>
> >> srun --export=NONE myprogram
> >>
> >>
> >>
> >>> On Dec 19, 2017, at 8:37 AM,

Re: [slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom

There are two issues:

1. For the manually loaded modules by users, we can (and are)
   instructing them to load the modules within their sbatch scripts. The
   problem is that not all users read the documentation properly, so in
   the tensorflow example, they use the cpu version of tensorflow
   (available on the submission node) instead of the gpu version
   (available on the execution node). Their program works, but slowly,
   and some of them simply accept it without knowing there's a problem.

2. We have modules which we want to be loaded by default, without
   telling users to load them. These are mostly for programs used by all
   users and for some settings we want to be set by default (and may be
   different per host). Letting users call 'module purge' or
   "--export=NONE" will unload the default modules as well.

So I basically want to force modules to be unloaded for all jobs - to
solve issue 1, while allowing modules to be loaded "automatically" by
the system or user - for issue 2. 

Thanks,
Yair.


On Tue, Dec 19 2017, Jeffrey Frey <f...@udel.edu> wrote:

> Don't propagate the submission environment:
>
> srun --export=NONE myprogram
>
>
>
>> On Dec 19, 2017, at 8:37 AM, Yair Yarom <ir...@cs.huji.ac.il> wrote:
>> 
>> 
>> Thanks for your reply,
>> 
>> The problem is that users are running on the submission node e.g.
>> 
>> module load tensorflow
>> srun myprogram
>> 
>> So they get the tensorflow version (and PATH/PYTHONPATH) of the
>> submission node's version of tensorflow (and any additional default
>> modules).
>> 
>> There is never a chance to run the "module add ${SLURM_CONSTRAINT}" or
>> remove the unwanted modules that were loaded (maybe automatically) on
>> the submission node and aren't working on the execution node.
>> 
>> Thanks,
>>Yair.
>> 
>> On Tue, Dec 19 2017, "Loris Bennett" <loris.benn...@fu-berlin.de> wrote:
>> 
>>> Hi Yair,
>>> 
>>> Yair Yarom <ir...@cs.huji.ac.il> writes:
>>> 
>>>> Hi list,
>>>> 
>>>> We use here lmod[1] for some software/version management. There are two
>>>> issues encountered (so far):
>>>> 
>>>> 1. The submission node can have different software than the execution
>>>>   nodes - different cpu, different gpu (if any), infiniband, etc. When
>>>>   a user runs 'module load something' on the submission node, it will
>>>>   pass the wrong environment to the task in the execution
>>>>   node. e.g. "module load tensorflow" can load a different version
>>>>   depending on the nodes.
>>>> 
>>>> 2. There are some modules we want to load by default, and again this can
>>>>   be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>>>>   and ~/.lmodrc).
>>>> 
>>>> For issue 1, we instruct users to run the "module load" in their batch
>>>> script and not before running sbatch, but issue 2 is more problematic.
>>>> 
>>>> My current solution is to write a TaskProlog script that runs "module
>>>> purge" and "module load" and export/unset the changed environment
>>>> variables. I was wondering if anyone encountered this issue and have a
>>>> less cumbersome solution.
>>>> 
>>>> Thanks in advance,
>>>>Yair.
>>>> 
>>>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>>> 
>>> I don't fully understand your use-case, but, assuming you can divide
>>> your nodes up by some feature, could you define a module per feature
>>> which just loads the specific modules needed for that category, e.g. in
>>> the batch file you would have
>>> 
>>>   #SBATCH --constraint=shiny_and_new
>>> 
>>>   module add ${SLURM_CONSTRAINT}
>>> 
>>> and would have a module file 'shiny_and_new', with contents like, say,
>>> 
>>>  module add tensorflow/2.0
>>>  module add cuda/9.0
>>> 
>>> whereas the module 'rusty_and_old' would contain
>>> 
>>>  module add tensorflow/0.1
>>>  module add cuda/0.2
>>> 
>>> Would that help?
>>> 
>>> Cheers,
>>> 
>>> Loris
>> 
>
>
> ::
> Jeffrey T. Frey, Ph.D.
> Systems Programmer V / HPC Management
> Network & Systems Services / College of Engineering
> University of Delaware, Newark DE  19716
> Office: (302) 831-6034  Mobile: (302) 419-4976
> ::



Re: [slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom

Thanks for your reply,

The problem is that users are running on the submission node e.g.

module load tensorflow
srun myprogram

So they get the tensorflow version (and PATH/PYTHONPATH) of the
submission node's version of tensorflow (and any additional default
modules).

There is never a chance to run the "module add ${SLURM_CONSTRAINT}" or
remove the unwanted modules that were loaded (maybe automatically) on
the submission node and aren't working on the execution node.

Thanks,
Yair.

On Tue, Dec 19 2017, "Loris Bennett" <loris.benn...@fu-berlin.de> wrote:

> Hi Yair,
>
> Yair Yarom <ir...@cs.huji.ac.il> writes:
>
>> Hi list,
>>
>> We use here lmod[1] for some software/version management. There are two
>> issues encountered (so far):
>>
>> 1. The submission node can have different software than the execution
>>nodes - different cpu, different gpu (if any), infiniband, etc. When
>>a user runs 'module load something' on the submission node, it will
>>pass the wrong environment to the task in the execution
>>node. e.g. "module load tensorflow" can load a different version
>>depending on the nodes.
>>
>> 2. There are some modules we want to load by default, and again this can
>>be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>>and ~/.lmodrc).
>>
>> For issue 1, we instruct users to run the "module load" in their batch
>> script and not before running sbatch, but issue 2 is more problematic.
>>
>> My current solution is to write a TaskProlog script that runs "module
>> purge" and "module load" and export/unset the changed environment
>> variables. I was wondering if anyone encountered this issue and have a
>> less cumbersome solution.
>>
>> Thanks in advance,
>> Yair.
>>
>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>
> I don't fully understand your use-case, but, assuming you can divide
> your nodes up by some feature, could you define a module per feature
> which just loads the specific modules needed for that category, e.g. in
> the batch file you would have
>
>#SBATCH --constraint=shiny_and_new
>
>module add ${SLURM_CONSTRAINT}
>
> and would have a module file 'shiny_and_new', with contents like, say,
>
>   module add tensorflow/2.0
>   module add cuda/9.0
>
> whereas the module 'rusty_and_old' would contain
>
>   module add tensorflow/0.1
>   module add cuda/0.2
>
> Would that help?
>
> Cheers,
>
> Loris



[slurm-users] lmod and slurm

2017-12-19 Thread Yair Yarom

Hi list,

We use here lmod[1] for some software/version management. There are two
issues encountered (so far):

1. The submission node can have different software than the execution
   nodes - different cpu, different gpu (if any), infiniband, etc. When
   a user runs 'module load something' on the submission node, it will
   pass the wrong environment to the task in the execution
   node. e.g. "module load tensorflow" can load a different version
   depending on the nodes.

2. There are some modules we want to load by default, and again this can
   be different between nodes (we do this by source'ing /etc/lmod/lmodrc
   and ~/.lmodrc).

For issue 1, we instruct users to run the "module load" in their batch
script and not before running sbatch, but issue 2 is more problematic.

My current solution is to write a TaskProlog script that runs "module
purge" and "module load" and export/unset the changed environment
variables. I was wondering if anyone encountered this issue and have a
less cumbersome solution.

Thanks in advance,
Yair.

[1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod