[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Fulcomer, Samuel via slurm-users
We'd bumped ours up for a while 20+ years ago when we had a flaky
network connection between two buildings holding our compute nodes. If you
need more than 600s you have networking problems.
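
For reference, the change itself is just one slurm.conf line (the value here is
only an example); keep slurm.conf in sync across the nodes and run "scontrol
reconfigure" on the controller afterwards:

SlurmdTimeout=600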

On Mon, Feb 12, 2024 at 5:41 PM Timony, Mick via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> We set SlurmdTimeout=600. The docs say not to go any higher than 65533
> seconds:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
>
> The FAQ has info about SlurmdTimeout also. The worst thing that could
> happen is that it will take longer to mark nodes as down:
> >A node is set DOWN when the slurmd daemon on it stops responding for
> SlurmdTimeout as defined in slurm.conf.
>
> https://slurm.schedmd.com/faq.html
>
> I wouldn't set it too high, but what counts as too high vs. too low will vary
> from site to site, depending on how busy your controllers and your network are.
>
> Regards
> --Mick
> --
> *From:* Bjørn-Helge Mevik via slurm-users 
> *Sent:* Monday, February 12, 2024 7:16 AM
> *To:* slurm-us...@schedmd.com 
> *Subject:* [slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds
>
> We've been running one cluster with SlurmdTimeout = 1200 sec for a
> couple of years now, and I haven't seen any problems due to that.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Fulcomer, Samuel
...and... using the same cluster name is important in our scenario for the
seamless slurmdbd upgrade transition.

In thinking about it a bit more, I'm not sure I'd want to fold together
production and test/dev configs in the same revision control tree. We keep
them separate. There's no reason to baroquify it.

On Wed, Jan 4, 2023 at 1:54 PM Fulcomer, Samuel 
wrote:

> Just make the cluster names the same, with different Nodename and
> Partition lines. The rest of slurm.conf can be the same. Having two cluster
> names is only necessary if you're running production in a multi-cluster
> configuration.
>
> Our model has been to have a production cluster and a test cluster which
> becomes the production cluster at yearly upgrade time (for us, next week).
> The test cluster is also used for rebuilding MPI prior to the upgrade, when
> the PMI changes. We force users to resubmit jobs at upgrade time (after the
> maintenance reservation) to ensure that MPI runs correctly.
>
>
>
> On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob  wrote:
>
>> We currently have a test cluster and a production cluster, all on the
>> same network.  We try things on the test cluster, and then we gather those
>> changes and make a change to the production cluster.  We're doing that
>> through two different repos, but we'd like to have a single repo to make
>> the transition from testing configs to publishing them more seamless.  The
>> problem is, of course, that the test cluster and production clusters have
>> different cluster names, as well as different nodes within them.
>>
>> Using the include directive, I can pull all of the NodeName lines out of
>> slurm.conf and put them into %c-nodes.conf files, one for production, one
>> for test.  That still leaves me with two problems:
>>
>>- The clustername itself will still be a problem.  I WANT the same
>>slurm.conf file between test and production...but the clustername line 
>> will
>>be different for them both.  Can I use an env var in that cluster name,
>>because on production there could be a different env var value than on 
>> test?
>>- The gres.conf file.  I tried using the same "include" trick that
>>works on slurm.conf, but it failed because it did not know what the
>>"ClusterName" was.  I think that means that either it doesn't work for
>>anything other than slurm.conf, or that the clustername will have to be
>>defined in gres.conf as well?
>>
>> Any other suggestions of how to keep our slurm files in a single source
>> control repo, but still have the flexibility to have them run elegantly on
>> either test or production systems?
>>
>> Thanks.
>>
>>


Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Fulcomer, Samuel
Just make the cluster names the same, with different Nodename and Partition
lines. The rest of slurm.conf can be the same. Having two cluster names is
only necessary if you're running production in a multi-cluster
configuration.

Our model has been to have a production cluster and a test cluster which
becomes the production cluster at yearly upgrade time (for us, next week).
The test cluster is also used for rebuilding MPI prior to the upgrade, when
the PMI changes. We force users to resubmit jobs at upgrade time (after the
maintenance reservation) to ensure that MPI runs correctly.
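
A minimal sketch of what that can look like in one repo (file names and the
per-environment mechanism here are just one way to do it):

# slurm.conf -- identical on test and production
ClusterName=ourcluster
...
# NodeName= and PartitionName= lines live in a per-environment file
Include /etc/slurm/local-nodes.conf

where /etc/slurm/local-nodes.conf is the only file that differs between the two
environments (deployed or symlinked from, e.g., nodes-test.conf vs.
nodes-prod.conf in the same repo).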



On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob  wrote:

> We currently have a test cluster and a production cluster, all on the same
> network.  We try things on the test cluster, and then we gather those
> changes and make a change to the production cluster.  We're doing that
> through two different repos, but we'd like to have a single repo to make
> the transition from testing configs to publishing them more seamless.  The
> problem is, of course, that the test cluster and production clusters have
> different cluster names, as well as different nodes within them.
>
> Using the include directive, I can pull all of the NodeName lines out of
> slurm.conf and put them into %c-nodes.conf files, one for production, one
> for test.  That still leaves me with two problems:
>
>- The clustername itself will still be a problem.  I WANT the same
>slurm.conf file between test and production...but the clustername line will
>be different for them both.  Can I use an env var in that cluster name,
>because on production there could be a different env var value than on 
> test?
>- The gres.conf file.  I tried using the same "include" trick that
>works on slurm.conf, but it failed because it did not know what the
>"ClusterName" was.  I think that means that either it doesn't work for
>anything other than slurm.conf, or that the clustername will have to be
>defined in gres.conf as well?
>
> Any other suggestions of how to keep our slurm files in a single source
> control repo, but still have the flexibility to have them run elegantly on
> either test or production systems?
>
> Thanks.
>
>


Re: [slurm-users] Dell <> GPU compatibility matrix?

2022-10-27 Thread Fulcomer, Samuel
The NVIDIA A10 would probably work. Check the Dell specs for card lengths
that it can accommodate. It's also passively cooled, so you'd need to
ensure that there's good airflow through the card. The proof would be
installing a card, and watching the temp when you run apps on it. It's
150W, so not that hot.
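
For the temperature check, something like this (a sketch) works while the node
is under load:

nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw --format=csv -l 5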



On Thu, Oct 27, 2022 at 11:03 AM Chip Seraphine 
wrote:

> We have a cluster of 1U dells (R640s and R650s) and we’ve been asked to
> install GPUs in them, specifically NVIDIA Teslas with at least 24GB RAM, so
> I’m trying to select the right card.  In the past I’ve used Tesla T4s on
> similar hardware, but those are limited to 16GB.   I know most of the
> really high-end GPUs won’t physically fit in a 1U server.
>
> Does anyone know of a resource that will tell me which models of GPUS
> (specifically Teslas) do/do-not fit in various Dell boxes?   It seems like
> both Nvidia and Dell would be motivated to provide such a compatability
> list but so far I’ve been unable to find one for this sort of enterprise
> equipment (although they abound for desktops and consumer cards).
>
>
> --
>
> Chip Seraphine
>
> This e-mail and any attachments may contain information that is
> confidential and proprietary and otherwise protected from disclosure. If
> you are not the intended recipient of this e-mail, do not read, duplicate
> or redistribute it by any means. Please immediately delete it and any
> attachments and notify the sender that you have received it by mistake.
> Unintended recipients are prohibited from taking action on the basis of
> information in this e-mail or any attachments. The DRW Companies make no
> representations that this e-mail or any attachments are free of computer
> viruses or other defects.
>


Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Fulcomer, Samuel
Hi Byron,

We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:

net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192


# prevent neighbour (aka ARP) table overflow...

net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.default.mcast_solicit = 9
net.ipv4.neigh.default.ucast_solicit = 9
net.ipv4.neigh.default.gc_stale_time = 86400
net.ipv4.neigh.eth0.mcast_solicit = 9
net.ipv4.neigh.eth0.ucast_solicit = 9
net.ipv4.neigh.eth0.gc_stale_time = 86400

# enable selective ack algorithm
net.ipv4.tcp_sack = 1

# workaround TIME_WAIT
net.ipv4.tcp_tw_reuse = 1
# and since all traffic is local
net.ipv4.tcp_fin_timeout = 20


We have a 16-bit (/16) cluster network, so the ARP settings date to that.
tcp_sack is more of a legacy setting from when some kernels didn't set it.

You likely would see tons of connections in TIME_WAIT if you ran "netstat
-a" during periods when you're seeing the hangs. Our workaround settings
have seemed to mitigate that.
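
A quick way to check (sketch; either ss or netstat works), and to apply the
sysctl.conf changes without a reboot:

# count sockets per TCP state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
# apply edited /etc/sysctl.conf
sysctl -p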



On Thu, Jul 28, 2022 at 9:29 AM byron  wrote:

> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
> (3 times in 2 months) have slurmctld hanging so we get the following
> message when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day), although
> that could be a red herring.
>
> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
>
> Thanks
>
>


Re: [slurm-users] How to open a slurm support case

2022-03-24 Thread Fulcomer, Samuel
...it is a bit arcane, but it's not like we're funding lavish
lifestyles with our support payments. I would prefer to see a slightly more
differentiated support system, but this suffices...

On Thu, Mar 24, 2022 at 6:06 PM Sean Crosby  wrote:

> Hi Jeff,
>
> The support system is here - https://bugs.schedmd.com/
>
> Create an account, log in, and when creating a request, select your site
> from the Site selection box.
>
> Sean
> --
> *From:* slurm-users  on behalf of
> Jeffrey R. Lang 
> *Sent:* Friday, 25 March 2022 08:48
> *To:* slurm-users@lists.schedmd.com 
> *Subject:* [EXT] [slurm-users] How to open a slurm support case
>
> * External email: Please exercise caution *
> --
>
> Can someone provide me with instructions on how to open a support case
> with SchedMD?
>
>
>
> We have a support contract, but no where on their website can I find a
> link to open a case with them.
>
>
>
> Thanks,
>
> Jeff
>


Re: [slurm-users] QOS time limit tighter than partition limit

2021-12-16 Thread Fulcomer, Samuel
...and you shouldn't be able to do this with a QoS (I think as you want it
to), as "grptresrunmins" applies to the aggregate of everything using the
QoS.

On Thu, Dec 16, 2021 at 6:12 PM Fulcomer, Samuel 
wrote:

> I've not parsed your message very far, but...
>
> for i in `cat limit_users` ; do
>     sacctmgr modify user where user=$i partition=foo account=bar \
>         set grptresrunmins=cpu=Nlimit
> done
>
> On Thu, Dec 16, 2021 at 6:01 PM Ross Dickson 
> wrote:
>
>> I would like to impose a time limit stricter than the partition limit on
>> a certain subset of users.  I should be able to do this with a QOS, but I
>> can't get it to work.  What am I missing?
>>
>> At https://slurm.schedmd.com/resource_limits.html it says,
>> "Slurm's hierarchical limits are enforced in the following order ...:
>>
>> 1. Partition QOS limit
>> 2. Job QOS limit
>> 3. User association
>> 4. Account association(s), ascending the hierarchy
>> 5. Root/Cluster association
>> 6. Partition limit
>> 7. None
>>
>> Note: If limits are defined at multiple points in this hierarchy, the
>> point in this list where the limit is first defined will be used."
>>
>> And there's a little more later about the Partition limit being an upper
>> bound on everything.
>>
>> This says to me that if:
>> * there is a large time limit on a partition,
>> * there is a smaller time limit on the job QOS, and
>> * the partition has no associated QOS,
>> then the MaxWall on the Job QOS should have effect.
>>
>> But that's not what I observe.  I've created a QOS 'nonpaying' with
>> MaxWall=1-0:0:0, and set MaxTime=7-0:0:0 on partition 'general'.  I set the
>> association on  user1 so that their job will get QOS 'nonpaying', then
>> submit a job with --time=7-0:0:0, and it runs:
>>
>> $ scontrol show partition general | egrep 'QoS|MaxTime'
>>AllocNodes=ALL Default=YES QoS=N/A
>>MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO
>> MaxCPUsPerNode=UNLIMITED
>> $ sacctmgr show qos nonpaying format=name,flags,maxwall
>>   NameFlags MaxWall
>> --  ---
>>  nonpaying   1-00:00:00
>> $ scontrol show job 33 | egrep 'QOS|JobState|TimeLimit'
>>Priority=4294901728 Nice=0 Account=acad1 QOS=nonpaying
>>JobState=RUNNING Reason=None Dependency=(null)
>>RunTime=00:00:40 TimeLimit=7-00:00:00 TimeMin=N/A
>> $ scontrol show config | grep AccountingStorageEnforce
>> AccountingStorageEnforce = associations,limits,qos
>>
>> Help!?
>>
>> --
>> Ross Dickson, Computational Research Consultant
>> ACENET  --   Compute Canada  --  Dalhousie University
>>
>


Re: [slurm-users] QOS time limit tighter than partition limit

2021-12-16 Thread Fulcomer, Samuel
I've not parsed your message very far, but...

for i in `cat limit_users` ; do
    sacctmgr modify user where user=$i partition=foo account=bar \
        set grptresrunmins=cpu=Nlimit
done

On Thu, Dec 16, 2021 at 6:01 PM Ross Dickson 
wrote:

> I would like to impose a time limit stricter than the partition limit on
> a certain subset of users.  I should be able to do this with a QOS, but I
> can't get it to work.  What am I missing?
>
> At https://slurm.schedmd.com/resource_limits.html it says,
> "Slurm's hierarchical limits are enforced in the following order ...:
>
> 1. Partition QOS limit
> 2. Job QOS limit
> 3. User association
> 4. Account association(s), ascending the hierarchy
> 5. Root/Cluster association
> 6. Partition limit
> 7. None
>
> Note: If limits are defined at multiple points in this hierarchy, the
> point in this list where the limit is first defined will be used."
>
> And there's a little more later about the Partition limit being an upper
> bound on everything.
>
> This says to me that if:
> * there is a large time limit on a partition,
> * there is a smaller time limit on the job QOS, and
> * the partition has no associated QOS,
> then the MaxWall on the Job QOS should have effect.
>
> But that's not what I observe.  I've created a QOS 'nonpaying' with
> MaxWall=1-0:0:0, and set MaxTime=7-0:0:0 on partition 'general'.  I set the
> association on  user1 so that their job will get QOS 'nonpaying', then
> submit a job with --time=7-0:0:0, and it runs:
>
> $ scontrol show partition general | egrep 'QoS|MaxTime'
>AllocNodes=ALL Default=YES QoS=N/A
>MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO
> MaxCPUsPerNode=UNLIMITED
> $ sacctmgr show qos nonpaying format=name,flags,maxwall
>   NameFlags MaxWall
> --  ---
>  nonpaying   1-00:00:00
> $ scontrol show job 33 | egrep 'QOS|JobState|TimeLimit'
>Priority=4294901728 Nice=0 Account=acad1 QOS=nonpaying
>JobState=RUNNING Reason=None Dependency=(null)
>RunTime=00:00:40 TimeLimit=7-00:00:00 TimeMin=N/A
> $ scontrol show config | grep AccountingStorageEnforce
> AccountingStorageEnforce = associations,limits,qos
>
> Help!?
>
> --
> Ross Dickson, Computational Research Consultant
> ACENET  --   Compute Canada  --  Dalhousie University
>


Re: [slurm-users] Prevent users from updating their jobs

2021-12-16 Thread Fulcomer, Samuel
There's no clear answer to this. It depends a bit on how you've segregated
your resources.

In our environment, GPU and bigmem nodes are in their own partitions.
There's nothing to prevent a user from specifying a list of potential
partitions in the job submission, so there would be no need for them to do
a post-submission "scontrol update jobid" to push a job into a partition
that violated the spirit of the service.

Our practice has been to periodically look at running jobs to see if they
are using (or have used, in the case of bigmem) less than their requested
resources, and send them a nastygram telling them to stop doing that.

Creating a LUA submission script that, e.g., blocks jobs from the gpu queue
that don't request gpus only helps to weed out the naive users. A
subversive user could request a gpu and only use the allocated cores and
memory. There's no way to deal with this other than monitoring running jobs
and nastygrams, with removal of access after repeated offenses.
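
A rough sketch of the periodic check (the format fields and node name are just
examples):

# requested resources for running jobs in the gpu partition: jobid, user, gres, cores, memory
squeue -p gpu -t RUNNING -h -o "%i %u %b %C %m"
# then spot-check actual utilization on the node in question
ssh gpu1504 nvidia-smi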

On Thu, Dec 16, 2021 at 3:36 PM Jordi Blasco  wrote:

> Hi everyone,
>
> I was wondering if there is a way to prevent users from updating their
> jobs with "scontrol update job".
>
> Here is the justification.
>
> A hypothetical user submits a job requesting a regular node, but
> he/she realises that the large memory nodes or the GPU nodes are idle.
> Using the previous command, users can request the job to use one of those
> resources to avoid waiting without a real need for using them.
>
> Any suggestions to prevent that?
>
> Cheers,
>
> Jordi
>
> sbatch --mem=1G -t 0:10:00 --wrap="srun -n 1 sleep 360"
> scontrol update job 791 Features=smp
>
> [user01@slurm-simulator ~]$ sacct -j 791 -o "jobid,nodelist,user"
>JobIDNodeList  User
>  --- -
> 791smp-1user01
>


Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
gres.conf file. We've always used "nvidia-smi topo -m" to confirm that
we've got a single-root or dual-root node and have entered the correct info
in gres.conf to map connections to the CPU sockets, e.g.:

# 8-gpu A6000 nodes - dual-root
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7] CPUs=24-47





On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel 
wrote:

> Well... you've got lots of weirdness, as the scontrol show job command
> isn't listing any GPU TRES requests, and the scontrol show node command
> isn't listing any configured GPU TRES resources.
>
> If you send me your entire slurm.conf I'll have a quick look-over.
>
> You also should be using cgroup.conf to fence off the GPU devices so that
> a job only sees the GPUs that it's been allocated. The lines in the batch
> file to figure it out aren't necessary. I forgot to ask you about
> cgroup.conf.
>
> regards,
> Sam
>
> On Fri, Aug 20, 2021 at 5:46 PM Andrey Malyutin 
> wrote:
>
>> Thank you Samuel,
>>
>> Slurm version is 20.02.6. I'm not entirely sure about the platform,
>> RTX6000 nodes are about 2 years old, and 3090 node is very recent.
>> Technically we have 4 nodes (hence references to node04 in info below), but
>> one of the nodes is down and out of the system at the moment. As you see,
>> the job really wants to run on the downed node instead of going to node02
>> or node03.
>>
>> Thank you again,
>> Andrey
>>
>>
>>
>> *scontrol info:*
>>
>> JobId=283 JobName=cryosparc_P2_J214
>>
>>UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
>>
>>Priority=4294901572 Nice=0 Account=(null) QOS=normal
>>
>>JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:node04
>> Dependency=(null)
>>
>>Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>
>>RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>
>>SubmitTime=2021-08-20T20:55:00 EligibleTime=2021-08-20T20:55:00
>>
>>AccrueTime=2021-08-20T20:55:00
>>
>>StartTime=Unknown EndTime=Unknown Deadline=N/A
>>
>>SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-20T23:36:14
>>
>>Partition=CSCluster AllocNode:Sid=headnode:108964
>>
>>ReqNodeList=(null) ExcNodeList=(null)
>>
>>NodeList=(null)
>>
>>NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>>TRES=cpu=4,mem=24000M,node=1,billing=4
>>
>>Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>
>>MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
>>
>>Features=(null) DelayBoot=00:00:00
>>
>>OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>
>>
>> Command=/data/backups/takeda2/data/cryosparc_projects/P8/J214/queue_sub_script.sh
>>
>>WorkDir=/ssd/CryoSparc/cryosparc_master
>>
>>StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>StdIn=/dev/null
>>
>>StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>Power=
>>
>>TresPerNode=gpu:1
>>
>>MailUser=cryosparc MailType=NONE
>>
>>
>> *Script:*
>>
>> #SBATCH --job-name cryosparc_P2_J214
>>
>> #SBATCH -n 4
>>
>> #SBATCH --gres=gpu:1
>>
>> #SBATCH -p CSCluster
>>
>> #SBATCH --mem=24000MB
>>
>> #SBATCH
>> --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>> #SBATCH
>> --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>
>>
>> available_devs=""
>>
>> for devidx in $(seq 0 15);
>>
>> do
>>
>> if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid
>> --format=csv,noheader) ]] ; then
>>
>> if [[ -z "$available_devs" ]] ; then
>>
>> available_devs=$devidx
>>
>> else
>>
>> available_devs=$available_devs,$devidx
>>
>> fi
>>
>> fi
>>
>> done
>>
>> export CUDA_VISIBLE_DEVICES=$available_devs
>>
>>
>>
>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job
>> J214 --master_hostname headnode.cm.cluster --master_command_core_port 39002
>> > /data/backups/takeda2/data/c

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
efault=YES MinNodes=1 DefaultTime=UNLIMITED
> MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1
> OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL
> Nodes=node[01-04]
>
> PartitionName=CSLive MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED
> AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO
> PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node01
>
> PartitionName=CSCluster MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED
> AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO
> PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[02-04]
>
> ClusterName=slurm
>
>
>
> *Gres.conf*
>
> # This section of this file was automatically generated by cmd. Do not
> edit manually!
>
> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>
> AutoDetect=NVML
>
> # END AUTOGENERATED SECTION   -- DO NOT REMOVE
>
> #Name=gpu File=/dev/nvidia[0-3] Count=4
>
> #Name=mic Count=0
>
>
>
> *Sinfo:*
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>
> defq*up   infinite  1  down* node04
>
> defq*up   infinite  3   idle node[01-03]
>
> CSLive   up   infinite  1   idle node01
>
> CSClusterup   infinite  1  down* node04
>
> CSClusterup   infinite  2   idle node[02-03]
>
>
>
> *Node1:*
>
> NodeName=node01 Arch=x86_64 CoresPerSocket=16
>
>CPUAlloc=0 CPUTot=64 CPULoad=0.04
>
>AvailableFeatures=RTX3090
>
>ActiveFeatures=RTX3090
>
>Gres=gpu:4
>
>NodeAddr=node01 NodeHostName=node01 Version=20.02.6
>
>OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>
>RealMemory=386048 AllocMem=0 FreeMem=16665 Sockets=2 Boards=1
>
>State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>
>Partitions=defq,CSLive
>
>BootTime=2021-08-04T13:59:08 SlurmdStartTime=2021-08-10T09:32:43
>
>CfgTRES=cpu=64,mem=377G,billing=64
>
>AllocTRES=
>
>CapWatts=n/a
>
>CurrentWatts=0 AveWatts=0
>
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
> *Node2-3*
>
> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>
>CPUAlloc=0 CPUTot=64 CPULoad=0.48
>
>AvailableFeatures=RTX6000
>
>ActiveFeatures=RTX6000
>
>Gres=gpu:4(S:0-1)
>
>NodeAddr=node02 NodeHostName=node02 Version=20.02.6
>
>OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>
>RealMemory=257024 AllocMem=0 FreeMem=2259 Sockets=2 Boards=1
>
>State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>
>Partitions=defq,CSCluster
>
>BootTime=2021-07-29T20:47:32 SlurmdStartTime=2021-08-10T09:32:55
>
>CfgTRES=cpu=64,mem=251G,billing=64
>
>AllocTRES=
>
>CapWatts=n/a
>
>CurrentWatts=0 AveWatts=0
>
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> On Thu, Aug 19, 2021, 6:07 PM Fulcomer, Samuel 
> wrote:
>
>> What SLURM version are you running?
>>
>> What are the #SLURM directives in the batch script? (or the sbatch
>> arguments)
>>
>> When the single GPU jobs are pending, what's the output of 'scontrol show
>> job JOBID'?
>>
>> What are the node definitions in slurm.conf, and the lines in gres.conf?
>>
>> Are the nodes all the same host platform (motherboard)?
>>
>> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX
>> 1s, A6000s, and A40s, with a mix of single and dual-root platforms, and
>> haven't seen this problem with SLURM 20.02.6 or earlier versions.
>>
>> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin 
>> wrote:
>>
>>> Hello,
>>>
>>> We are in the process of finishing up the setup of a cluster with 3
>>> nodes, 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any
>>> job asking for 1 GPU in the submission script will wait to run on the 3090
>>> node, no matter resource availability. Same job requesting 2 or more GPUs
>>> will run on any node. I don't even know where to begin troubleshooting this
>>> issue; entries for the 3 nodes are effectively identical in slurm.conf. Any
>>> help would be appreciated. (If helpful - this cluster is used for
>>> structural biology, with cryosparc and relion packages).
>>>
>>> Thank you,
>>> Andrey
>>>
>>


Re: [slurm-users] GPU jobs not running correctly

2021-08-19 Thread Fulcomer, Samuel
What SLURM version are you running?

What are the #SLURM directives in the batch script? (or the sbatch
arguments)

When the single GPU jobs are pending, what's the output of 'scontrol show
job JOBID'?

What are the node definitions in slurm.conf, and the lines in gres.conf?

Are the nodes all the same host platform (motherboard)?

We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX 1s,
A6000s, and A40s, with a mix of single and dual-root platforms, and haven't
seen this problem with SLURM 20.02.6 or earlier versions.

On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin 
wrote:

> Hello,
>
> We are in the process of finishing up the setup of a cluster with 3 nodes,
> 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
> asking for 1 GPU in the submission script will wait to run on the 3090
> node, no matter resource availability. Same job requesting 2 or more GPUs
> will run on any node. I don't even know where to begin troubleshooting this
> issue; entries for the 3 nodes are effectively identical in slurm.conf. Any
> help would be appreciated. (If helpful - this cluster is used for
> structural biology, with cryosparc and relion packages).
>
> Thank you,
> Andrey
>


Re: [slurm-users] History of pending jobs

2021-07-30 Thread Fulcomer, Samuel
XDMoD can do that for you, but bear in mind that wait/pending time by
itself may not be particularly useful.

Consider the extreme scenario in which a user is only allowed to use one
node at a time, but submits a thousand one-day jobs. Without any other
competition for resources, the average wait/pending time would be five
hundred days.
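
That said, if you just want a rough per-job number straight from the accounting
data, something like this sketch works (GNU date assumed; the partition name
and start date are only examples):

sacct -X -n -P -r gpu -S 2021-07-01 -o JobID,Submit,Start |
while IFS='|' read -r jobid submit start; do
    if [ "$start" = "Unknown" ] || [ "$start" = "None" ]; then continue; fi
    # minutes between submission and start
    echo "$jobid $(( ( $(date -d "$start" +%s) - $(date -d "$submit" +%s) ) / 60 )) min queued"
done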

On Fri, Jul 30, 2021 at 2:44 PM Glenn (Gedaliah) Wolosh 
wrote:

> I'm interested on getting an idea how long jobs were pending in a
> particular partition. Is there any magic to sreport or sacct that can
> generate this info.
>
> I could also use something like:"sreport cluster utilization" broken down
> by partition.
>
> Any help would be appreciated.
>
>
>
> *Glenn (Gedaliah) Wolosh, Ph.D.*
> Ass't Director Research Software and Cloud Computing
> Acad & Research Computing Systems
> gwol...@njit.edu • (973) 596-5437
>
> A Top 100 National University
> *U.S. News & World Report*
>
>
>
>
>
>


Re: [slurm-users] Incorrect Number of GPUs?

2021-07-26 Thread Fulcomer, Samuel
Yeah, you'd think after all this time it would, but it remains a bit of
arcane knowledge that's mostly passed on in oral history.

There are some things that the slurmd processes need to be restarted for,
as well. I have a vague memory that changing the debug level is one...

On Mon, Jul 26, 2021 at 1:32 PM Jason Simms  wrote:

> Dear Samuel,
>
> Restarting slurmctld did the trick. Thanks! I should have thought to do
> that, but typically scontrol reconfigure picks up most changes.
>
> Warmest regards,
> Jason
>
> On Mon, Jul 26, 2021 at 12:55 PM Fulcomer, Samuel <
> samuel_fulco...@brown.edu> wrote:
>
>> ...and... you need to restart slurmctld when you change a NodeName line.
>> "scontrol reconfigure" doesn't do the truck.
>>
>> On Mon, Jul 26, 2021 at 12:49 PM Fulcomer, Samuel <
>> samuel_fulco...@brown.edu> wrote:
>>
>>> If you have a dual-root PCIe system you may need to specify the CPU/core
>>> affinity in gres.conf.
>>>
>>> On Mon, Jul 26, 2021 at 12:07 PM Jason Simms 
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I have a GPU node with 3 identical GPUs (we started with two and
>>>> recently added the third). Running nvidia-smi correctly shows that all
>>>> three are recognized. My gres.conf file has only this line:
>>>>
>>>> NodeName=gpu01 File=/dev/nvidia[0-2] Type=quadro_8000 Name=gpu Count=3
>>>>
>>>> And the relevant lines in slurm.conf are:
>>>>
>>>> NodeName=gpu01 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1
>>>> RealMemory=189900 State=UNKNOWN Gres=gpu:quadro_8000:3
>>>>
>>>> As far as I can tell, all of this is fine (and we had no issues when we
>>>> only had the initial two GPUs in the system). However, now when I run sinfo
>>>> -o %G (which as I understand will report the total number of gres
>>>> resources available), this is the output:
>>>>
>>>> GRES
>>>> (null)
>>>> gpu:quadro_8000:2
>>>>
>>>> Is this saying that it doesn't recognize the third card? Any
>>>> suggestions? As always, thank you for your help!
>>>>
>>>> Warmest regards,
>>>> Jason
>>>>
>>>> --
>>>> *Jason L. Simms, Ph.D., M.P.H.*
>>>> Manager of Research and High-Performance Computing
>>>> XSEDE Campus Champion
>>>> Lafayette College
>>>> Information Technology Services
>>>> 710 Sullivan Rd | Easton, PA 18042
>>>> Office: 112 Skillman Library
>>>> p: (610) 330-5632
>>>>
>>>
>
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
>


Re: [slurm-users] Incorrect Number of GPUs?

2021-07-26 Thread Fulcomer, Samuel
...and... you need to restart slurmctld when you change a NodeName line.
"scontrol reconfigure" doesn't do the truck.

On Mon, Jul 26, 2021 at 12:49 PM Fulcomer, Samuel 
wrote:

> If you have a dual-root PCIe system you may need to specify the CPU/core
> affinity in gres.conf.
>
> On Mon, Jul 26, 2021 at 12:07 PM Jason Simms  wrote:
>
>> Hello all,
>>
>> I have a GPU node with 3 identical GPUs (we started with two and recently
>> added the third). Running nvidia-smi correctly shows that all three are
>> recognized. My gres.conf file has only this line:
>>
>> NodeName=gpu01 File=/dev/nvidia[0-2] Type=quadro_8000 Name=gpu Count=3
>>
>> And the relevant lines in slurm.conf are:
>>
>> NodeName=gpu01 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1
>> RealMemory=189900 State=UNKNOWN Gres=gpu:quadro_8000:3
>>
>> As far as I can tell, all of this is fine (and we had no issues when we
>> only had the initial two GPUs in the system). However, now when I run sinfo
>> -o %G (which as I understand will report the total number of gres
>> resources available), this is the output:
>>
>> GRES
>> (null)
>> gpu:quadro_8000:2
>>
>> Is this saying that it doesn't recognize the third card? Any suggestions?
>> As always, thank you for your help!
>>
>> Warmest regards,
>> Jason
>>
>> --
>> *Jason L. Simms, Ph.D., M.P.H.*
>> Manager of Research and High-Performance Computing
>> XSEDE Campus Champion
>> Lafayette College
>> Information Technology Services
>> 710 Sullivan Rd | Easton, PA 18042
>> Office: 112 Skillman Library
>> p: (610) 330-5632
>>
>


Re: [slurm-users] Incorrect Number of GPUs?

2021-07-26 Thread Fulcomer, Samuel
If you have a dual-root PCIe system you may need to specify the CPU/core
affinity in gres.conf.
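
A sketch of what that looks like in gres.conf (the device-to-core ranges here
are placeholders -- take the real mapping from "nvidia-smi topo -m"):

NodeName=gpu01 Name=gpu Type=quadro_8000 File=/dev/nvidia[0-1] CPUs=0-15
NodeName=gpu01 Name=gpu Type=quadro_8000 File=/dev/nvidia2 CPUs=16-31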

On Mon, Jul 26, 2021 at 12:07 PM Jason Simms  wrote:

> Hello all,
>
> I have a GPU node with 3 identical GPUs (we started with two and recently
> added the third). Running nvidia-smi correctly shows that all three are
> recognized. My gres.conf file has only this line:
>
> NodeName=gpu01 File=/dev/nvidia[0-2] Type=quadro_8000 Name=gpu Count=3
>
> And the relevant lines in slurm.conf are:
>
> NodeName=gpu01 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=189900 State=UNKNOWN Gres=gpu:quadro_8000:3
>
> As far as I can tell, all of this is fine (and we had no issues when we
> only had the initial two GPUs in the system). However, now when I run sinfo
> -o %G (which as I understand will report the total number of gres
> resources available), this is the output:
>
> GRES
> (null)
> gpu:quadro_8000:2
>
> Is this saying that it doesn't recognize the third card? Any suggestions?
> As always, thank you for your help!
>
> Warmest regards,
> Jason
>
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
>


Re: [slurm-users] Priority Access to GPU?

2021-07-12 Thread Fulcomer, Samuel
Jason,

I've just been working through a similar scenario to handle access to our
3090 nodes that have been purchased by researchers.

I suggest putting the node into an additional partition, and then add a QOS
for the lab group that has grptres=gres/gpu=1,cpu=M,mem=N (where cpu and
mem are whatever are reasonable). Only create associations for the lab
group members for that partition. The lab group members can then submit to
both partitions using "--partition=special,regular", where "special" is the
additional partition, and "regular" is the original partition. If the QOS
or partition has a high priority assigned to it, then a lab group
member's job should always run next on the same gpu that had been
previously allocated. That way only one job should be preempted to allow
the execution of multiple, successive lab group jobs.
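
A sketch of the pieces (names, limits, and priority values are placeholders):

# slurm.conf: extra partition holding just the lab's node, at higher priority
PartitionName=lab-gpu Nodes=gpu01 PriorityTier=2 MaxTime=UNLIMITED State=UP

# QOS capping the group at one GPU's worth of resources (mem in MB)
sacctmgr add qos labqos
sacctmgr modify qos labqos set GrpTRES=gres/gpu=1,cpu=16,mem=96000

# associations on the extra partition for lab members only
sacctmgr add user alice account=lab partition=lab-gpu
sacctmgr modify user where user=alice partition=lab-gpu set qos=labqos

Lab members then submit with, e.g., "--partition=lab-gpu,gpu".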

Regards,
Sam

On Mon, Jul 12, 2021 at 3:38 PM Jason Simms  wrote:

> Dear all,
>
> I feel like I've attempted to track this down before but have never fully
> understood how to accomplish this.
>
> I have a GPU node with three GPU cards, one of which was purchased by a
> user. I want to provide priority access for that user to the card, while
> still allowing it to be used by the community when not in use by that
> particular user. Well, more specifically, I'd like a specific account
> within Slurm to have priority access; the account contains multiple
> accounts that are part of the faculty's lab group.
>
> I have such access properly configured for the actual nodes (priority
> preempt), but the GPU (which is configured as a GRES) seems like a beast of
> a different color.
>
> Warmest regards,
> Jason
>
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632
>


Re: [slurm-users] Re: how to check what slurm is doing when job pending with reason=none?

2021-06-17 Thread Fulcomer, Samuel
You can specify a partition priority in the partition line in slurm.conf,
e.g. Priority=65000 (I forget what the max is...)

On Thu, Jun 17, 2021 at 10:31 PM  wrote:

> Thanks for the help. We tried to reduce the sched_interval and the pending
> time decreased as expected.
>
> But the influence of 'sched_interval' is global, and setting it too small may
> put pressure on the slurmctld server. Since we only want a quick response on
> the debug partition (which is designed to let users frequently submit debug
> jobs without waiting), is it possible to make slurm schedule immediately on
> that specific partition, no matter how long the job queue is?
>
> -----Original Message-----
> From: Gerhard Strangar 
> Sent: June 17, 2021 0:27
> To: Slurm User Community List 
> Subject: Re: [slurm-users] how to check what slurm is doing when job pending
> with reason=none?
>
> taleinterve...@sjtu.edu.cn wrote:
>
> > But after submit, this job still stay at PENDING state for about
> > 30-60s and during the pending time sacct shows the REASON is "None".
>
> It's the default sched_interval=60 in your slurm.conf.
>
> Gerhard
>
>
>
>
>


Re: [slurm-users] monitor draining/drain nodes

2021-06-12 Thread Fulcomer, Samuel
...sorry... "sinfo | grep drain && sinfo | grep drain | mail -s 'drain
nodes' <email address>"
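
Dropped into cron, that might look like (sketch; the address and interval are
placeholders, and "sinfo -R" also catches down nodes):

*/15 * * * * sinfo -h -R | grep -q . && sinfo -R | mail -s 'drain/down nodes' admin@example.com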



On Sat, Jun 12, 2021 at 4:46 PM Fulcomer, Samuel 
wrote:

> ...something like "sinfo | grep drain && mail -s 'drain nodes' <email address>"
>
> ...will work...
>
> Substitute "draining" or "drained" for "drain" to taste...
>
> On Sat, Jun 12, 2021 at 4:32 PM Rodrigo Santibáñez <
> rsantibanez.uch...@gmail.com> wrote:
>
>> Hi SLURM users,
>>
>> Does anyone have a cronjob or similar to monitor and warn via e-mail when
>> a node is in draining/drain status?
>>
>> Thank you.
>>
>> Best regards.
>> Rodrigo Santibáñez
>>
>


Re: [slurm-users] monitor draining/drain nodes

2021-06-12 Thread Fulcomer, Samuel
...something like "sinfo | grep drain && mail -s 'drain nodes' <email address>"

...will work...

Substitute "draining" or "drained" for "drain" to taste...

On Sat, Jun 12, 2021 at 4:32 PM Rodrigo Santibáñez <
rsantibanez.uch...@gmail.com> wrote:

> Hi SLURM users,
>
> Does anyone have a cronjob or similar to monitor and warn via e-mail when
> a node is in draining/drain status?
>
> Thank you.
>
> Best regards.
> Rodrigo Santibáñez
>


Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Fulcomer, Samuel
inline below...

On Sat, Apr 3, 2021 at 4:50 PM Will Dennis  wrote:

> Sorry, obvs wasn’t ready to send that last message yet…
>
>
>
> Our issue is the shared storage is via NFS, and the “fast storage in
> limited supply” is only local on each node. Hence the need to copy it over
> from NFS (and then remove it when finished with it.)
>
> I also wanted the copy & remove to be different jobs, because the main
> processing job usually requires GPU gres, which is a time-limited resource
> on the partition. I don’t want to tie up the allocation of GPUs while the
> data is staged (and removed), and if the data copy fails, don’t want to
> even progress to the job where the compute happens (so like,
> copy_data_locally && process_data)
>

...yup... this is the problem. We've invested in GPFS and an NVMe Excelero
pool (for initial placement); however, we still have the problem of having
users pull down data from community repositories before running useful
computation.

Your question has gotten me thinking about this more. In our case, all of
our nodes are diskless, so this wouldn't really work for us (but we do have
fast GPFS), but if your fast storage is only local to your nodes, the
subsequent compute jobs will need to request those specific nodes, so
you'll need to have a mechanism to increase the SLURM scheduling  "weight"
of the nodes after staging, so the scheduler won't select them over nodes
with a lower weight. That could be done in a job epilog.




>
> If you've got other fast storage in limited supply that can be used for
> data that can be staged, then by all means use it, but consider whether you
> want batch cpu cores tied up with the wall time of transferring the data.
> This could easily be done on a time-shared frontend login node from which
> the users could then submit (via script) jobs after the data was staged.
> Most of the transfer wallclock is in network wait, so don't waste dedicated
> cores for it.
>
>
>
>


Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Fulcomer, Samuel
Hi,

"scratch space" is generally considered ephemeral storage that only exists
for the duration of the job (It's eligible for deletion in an epilog or
next-job prolog) .

If you've got other fast storage in limited supply that can be used for
data that can be staged, then by all means use it, but consider whether you
want batch cpu cores tied up with the wall time of transferring the data.
This could easily be done on a time-shared frontend login node from which
the users could then submit (via script) jobs after the data was staged.
Most of the transfer wallclock is in network wait, so don't waste dedicated
cores for it.

On Sat, Apr 3, 2021 at 4:13 PM Will Dennis  wrote:

> What I mean by “scratch” space is indeed local persistent storage in our
> case; sorry if my use of “scratch space” is already a generally-known Slurm
> concept I don’t understand, or something like /tmp… That’s why my desired
> workflow is to “copy data locally / use data from copy / remove local copy”
> in separate steps.
>
>
>
>
>
> *From: *slurm-users  on behalf of
> Fulcomer, Samuel 
> *Date: *Saturday, April 3, 2021 at 4:00 PM
> *To: *Slurm User Community List 
> *Subject: *Re: [slurm-users] Staging data on the nodes one will be
> processing on via sbatch
>
> […]
>
> The best current workflow is to stage data into fast local persistent
> storage, and then to schedule jobs, or schedule a job that does it
> synchronously (TimeLimit=Stage+Compute). The latter is pretty unsocial and
> wastes cycles.
>
> […]
>


Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Fulcomer, Samuel
Unfortunately this is not a good workflow.

You would submit a staging job with a dependency for the compute job;
however, in the meantime, the scheduler might launch higher-priority jobs
that would want the scratch space, and cause it to be scrubbed.

In a rational process, the scratch space would be scrubbed for the
higher-priority jobs. I'm now thinking of a way that the scheduler
could consider data turds left by previous jobs, but that's not currently a
scheduling feature in SLURM multi-factor or any other scheduler I know.

The best current workflow is to stage data into fast local persistent
storage, and then to schedule jobs, or schedule a job that does it
synchronously (TimeLimit=Stage+Compute). The latter is pretty unsocial and
wastes cycles.

On Sat, Apr 3, 2021 at 3:45 PM Will Dennis  wrote:

> Hi all,
>
>
>
> We have various NFS servers that contain the data that our researchers
> want to process. These are mounted on our Slurm clusters on well-known
> paths. Also, the nodes have local fast scratch disk on another well-known
> path. We do not have any distributed file systems in use (Our Slurm
> clusters are basically just collections of hetero nodes of differing types,
> not a traditional HPC setup by any means.)
>
>
>
> In most cases, the researchers can process the data directly off the NFS
> mounts without it causing any issues, but in some cases, this slows down
> the computation unacceptably. They could manually copy the data to the
> local drive using an allocation & srun commands, but I am wondering if
> there is a way to do this in sbatch?
>
>
>
> I tried this method:
>
>
>
> wdennis@submit01 ~> sbatch transfer.sbatch
>
> Submitted batch job 329572
>
> wdennis@submit01 ~> sbatch --dependency=afterok:329572 test_job.sbatch
>
> Submitted batch job 329573
>
> wdennis@submit01 ~>  sbatch --dependency=afterok:329573 rm_data.sbatch
>
> Submitted batch job 329574
>
> wdennis@submit01 ~>
>
> wdennis@submit01 ~> squeue
>
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>
> 329573   gpu wdennis_  wdennis PD   0:00  1
> (Dependency)
>
> 329574   gpu wdennis_  wdennis PD   0:00  1
> (Dependency)
>
> 329572   gpu wdennis_  wdennis  R   0:23  1
> compute-gpu02
>
>
>
> But it seems to not preserve the node allocated with the --dependency jobs:
>
>
>
>
> JobID|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|State|ExitCode|AllocTRES|
>
>
> 329572|wdennis_data_transfer|wdennis|gpu|compute-gpu02|1|2Gc|00:02:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
>
>
> 329573|wdennis_compute_job|wdennis|gpu|compute-gpu05|1|128Gn|00:03:00|normal|COMPLETED|0:0|cpu=1,mem=128G,node=1,gres/gpu=1|
>
>
> 329574|wdennis_data_removal|wdennis|gpu|compute-gpu02|1|2Gc|00:00:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
>
>
>
> What is the best way to do something like “stage the data on a local path
> / run computation using the local copy / remove the locally staged data
> when complete”?
>
>
>
> Thanks!
>
> Will
>


Re: [slurm-users] Parent account in AllowAccounts

2021-01-15 Thread Fulcomer, Samuel
Durai,

There is no inheritance in "AllowAccounts". You need to specify each
account explicitly.

There _is_ inheritance in fairshare calculation.

On Fri, Jan 15, 2021 at 2:17 PM Brian Andrus  wrote:

> As I understand it, the parents are really meant for reporting, so you
> can run reports that aggregate the usage among children. Useful for a
> chargeback model.
>
> As far as permissions, that is on a per account basis, regardless of
> hierarchy.
>
> Just because a parent can go to the bar, doesn't mean their child can :)
>
> Brian Andrus
>
> On 1/15/2021 6:38 AM, Durai Arasan wrote:
> > Hi,
> > As you know for each partition you can specify
> > AllowAccounts=account1,account2...
> >
> > I have a parent account say "parent1" with two child accounts "child1"
> > and "child2"
> >
> > I expected that setting AllowAccounts=parent1 will allow
> > parent1,child1, and child2 to submit jobs to that partition. But
> > unfortunately only parent1 is able to submit jobs.
> >
> > For parent1,child1 and child2 to submit jobs I have to specify all
> > accounts individually:
> > AllowAccounts=parent1,child1,child2
> >
> > Am I doing something wrong or is this the way slurm is set up? Does it
> > not make sense that when a parent account is added then the child
> > accounts should automatically also be able to submit jobs to that
> > partition?
> >
> > Thanks,
> > Durai
> >
> >
>
>


Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
Also note that there was a bug in an older version of SLURM
(pre-17-something) that corrupted the database in a way that prevented
GPU/gres fencing. If that affected you and you're still using the same
database, GPU fencing probably isn't working. There's a way of fixing this
manually through sql hacking; however, we just went with a virgin database
when we last upgraded in order to get it working (and sucked the accounting
data into XDMoD).



On Thu, Jan 14, 2021 at 6:36 PM Fulcomer, Samuel 
wrote:

> AllowedDevicesFile should not be necessary. The relevant devices are
> identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.
>
> nvidia-smi will only see the allocated GPUs. Note that a single allocated
> GPU will always be shown by nvidia-smi to be GPU 0, regardless of its
> actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value
> of SLURM_STEP_GPUS will be set to the actual device number (N, where the
> device is /dev/nvidiaN).
>
> On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski 
> wrote:
>
>> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
>> too, though I think we were seeing a bug there at one time in this version.
>>
>> --
>> #BlackLivesMatter
>> 
>> || \\UTGERS,
>> |---*O*---
>> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
>> Campus
>> ||  \\of NJ | Office of Advanced Research Computing - MSB C630,
>> Newark
>> `'
>>
>> On Jan 14, 2021, at 18:05, Abhiram Chintangal 
>> wrote:
>>
>> 
>> Sean,
>>
>> Thanks for the clarification.I noticed that I am missing the
>> "AllowedDevices" option in mine. After adding this, the GPU allocations
>> started working. (Slurm version 18.08.8)
>>
>> I was also incorrectly using "nvidia-smi" as a check.
>>
>> Regards,
>>
>> Abhiram
>>
>> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby 
>> wrote:
>>
>>> Hi Abhiram,
>>>
>>> You need to configure cgroup.conf to constrain the devices a job has
>>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>>
>>> My cgroup.conf is
>>>
>>> CgroupAutomount=yes
>>>
>>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>>
>>> ConstrainCores=yes
>>> ConstrainRAMSpace=yes
>>> ConstrainSwapSpace=yes
>>> ConstrainDevices=yes
>>>
>>> TaskAffinity=no
>>>
>>> CgroupMountpoint=/sys/fs/cgroup
>>>
>>> The ConstrainDevices=yes is the key to stopping jobs from having access
>>> to GPUs they didn't request.
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <
>>> achintan...@berkeley.edu> wrote:
>>>
>>>> * UoM notice: External email. Be cautious of links, attachments, or
>>>> impersonation attempts *
>>>> --
>>>> Hello,
>>>>
>>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>>> Currently, I am not able to get the scheduler to
>>>> work well with GPU's (Gres).
>>>>
>>>> While slurm is able to filter by GPU type, it allocates all the GPU's
>>>> on the node. See below:
>>>>
>>>> [abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu
>>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, Tesla P100-PCIE-16GB
>>>>> 1, Tesla P100-PCIE-16GB
>>>>> 2, Tesla P100-PCIE-16GB
>>>>> 3, Tesla P100-PCIE-16GB
>>>>> [abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu
>>>>> nvidia-smi --query-gpu=index,name --format=csv
>>>>> index, name
>>>>> 0, TITAN RTX
>>>>> 1, TITAN RTX
>>>>> 2, TITAN RTX
>>>>> 3, TITAN RTX
>>>>> 4, TITAN RTX
>>>>> 5, TITAN RTX
>>>>> 6, TITAN RTX
>>>>> 7, TITAN RTX
>>>>>
>>>>
>>>> I am fairly new to Slurm and still figuring out my way around it. I
>>>> would really appreciate any help with this.
>>>>
>>>> For your reference, I attached the slurm.conf and gres.conf files.
>>>>
>>>> Best,
>>>>
>>>> Abhiram
>>>>
>>>> --
>>>>
>>>> Abhiram Chintangal
>>>> QB3 Nogales Lab
>>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>>> University of California Berkeley
>>>> 708D Stanley Hall, Berkeley, CA 94720
>>>> Phone (510)666-3344
>>>>
>>>>
>>
>> --
>>
>> Abhiram Chintangal
>> QB3 Nogales Lab
>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>> University of California Berkeley
>> 708D Stanley Hall, Berkeley, CA 94720
>> Phone (510)666-3344
>>
>>


Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
AllowedDevicesFile should not be necessary. The relevant devices are
identified in gres.conf. "ConstrainDevices=yes" should be all that's needed.

nvidia-smi will only see the allocated GPUs. Note that a single allocated
GPU will always be shown by nvidia-smi to be GPU 0, regardless of its
actual hardware ordinal, and GPU_DEVICE_ORDINAL will be set to 0. The value
of SLURM_STEP_GPUS will be set to the actual device number (N, where the
device is /dev/nvidiaN).
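
A quick way to see this from inside an allocation (sketch; the partition and
gres names are whatever your site uses):

srun -p gpu --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES SLURM_STEP_GPUS=$SLURM_STEP_GPUS; nvidia-smi -L'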

On Thu, Jan 14, 2021 at 6:20 PM Ryan Novosielski 
wrote:

> AFAIK, if you have this set up correctly, nvidia-smi will be restricted
> too, though I think we were seeing a bug there at one time in this version.
>
> --
> #BlackLivesMatter
> 
> || \\UTGERS,
> |---*O*---
> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ | Office of Advanced Research Computing - MSB C630,
> Newark
> `'
>
> On Jan 14, 2021, at 18:05, Abhiram Chintangal 
> wrote:
>
> 
> Sean,
>
> Thanks for the clarification.I noticed that I am missing the
> "AllowedDevices" option in mine. After adding this, the GPU allocations
> started working. (Slurm version 18.08.8)
>
> I was also incorrectly using "nvidia-smi" as a check.
>
> Regards,
>
> Abhiram
>
> On Thu, Jan 14, 2021 at 12:22 AM Sean Crosby 
> wrote:
>
>> Hi Abhiram,
>>
>> You need to configure cgroup.conf to constrain the devices a job has
>> access to. See https://slurm.schedmd.com/cgroup.conf.html
>>
>> My cgroup.conf is
>>
>> CgroupAutomount=yes
>> AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
>>
>> ConstrainCores=yes
>> ConstrainRAMSpace=yes
>> ConstrainSwapSpace=yes
>> ConstrainDevices=yes
>>
>> TaskAffinity=no
>>
>> CgroupMountpoint=/sys/fs/cgroup
>>
>> The ConstrainDevices=yes is the key to stopping jobs from having access
>> to GPUs they didn't request.
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>>
>>
>> On Thu, 14 Jan 2021 at 18:36, Abhiram Chintangal <
>> achintan...@berkeley.edu> wrote:
>>
>>> * UoM notice: External email. Be cautious of links, attachments, or
>>> impersonation attempts *
>>> --
>>> Hello,
>>>
>>> I recently set up a small cluster at work using Warewulf/Slurm.
>>> Currently, I am not able to get the scheduler to
>>> work well with GPU's (Gres).
>>>
>>> While slurm is able to filter by GPU type, it allocates all the GPU's on
>>> the node. See below:
>>>
>>> [abhiram@whale ~]$ srun --gres=gpu:p100:2 -n 1 --partition=gpu
 nvidia-smi --query-gpu=index,name --format=csv
 index, name
 0, Tesla P100-PCIE-16GB
 1, Tesla P100-PCIE-16GB
 2, Tesla P100-PCIE-16GB
 3, Tesla P100-PCIE-16GB
 [abhiram@whale ~]$ srun --gres=gpu:titanrtx:2 -n 1 --partition=gpu
 nvidia-smi --query-gpu=index,name --format=csv
 index, name
 0, TITAN RTX
 1, TITAN RTX
 2, TITAN RTX
 3, TITAN RTX
 4, TITAN RTX
 5, TITAN RTX
 6, TITAN RTX
 7, TITAN RTX

>>>
>>> I am fairly new to Slurm and still figuring out my way around it. I
>>> would really appreciate any help with this.
>>>
>>> For your reference, I attached the slurm.conf and gres.conf files.
>>>
>>> Best,
>>>
>>> Abhiram
>>>
>>> --
>>>
>>> Abhiram Chintangal
>>> QB3 Nogales Lab
>>> Bioinformatics Specialist @ Howard Hughes Medical Institute
>>> University of California Berkeley
>>> 708D Stanley Hall, Berkeley, CA 94720
>>> Phone (510)666-3344
>>>
>>>
>
> --
>
> Abhiram Chintangal
> QB3 Nogales Lab
> Bioinformatics Specialist @ Howard Hughes Medical Institute
> University of California Berkeley
> 708D Stanley Hall, Berkeley, CA 94720
> Phone (510)666-3344
>
>


Re: [slurm-users] trying to add gres

2021-01-05 Thread Fulcomer, Samuel
Important notes...

If requesting more than one core and not using "-N 1", equal numbers of
GPUs will be allocated on each node where the cores are allocated. (i.e. if
requesting 1 GPU for a 2-core job, if one core is allocated on each of two
nodes, one GPU will be allocated on each node).

If you are running node exclusive, all GPUs on the node will be allocated
to the job, regardless of how many are used.






On Tue, Jan 5, 2021 at 7:30 PM Erik Bryer  wrote:

> I made the gres.conf the same on both nodes and Slurm started without
> error. I'm now seeing another error.
>
> There are 4 GPUs defined per node. If I start 2 jobs with
> #SBATCH --gpus=foolsgold:4
> it runs one job in each of the 2 nodes. If I scancel those and run 4 jobs
> with the script reading
> #SBATCH --gpus=foolsgold:1
> I get 2 queued and 2 running jobs. It seems allocating 1 gpu allocates all
> 4, not just 1. But why would this be so?
>
> Thanks,
> Erik
> --
> *From:* slurm-users  on behalf of
> Chris Samuel 
> *Sent:* Thursday, December 24, 2020 5:44 PM
> *To:* slurm-users@lists.schedmd.com 
> *Subject:* Re: [slurm-users] trying to add gres
>
> On 24/12/20 4:42 pm, Erik Bryer wrote:
>
> > I made sure my slurm.conf is synchronized across machines. My intention
> > is to add some arbitrary gres for testing purposes.
>
> Did you update your gres.conf on all the nodes to match?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Fulcomer, Samuel
Our strategy is a bit simpler. We're migrating compute nodes to a new
cluster running 20.x. This isn't an upgrade. We'll keep the old slurmdbd
running for at least enough time to suck the remaining accounting data into
XDMoD.

The old cluster will keep running jobs until there are no more to run.
We'll drain and move nodes to the new cluster as we start seeing more and
more idle nodes in the old cluster. This avoids MPI ugliness and we move
directly to 20.x.
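
(For anyone doing the same, the draining itself is just something like

    scontrol update NodeName=node[101-116] State=DRAIN Reason="moving to new cluster"
    sinfo -R    # confirm what is draining and why

with your own node names, obviously.)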



On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon  wrote:

> In general  I would follow this:
>
> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> Namely:
>
> Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x) involves
> changes to the state files with new data structures, new options, etc.
> Slurm permits upgrades to a new major release from the past two major
> releases, which happen every nine months (e.g. 18.08.x or 19.05.x to
> 20.02.x) without loss of jobs or other state information. State information
> from older versions will not be recognized and will be discarded, resulting
> in loss of all running and pending jobs. State files are *not* recognized
> when downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
> resulting in loss of all running and pending jobs. For this reason,
> creating backup copies of state files (as described below) can be of value.
> Therefore when upgrading Slurm (more precisely, the slurmctld daemon),
> saving the *StateSaveLocation* (as defined in *slurm.conf*) directory
> contents with all state information is recommended. If you need to
> downgrade, restoring that directory's contents will let you recover the
> jobs. Jobs submitted under the new version will not be in those state
> files, but it can let you recover most jobs. An exception to this is that
> jobs may be lost when installing new pre-release versions (e.g.
> 20.02.0-pre1 to 20.02.0-pre2). Developers will try to note these cases in
> the NEWS file. Contents of major releases are also described in the
> RELEASE_NOTES file.
>
> So I wouldn't go directly to 20.x, instead I would go from 17.x to 19.x
> and then to 20.x
>
> -Paul Edmon-
> On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
>
> We're doing something similar. We're continuing to run production on 17.x
> and have set up a new server/cluster  running 20.x for testing and MPI app
> rebuilds.
>
> Our plan had been to add recently purchased nodes to the new cluster, and
> at some point turn off submission on the old cluster and switch everyone
> to  submission on the new cluster (new login/submission hosts). That way
> previously submitted MPI apps would continue to run properly. As the old
> cluster partitions started to clear out we'd mark ranges of nodes to drain
> and move them to the new cluster.
>
> We've since decided to wait until January, when we've scheduled some
> downtime. The process will remain the same wrt moving nodes from the old
> cluster to the new, _except_ that everything will be drained, so we can
> move big blocks of nodes and avoid slurm.conf Partition line ugliness.
>
> We're starting with a fresh database to get rid of the bug
> induced corruption that prevents GPUs from being fenced with cgroups.
>
> regards,
> s
>
> On Mon, Nov 2, 2020 at 8:28 AM navin srivastava 
> wrote:
>
>> Dear All,
>>
>> Currently we are running slurm version 17.11.x and wanted to move to 20.x.
>>
>> We are building the New server with Slurm 20.2 version and planning to
>> upgrade the client nodes from 17.x to 20.x.
>>
>> wanted to check if we can upgrade the Client from 17.x to 20.x directly
>> or we need to go through 17.x to 18.x and 19.x then 20.x
>>
>> Regards
>> Navin.
>>
>>
>>
>>


Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Fulcomer, Samuel
We're doing something similar. We're continuing to run production on 17.x
and have set up a new server/cluster  running 20.x for testing and MPI app
rebuilds.

Our plan had been to add recently purchased nodes to the new cluster, and
at some point turn off submission on the old cluster and switch everyone
to  submission on the new cluster (new login/submission hosts). That way
previously submitted MPI apps would continue to run properly. As the old
cluster partitions started to clear out we'd mark ranges of nodes to drain
and move them to the new cluster.

We've since decided to wait until January, when we've scheduled some
downtime. The process will remain the same wrt moving nodes from the old
cluster to the new, _except_ that everything will be drained, so we can
move big blocks of nodes and avoid slurm.conf Partition line ugliness.

We're starting with a fresh database to get rid of the bug-induced
corruption that prevents GPUs from being fenced with cgroups.

regards,
s

On Mon, Nov 2, 2020 at 8:28 AM navin srivastava 
wrote:

> Dear All,
>
> Currently we are running slurm version 17.11.x and wanted to move to 20.x.
>
> We are building the New server with Slurm 20.2 version and planning to
> upgrade the client nodes from 17.x to 20.x.
>
> wanted to check if we can upgrade the Client from 17.x to 20.x directly or
> we need to go through 17.x to 18.x and 19.x then 20.x
>
> Regards
> Navin.
>
>
>
>


Re: [slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2020-10-22 Thread Fulcomer, Samuel
Compile slurm without ucx support. We wound up spending quality time with
the Mellanox... wait, no, NVIDIA Networking UCX folks to get this sorted
out.
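
If you build from source, that's roughly the following (a sketch; the prefix is
whatever you normally use, and if your configure doesn't accept --without-ucx,
just omit --with-ucx and build on a host without the UCX devel packages):

    ./configure --prefix=/usr/local --without-ucx
    make -j8 && make install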

I recommend using SLURM 20 rather than 19.

regards,
s



On Thu, Oct 22, 2020 at 10:23 AM Michael Di Domenico 
wrote:

> was there ever a result to this?  i'm seeing the same error message,
> but i'm not adding in all the environ flags like the original poster.
>
> On Wed, Jul 10, 2019 at 9:18 AM Daniel Letai  wrote:
> >
> > Thank you Artem,
> >
> >
> > I've made a mistake while typing the mail, in all cases it was
> 'OMPI_MCA_pml=ucx' and not as written. When I went over the mail before
> sending, I must have erroneously 'fixed' it for some reason.
> >
> >
> > 
> >
> > Best regards,
> >
> > --Dani_L.
> >
> >
> > On 7/9/19 9:06 PM, Artem Polyakov wrote:
> >
> > Hello, Daniel
> >
> > Let me try to reproduce locally and get back to you.
> >
> > 
> > Best regards,
> > Artem Y. Polyakov, PhD
> > Senior Architect, SW
> > Mellanox Technologies
> > 
> > From: p...@googlegroups.com  on behalf of Daniel Letai
> 
> > Sent: Tuesday, July 9, 2019 3:25:22 AM
> > To: Slurm User Community List; p...@googlegroups.com;
> ucx-gr...@elist.ornl.gov
> > Subject: [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with
> SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
> >
> >
> > Cross posting to Slurm, PMIx and UCX lists.
> >
> >
> > Trying to execute a simple openmpi (4.0.1) mpi-hello-world via Slurm
> (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0) results in:
> >
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix:
> ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect]
> mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
> > slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr]
> mpi/pmix: ERROR: Unable to connect to 1
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix:
> ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect]
> mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
> > slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT
> 2019-07-01T13:20:36 ***
> > slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr]
> mpi/pmix: ERROR: Unable to connect to 0
> > srun: error: n2: task 1: Killed
> > srun: error: n1: task 0: Killed
> >
> >
> > However, the following works:
> >
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
> SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
> OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1'
> SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export
> SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
> UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2
> /data/mpihello/mpihello
> >
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> >
> > Executing mpirun directly (same env vars, without the slurm vars) works,
> so UCX appears to function correctly.
> >
> >
> > If both SLURM_PMIX_DIRECT_CONN_EARLY=true and
> SLURM_PMIX_DIRECT_CONN_UCX=true then I get collective timeout errors from
> mellanox/hcoll and glibc detected /data/mpihello/mpihello: malloc(): memory
> corruption (fast)
> >
> >
> > Can anyone help using PMIx direct connection with UCX in Slurm?
> >
> >
> >
> >
> > Some info about my setup:
> >
> >
> > UCX version
> >
> > [root@n1 ~]# ucx_info -v
> >
> > # UCT version=1.5.0 revision 02078b9
> > # configured with: --build=x86_64-redhat-linux-gnu
> --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
> --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
> --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec
> --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
> --infodir=/usr/share/info 

Re: [slurm-users] GRES Restrictions

2020-08-25 Thread Fulcomer, Samuel
cgroups should work correctly _if_ you're not running with an old corrupted
slurm database.

There was a bug in a much earlier version of slurm that corrupted the
database in a way that the cgroups/accounting code could no longer fence
GPUs. This was fixed in a later version, but the database corruption
carries forward.

Apparently the db can be fixed manually, but we're just starting with a new
install and fresh db.
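
For reference, the fencing itself is the usual cgroup setup (a minimal sketch,
not a complete config):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainDevices=yes

plus the GPU File= lines in gres.conf.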

On Tue, Aug 25, 2020 at 11:03 AM Ryan Novosielski 
wrote:

> Sorry about that. “NJT” should have read “but;” apparently my phone
> decided I was talking about our local transit authority. 
>
> On Aug 25, 2020, at 10:30, Ryan Novosielski  wrote:
>
>  I believe that’s done via a QoS on the partition. Have a look at the
> docs there, and I think “require” is a good key word to look for.
>
> Cgroups should also help with this, NJT I’ve been troubleshooting a
> problem where that seems not to be working correctly.
>
> --
> 
> || \\UTGERS,
> |---*O*---
> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ | Office of Advanced Research Computing - MSB C630,
> Newark
> `'
>
> On Aug 25, 2020, at 10:13, Willy Markuske  wrote:
>
> 
>
> Hello,
>
> I'm trying to restrict access to gpu resources on a cluster I maintain for
> a research group. There are two nodes put into a partition with gres gpu
> resources defined. User can access these resources by submitting their job
> under the gpu partition and defining a gres=gpu.
>
> When a user includes the flag --gres=gpu:# they are allocated the number
> of gpus and slurm properly allocates them. If a user requests only 1 gpu
> they only see CUDA_VISIBLE_DEVICES=1. However, if a user does not include
> the --gres=gpu:# flag they can still submit a job to the partition and are
> then able to see all the GPUs. This has led to some bad actors running jobs
> on all GPUs that other users have allocated and causing OOM errors on the
> gpus.
>
> Is it possible, and where would I find the documentation on doing so, to
> require users to define a --gres=gpu:# to be able to submit to a partition?
> So far reading the gres documentation doesn't seem to have yielded any word
> on this issue specifically.
>
> Regards,
> --
>
> Willy Markuske
>
> HPC Systems Engineer
> 
>
> Research Data Services
>
> P: (858) 246-5593
>
>


Re: [slurm-users] [External] Defining a default --nodes=1

2020-05-08 Thread Fulcomer, Samuel
"-N 1" restricts a job to a single node.

We've continued to have issues with this. Historically we've had a single
partition with multiple generations of nodes segregated for
multinode scheduling via topology.conf. "Use -N 1" (unless you really know
what you're doing) only goes so far.

There are a few other things that we may address in a submission plugin,
and this will be added.

As previously mentioned, the other approach is to have a partition limited
to single-node scheduling (as the default partition), and advise users who
really know how to do multi-node process management (MPI or other) to use
an overlapping partition.
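
In slurm.conf that approach looks roughly like this (a sketch; partition and
node names are made up):

    # default partition: jobs can never span nodes
    PartitionName=batch Nodes=node[001-400] MaxNodes=1 Default=YES
    # overlapping partition for people who really do multi-node MPI
    PartitionName=mpi   Nodes=node[001-400] Default=NO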



On Fri, May 8, 2020 at 4:10 PM Michael Robbert  wrote:

> Manuel,
>
> You may want to instruct your users to use ‘-c’ or ‘—cpus-per-task’ to
> define the number of cpus that they need. Please correct me if I’m wrong,
> but I believe that will restrict the jobs to a singe node whereas ‘-n’ or
> ‘—ntasks’ is really for multi process jobs which can be spread amongst
> multiple nodes.
>
>
>
> Mike
>
>
>
> *From: *slurm-users  on behalf of
> "Holtgrewe, Manuel" 
> *Reply-To: *Slurm User Community List 
> *Date: *Friday, May 8, 2020 at 03:28
> *To: *"slurm-users@lists.schedmd.com" 
> *Subject: *[External] [slurm-users] Defining a default --nodes=1
>
>
>
>
>
>
> Dear all,
>
>
>
> we're running a cluster where the large majority of jobs will use
> multi-threading and no message passing. Sometimes CPU>1 jobs are scheduled
> to run on more than one node (which would be fine for MPI jobs of course...)
>
>
>
> Is it possible to automatically set "--nodes=1" for all jobs outside of
> the "mpi" partition (that we setup for message passing jobs)?
>
>
>
> Thank you,
>
> Manuel
>
>
>
> --
> Dr. Manuel Holtgrewe, Dipl.-Inform.
> Bioinformatician
> Core Unit Bioinformatics – CUBI
> Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in
> the Helmholtz Association / Charité – Universitätsmedizin Berlin
>
> Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
> Postal Address: Chariteplatz 1, 10117 Berlin
>
> E-Mail: manuel.holtgr...@bihealth.de
> Phone: +49 30 450 543 607
> Fax: +49 30 450 7 543 901
> Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de
> www.charite.de
>


Re: [slurm-users] Managing Local Scratch/TmpDisk

2020-03-31 Thread Fulcomer, Samuel
If you use cgroups, tmpfs /tmp and /dev/shm usage is counted against the
requested memory for the job.
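
That's assuming the usual memory-constraint settings are on; a minimal sketch:

    # cgroup.conf
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

    # slurm.conf
    TaskPlugin=task/cgroup

with memory set up as a consumable resource in the select plugin.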

On Tue, Mar 31, 2020 at 4:51 PM Ellestad, Erik 
wrote:

> How are folks managing allocation of local TmpDisk for jobs?
>
> We see how you define the location of TmpFs in slurm.conf.
>
> And then how the amount per host is defined via TmpDisk.
>
> Then the request for srun/sbatch via --tmp=X
>
> However, it appears SLURM only checks the defined TmpDisk amount when
> allocating tmp, not the actual space available on disk.
>
> That is to say, for example, if during the course of its run, a job uses
> more of the TmpDisk space than it requests initially, SLURM doesn't check
> the actual disk space available for other jobs.
>
> The only notification that you will get is when your jobs run out of space
> or the defined TmpFs is full.
>
> Thanks!
>
> Erik
>
> ---
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
>


Re: [slurm-users] Heterogeneous HPC

2019-09-20 Thread Fulcomer, Samuel
Thanks! and I'll watch the video...

Privileged containers! never!

On Thu, Sep 19, 2019 at 9:06 PM Michael Jennings  wrote:

> On Thursday, 19 September 2019, at 19:27:38 (-0400),
> Fulcomer, Samuel wrote:
>
> > I obviously haven't been keeping up with any security concerns over the
> use
> > of Singularity. In a 2-3 sentence nutshell, what are they?
>
> So before I do that, if you have a few minutes, I do think you'll find
> it worth your time to go to https://youtu.be/H6VrjowOOF4?t=2361 (it'll
> start about 39 minutes in) and watch at least those next 8 or so minutes.
> I go into some detail about the security track records of multiple
> container runtimes and provide factual data so that folks can make their
> own risk assessments rather than just giving my personal opinion.  (The
> video does cut off the right side of the slides, but the slide deck is
> available at
> https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-19-22663
> for anyone interested.)
>
> If you really don't want to watch the video, though, I can provide a few
> of the data points.
>
> First off, if you have not read it before, you really should read
> Matthias Gerstner's assessment after doing a code review and security
> audit on Singularity 2.6.0 to see if it could be packaged for SuSE:
> https://www.openwall.com/lists/oss-security/2018/12/12/2
> The quotes I used on the slide for my talk came from comments he made in
> the linked SuSE Bugzilla bug -- which, for unknown reasons, was
> re-locked by SuSE after previously being unlocked once the bug report
> was public! -- regarding whether or not, and under what constraints, to
> include and support Singularity on SuSE.  Matthias is a widely respected
> security expert in the OSS community, so I trust his assessment and
> insight.  And his audit alone found 5 or 6 CVE-worthy vulnerabilities at
> once.
>
> Additionally, as I mentioned in the video, during the 3-year period
> 2016-2018, there were at least 17 different vulnerabilities found in
> Singularity.  Also, of the 9 releases they did during 2018, 7 of those
> were security releases to fix vulnerabilities (and frequently more than
> 1 at a time).  That's...not great.  Especially in an environment like
> ours where saying "security is important" is an understatement of
> nuclear proportions! ;-)
>
> And finally, while we were hopeful that the rewrite in Go (version 3.0
> and above) would correct the security failings in the code, there've
> already been multiple serious vulnerabilities (all grouped together
> under a single CVE identifier, CVE-2019-11328), at least one of which
> was essentially a replica of one of the flaws fixed in 2.6.0 under
> CVE-2018-12021!  And you don't need to take my word for it, either:
> https://www.openwall.com/lists/oss-security/2019/05/16/1
>
> It's hard to say if the above trend will continue...but not all sites
> can afford to take those kinds of risks.
>
> And while Shifter's security track record is spotless to date, I would
> still summarize the overall lesson to be learned as, "Don't use
> privileged container runtimes.  Use user namespaces.  That's what
> they're there for."  And before anyone yells at me, yes I know
> Singularity advertises user namespace support and non-setuid operation.
> But it doesn't seem to be very widely used or adequately exercised, and
> AFAICT the default mode of operation in both RPMs and build-from-src is
> via setuid binaries.  So using a natively unprivileged runtime still
> seems the less risky choice, in my personal assessment.
>
> Yes, I know that was more than a "2-3 sentence nutshell," but hopefully
> it was helpful anyway! :-)
>
> Michael
>
> --
> Michael E. Jennings 
> HPC Systems Team, Los Alamos National Laboratory
> Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
>


Re: [slurm-users] Heterogeneous HPC

2019-09-19 Thread Fulcomer, Samuel
Hey Michael,

I obviously haven't been keeping up with any security concerns over the use
of Singularity. In a 2-3 sentence nutshell, what are they?

I've been annoyed by NVIDIA's docker distribution for DGX-1 & friends.

We've been setting up an ersatz-secure SIngularity environment for use of
mid-range DUA data like dbGaP.

Regards,
Sam



On Thu, Sep 19, 2019 at 4:38 PM Michael Jennings  wrote:

> On Friday, 20 September 2019, at 00:03:28 (+0430),
> Mahmood Naderan wrote:
>
> > Thanks for the replies. Matlab was an example. I would also like to create
> > two containers for OpenFoam with different versions. Then a user can
> > choose what he actually wants.
>
> All modern container runtimes support the OCI standard container
> format originally authored by Docker, Inc. and contributed to the Open
> Container Initiative (OCI) as the starting point for their standard.
> So your best bet would be to go to Docker Hub (hub.docker.com) and
> search for the applications you're interested in, or (in the case of
> commercial software) ask your vendor if they supply containers for
> their packages and under what terms.
>
> If you're comfortable with building as root, you can likely build your
> own containers without too much trouble, but in order to build
> containers without privilege, you'll need very recent Podman/Buildah
> (or current Charliecloud plus Spokeo and umoci, if your Dockerfile is
> supported by ch-grow).
>
> > I would also like to know, if the technologies you mentioned can be
> > deployed in multinode clusters. Currently, we use Rocks 7. Should I
> > install singularity (or others) on all nodes or just the frontend?
> > And then, can users use "srun" or "salloc" for interactively login
> > to a node and run the container or not?
>
> Most folks invoke the container runtime using srun, either in their
> job script or as part of an interactive session.  There are several
> examples in the Charliecloud docs, for example, here:
>
> https://hpc.github.io/charliecloud/tutorial.html#your-first-single-node-multi-process-jobs
>
> But yes, you will likely need the container runtime installed on every
> node.  Most large HPC centers use Slurm, so you should have no problem
> getting any or all of them to integrate well with your existing Slurm
> installation. :-)
>
> That said, I *do* recommend watching at least that last video before
> you make your final decision on runtime.  With containers, as with any
> technology, you're far more likely to get factual information from
> folks who aren't trying to sell something! ;-)
>
> Having personally deployed, tested, and evaluated over a dozen
> different container solutions -- including every major HPC container
> system as well as implementing a few of my own -- I can tell you with
> absolute certainty that there's no single right answer to "What
> container system should I use?"  There are several correct answers
> depending on your use case and security & UX requirements.
>
> Michael
>
> --
> Michael E. Jennings 
> HPC Systems Team, Los Alamos National Laboratory
> Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
>
>


Re: [slurm-users] Substituions for "see META file" in slurm.spec file of 15.08.11-1 release

2019-07-09 Thread Fulcomer, Samuel
...and for the SchedMD folks, it would be a lot simpler to
drop/disambiguate the "year it was released" first element in the version
number, and just use it as an incrementing major version number.


On Tue, Jul 9, 2019 at 6:42 PM Fulcomer, Samuel 
wrote:

> Hi Pariksheet,
>
> To confirm, "14", "15", "16", and "17" do not denote major versions. For
> example, "17.02" and "17.11" are different major versions. Only "MM.NN"
> denotes a major version. This is somewhat unintuitive, and I've suggested
> some documentation clarification, but it's still somewhat easily missed.
>
> Regards,
> Sam
>
>
>
> On Tue, Jul 9, 2019 at 6:23 PM Pariksheet Nanda <
> pariksheet.na...@gmail.com> wrote:
>
>> Hi Samuel,
>>
>> On Mon, Jul 8, 2019 at 8:19 PM Fulcomer, Samuel <
>> samuel_fulco...@brown.edu> wrote:
>> >
>> > The underlying issue is database schema compatibility/regression. Each
>> upgrade is only intended to provide the capability to successfully upgrade the
>> schema from two versions back.
>> --snip--
>> > ...and you should follow the upgrade instructions on schedmd.com. Note
>> that you need to start the slurmdbd before the slurmctld, and be patient
>> while slurmdbd updates the schema.
>>
>> Thanks for taking the time to share this warning and your experiences!
>> I'm familiar with the limitation of hopping no further than 2 releases
>> at a time due to the DB schema changes and should have mentioned my
>> awareness of that in my original e-mail to not give good Samaritans like
>> you panic attacks.  So sorry for that omission on my part!
>>
>> Past upgrades have been eventful.  I orchestrated our upgrade from SLURM
>> 14 to 15 in May of 2016, and a previous administrator did the upgrade from
>> some prior version to 14.  In my case, for some reason running `make
>> install` omitted installing 2 compiled libraries from the .lib/plugins/
>> directory to the filesystem.  There were also other idiosyncrasies that
>> would have added a lot more time and stress to the outage had I not tried
>> simulating the upgrade first.  It's possible that others on this list have
>> seamless upgrade experiences, but that's the baggage I now carry around.
>>
>>
>> > regards,
>> > s
>>
>> Pariksheet
>>
>>


Re: [slurm-users] Substituions for "see META file" in slurm.spec file of 15.08.11-1 release

2019-07-09 Thread Fulcomer, Samuel
Hi Pariksheet,

To confirm, "14", "15", "16", and "17" do not denote major versions. For
example, "17.02" and "17.11" are different major versions. Only "MM.NN"
denotes a major version. This is somewhat unintuitive, and I've suggested
some documentation clarification, but it's still somewhat easily missed.

Regards,
Sam



On Tue, Jul 9, 2019 at 6:23 PM Pariksheet Nanda 
wrote:

> Hi Samuel,
>
> On Mon, Jul 8, 2019 at 8:19 PM Fulcomer, Samuel 
> wrote:
> >
> > The underlying issue is database schema compatibility/regression. Each
> upgrade is only intended to provide the capability to successfully upgrade the
> schema from two versions back.
> --snip--
> > ...and you should follow the upgrade instructions on schedmd.com. Note
> that you need to start the slurmdbd before the slurmctld, and be patient
> while slurmdbd updates the schema.
>
> Thanks for taking the time to share this warning and your experiences!
> I'm familiar with the limitation of hopping no further than 2 releases
> at a time due to the DB schema changes and should have mentioned my
> awareness of that in my original e-mail to not give good Samaritans like
> you panic attacks.  So sorry for that omission on my part!
>
> Past upgrades have been eventful.  I orchestrated our upgrade from SLURM
> 14 to 15 in May of 2016, and a previous administrator did the upgrade from
> some prior version to 14.  In my case, for some reason running `make
> install` omitted installing 2 compiled libraries from the .lib/plugins/
> directory to the filesystem.  There were also other idiosyncrasies that
> would have added a lot more time and stress to the outage had I not tried
> simulating the upgrade first.  It's possible that others on this list have
> seamless upgrade experiences, but that's the baggage I now carry around.
>
>
> > regards,
> > s
>
> Pariksheet
>
>


Re: [slurm-users] Substituions for "see META file" in slurm.spec file of 15.08.11-1 release

2019-07-08 Thread Fulcomer, Samuel
Hi Pariksheet,

Note that an "upgrade", in the sense that retained information is converted
to new formats, is only relevant for the slurmctld/slurmdbd  (and backup)
node.

If you're planning downtime in which you quiesce job execution (i.e.,
schedule a maintenance reservation), and have image configs to change for
the slurmd/worker nodes, you can just go ahead and bring up the worker nodes
with the latest stable v19 (NB, this assumes that they do not have running
jobs) _after_ you bring up the controller/dbd nodes.

For the controller/dbd nodes, the important consideration is
that you upgrade by no more than 2 releases at a time. We no longer have
any 15.XX tar balls available. Note that a major release is denoted by NN.MM,
where any difference in NN or MM is a different major release. For example,
17.02.xx and 17.11.xx reference two different major releases (17.02 and
17.11). This bit me before when upgrading from 15.MM to 17.11.xx.

The underlying issue is database schema compatibility/regression. Each
upgrade is only intended to provide the capability to successfully upgrade the
schema from two versions back.

So... what you need to do is find the major release versions following
15.08, e.g:

1. 15.08,
2. ?? (We've got a 16.05 tar ball)
3. ?? (probably 17.02)

That will get you into the current tarball versions you can download from
schedmd. If 16.05 is the only major version between 15.08 and 17.02, you
should be able to upgrade directly to 17.02; _however_, I can't confirm
that it is. You'll need to have someone else pipe up to confirm this.

...and you should follow the upgrade instructions on schedmd.com. Note that
you need to start the slurmdbd before the slurmctld, and be patient while
slurmdbd updates the schema.
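
The rough order of operations is something like this (a sketch; the DB name and
paths are examples, not necessarily yours):

    mysqldump -u slurm -p slurm_acct_db > /root/slurm_acct_db.sql   # back up accounting DB
    tar czf /root/statesave.tgz /var/spool/slurmctld                # your StateSaveLocation
    systemctl stop slurmctld slurmdbd
    # ...install the new slurmdbd/slurmctld packages...
    systemctl start slurmdbd     # schema conversion happens here; be patient
    systemctl start slurmctld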

regards,
s

On Mon, Jul 8, 2019 at 3:50 PM Pariksheet Nanda 
wrote:

> Hi SLURM devs,
>
> TL;DR: What magic incantations are needed to preprocess the slurm.spec
> file in SLURM 15?
>
> Our cluster is currently running SLURM version 15.08.11.  We are planning
> some downtime to upgrade to 17 and then to 19, and in preparation for the
> upgrade I'm simulating the upgrade steps in libvirt Vagrant VMs with
> Ansible playbooks.
>
> However the issue I'm running into is using the GitHub tarball [1] the
> slurm.spec file has invalid entries:
>
>
> pan14001@becat-pan ~/src/ansible-hpc-storrs $ vagrant ssh head1
> Last login: Thu Jun  6 14:00:18 2019 from 192.168.121.1
> [vagrant@head1 ~]$ sudo su -
> [root@head1 ~]# cd /tmp/
> [root@head1 tmp]# rpmbuild -ta ~/src/slurm-15-08-11-1.tar.gz
> error: line 89: Tag takes single token only: Name:see META file
> [root@head1 tmp]# tar --strip-components=1 -xf
> ~/src/slurm-15-08-11-1.tar.gz slurm-slurm-15-08-11-1/slurm.spec
> [root@head1 tmp]# grep -F META slurm.spec
> Name:see META file
> Version: see META file
> Release: see META file
> [root@head1 tmp]#
>
>
> In the past when we installed SLURM we used the tarballs from the
> slurm.schedmd.com website which behaved differently.  But I see those
> tarballs have been removed due to the security vulnerability
> (CVE-2018-10995); all versions of Slurm prior to 17.02.11 or 17.11.7 are no
> longer available for download from the SchedMD website.
>
> Presumably there is some preprocessing step to substitute the "see META
> file" comments strings?  I'm not able to find any build automation that
> processes the slurm.spec file.
>
> Pariksheet
>
> [1] https://github.com/SchedMD/slurm/archive/slurm-15-08-11-1.tar.gz
>


Re: [slurm-users] Configure Slurm 17.11.9 in Ubuntu 18.10 with use of PMI

2019-06-20 Thread Fulcomer, Samuel
Hi Palle,

You should  probably get the latest stable SLURM version from
www.schedmd.com and use the build/install instructions found there. Note
that you should check for WARNING messages in the config.log produced by
SLURM's configure, as they're the best place to find that you're missing
packages that may be useful.

When configuring OpenMPI, you'll want to use "--with-pmi=/usr/local" if you
build SLURM and install it into /usr/local. You'll probably also want
"--enable-mpi-cxx".

Regards,
Sam

On Thu, Jun 20, 2019 at 12:33 PM Pär Lundö  wrote:

> Dear all,
>
>
> I have been following this mailinglist for some time, and as a complete
> newbie using Slurm I have learned some lessons from you.
>
> I have an issue with building and configuring Slurm to use OpenMPI.
>
> When running srun for some task I get the error stating that Slurm has not
> been built or configured to use MPI and I am advised to rebuild it
> accordingly.
>
> I have taken the following steps in order to configure and build Slurm
> with OpenMPI (or PMI2, it really doesn't matter for me right now, I just
> want to know how such a configuration should be made).
>
>1. Download source-code via "apt-get source slurm-llnl" (current
>version for Ubuntu 18.10 is 17.11.9)
>2. Extracted the source code from the slurm-llnl_17.11.9-1.dsc"
>3. cd to source dir
>   1. First I ran the following steps:
>  1. "./configure --with-pmi"
>  2. "debuild -i -us -uc -b" -> Fails.
>   2. I then ran the following steps (noting that the
>   "debuild-command" overwrites some configuration, thus I added 
> "--with-pmi"
>   for that case):
>  1. debuild -i -us -uc -b" -> Fails
>
>
> I followed the same procedure when configuring OpenMPI to be built with
> Slurm, which worked after some back and forth with clean commands.
>
>
> Any suggestions as to why this does not work?
>
> I must be missing out on something very basic, because Slurm must surely
> be used with Ubuntu and OpenMPI .
>
> Best regards,
>
> Palle
>


Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread Fulcomer, Samuel
...ah, got it. I was confused by "PI/Lab nodes" in your partition list.

Our QoS/account pair for each investigator condo is our approximate
equivalent of what you're doing with owned partitions.

Since we have everything in one partition we segregate processor types via
topology.conf. We break up topology.conf further to keep MPI jobs on the
same switch.

On another topic, how do you address code optimization for processor type?
We've been mostly linking with MKL and relying on its muti-code-path.

Regards,
Sam

On Thu, Jun 20, 2019 at 10:20 AM Paul Edmon  wrote:

> People will specify which partition they need or if they want multiple
> they use this:
>
> #SBATCH -p general,shared,serial_requeue
>
> As then the scheduler will just select which partition they will run in
> first.  Naturally there is a risk that you will end up running in a more
> expensive partition.
>
> Our time limit is only applied to our public partitions, our owned
> partitions (of which we have roughly 80) have no time limit.  So if they
> run on their dedicated resources they have no penalty.  We've been working
> on getting rid of owned partitions and moving to a school/department based
> partition, where all the purchased resources for different PI's go into the
> same bucket where they compete against themselves and not the wider
> community.  We've found that this ends up working pretty well as most PI's
> only used their purchased resources sporadically.  Thus there are usually
> idle cores lying around that we backfill with our serial queues.  Since
> those are requeueable we can get immediate response to access that idle
> space.  We are also toying with a high priority partition that is open to
> people with high fairshare so that they can get immediate response as those
> with high fairshare tend to be bursty users.
>
> Our current halflife is set to a month and we keep 6 months of data in our
> database.  I'd actually like to get rid of the halflife and just go to a 3
> month moving window to allow people to bank their fairshare, but we haven't
> done that yet as people have been having a hard enough time understanding
> our current system.  It's not due to its complexity but more that most
> people just flat out aren't cognizant of their usage and think the resource
> is functionally infinite.
>
> -Paul Edmon-
> On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
>
> Hi Paul,
>
> Thanks. Your setup is interesting. I see that you have your processor
> types segregated in their own partitions (with the exception of the
> requeue partition), and that's how you get at the weighting mechanism. Do
> you have your users explicitly specify multiple partitions in the batch
> commands/scripts in order to take advantage of this, or do you use a plugin
> for it?
>
> It sounds like you don't impose any hard limit on simultaneous resource
> use, and allow everything to fairshare out with the help of the 7 day
> TimeLimit. We haven't been imposing any TimeLimit on our condo users, which
> would be an issue for us with your config. For our exploratory and priority
> users, we impose an effective time limit with GrpTRESRunMins=cpu (and
> gres/gpu= for the GPU usage). In addition, since we have so many priority
> users, we don't explicitly set a rawshare value for them (they all execute
> under the "default" account). We set rawshare for the condo accounts as
> cores-purchased/total-cores*1000.
>
> What's your fairshare decay setting (don't remember the proper name at the
> moment)?
>
> Regards,
> Sam
>
>
>
> On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon  wrote:
>
>> We do a similar thing here at Harvard:
>>
>> https://www.rc.fas.harvard.edu/fairshare/
>>
>> We simply weight all the partitions based on their core type and then we
>> allocate Shares for each account based on what they have purchased.  We
>> don't use QoS at all, so we just rely purely on fairshare weighting for
>> resource usage.  It has worked pretty well for our purposes.
>>
>> -Paul Edmon-
>> On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>>
>>
>> (...and yes, the name is inspired by a certain OEM's software licensing
>> schemes...)
>>
>> At Brown we run a ~400 node cluster containing nodes of multiple
>> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
>> some cases by University funds and in others by investigator funding
>> (~50:50).  They all appear in the default SLURM partition. We have 3
>> classes of SLURM users:
>>
>>
>>1. Exploratory - no-charge access to up to 16 cores
>>2. Priority - $750/quarter for access to up to 192 cores (and with a
>>GrpTRESRunMins=cpu limit). E

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
Hi Paul,

Thanks. Your setup is interesting. I see that you have your processor types
segregated in their own partitions (with the exception of the requeue
partition), and that's how you get at the weighting mechanism. Do you have
your users explicitly specify multiple partitions in the batch
commands/scripts in order to take advantage of this, or do you use a plugin
for it?

It sounds like you don't impose any hard limit on simultaneous resource
use, and allow everything to fairshare out with the help of the 7 day
TimeLimit. We haven't been imposing any TimeLimit on our condo users, which
would be an issue for us with your config. For our exploratory and priority
users, we impose an effective time limit with GrpTRESRunMins=cpu (and
gres/gpu= for the GPU usage). In addition, since we have so many priority
users, we don't explicitly set a rawshare value for them (they all execute
under the "default" account). We set rawshare for the condo accounts as
cores-purchased/total-cores*1000.
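
Mechanically it's all sacctmgr (a sketch; the QoS/account names and numbers
here are made up):

    # a priority user's QoS: cap cpu-minutes (and gpu-minutes) in flight
    sacctmgr modify qos where name=jsmith-priority set GrpTRESRunMins=cpu=1440000,gres/gpu=20000

    # a condo account: rawshare proportional to cores purchased
    sacctmgr modify account where name=condo-lab1 set fairshare=120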

What's your fairshare decay setting (don't remember the proper name at the
moment)?

Regards,
Sam



On Wed, Jun 19, 2019 at 3:44 PM Paul Edmon  wrote:

> We do a similar thing here at Harvard:
>
> https://www.rc.fas.harvard.edu/fairshare/
>
> We simply weight all the partitions based on their core type and then we
> allocate Shares for each account based on what they have purchased.  We
> don't use QoS at all, so we just rely purely on fairshare weighting for
> resource usage.  It has worked pretty well for our purposes.
>
> -Paul Edmon-
> On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
>
>
> (...and yes, the name is inspired by a certain OEM's software licensing
> schemes...)
>
> At Brown we run a ~400 node cluster containing nodes of multiple
> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
> some cases by University funds and in others by investigator funding
> (~50:50).  They all appear in the default SLURM partition. We have 3
> classes of SLURM users:
>
>
>1. Exploratory - no-charge access to up to 16 cores
>2. Priority - $750/quarter for access to up to 192 cores (and with a
>GrpTRESRunMins=cpu limit). Each user has their own QoS
>3. Condo - an investigator group who paid for nodes added to the
>cluster. The group has its own QoS and SLURM Account. The QoS allows use of
>the number of cores purchased and has a much higher priority than the QoS'
>of the "priority" users.
>
> The first problem with this scheme is that condo users who have purchased
> the older hardware now have access to the newest without penalty. In
> addition, we're encountering resistance to the idea of turning off their
> hardware and terminating their condos (despite MOUs stating a 5yr life).
> The pushback is the stated belief that the hardware should run until it
> dies.
>
> What I propose is a new TRES called a Processor Performance Unit (PPU)
> that would be specified on the Node line in slurm.conf, and used such that
> GrpTRES=ppu=N was calculated as the number of allocated cores multiplied by
> their associated PPU numbers.
>
> We could then assign a base PPU to the oldest hardware, say, "1" for
> Sandy/Ivy and increase for later architectures based on performance
> improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,..., where N is
> the number of cores of the oldest architecture multiplied by the configured
> PPU/core, X, and repeat for any newer nodes/cores the investigator has
> purchased since.
>
> The result is that the investigator group gets to run on an approximation
> of the performance that they've purchased, rather on the raw purchased core
> count.
>
> Thoughts?
>
>
>


Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
Hi Alex,

Thanks. The issue is that we don't know where they'll end up running in the
heterogenous environment. In addition, because the limit is applied by
GrpTRES=cpu=N, someone buying 100 cores today shouldn't get access to 130
of today's cores.

Regards,
Sam

On Wed, Jun 19, 2019 at 3:41 PM Alex Chekholko  wrote:

> Hey Samuel,
>
> Can't you just adjust the existing "cpu" limit numbers using those same
> multipliers?  Someone bought 100 CPUs 5 years ago, now that's ~70 CPUs.
>
> Or vice versa, someone buys 100 CPUs today, they get a setting of 130 CPUs
> because the CPUs are normalized to the old performance.  Since it would
> probably look bad politically to reduce someone's number, but giving a new
> customer a larger number should be fine.
>
> Regards,
> Alex
>
> On Wed, Jun 19, 2019 at 12:32 PM Fulcomer, Samuel <
> samuel_fulco...@brown.edu> wrote:
>
>>
>> (...and yes, the name is inspired by a certain OEM's software licensing
>> schemes...)
>>
>> At Brown we run a ~400 node cluster containing nodes of multiple
>> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
>> some cases by University funds and in others by investigator funding
>> (~50:50).  They all appear in the default SLURM partition. We have 3
>> classes of SLURM users:
>>
>>
>>1. Exploratory - no-charge access to up to 16 cores
>>2. Priority - $750/quarter for access to up to 192 cores (and with a
>>GrpTRESRunMins=cpu limit). Each user has their own QoS
>>3. Condo - an investigator group who paid for nodes added to the
>>cluster. The group has its own QoS and SLURM Account. The QoS allows use 
>> of
>>the number of cores purchased and has a much higher priority than the QoS'
>>of the "priority" users.
>>
>> The first problem with this scheme is that condo users who have purchased
>> the older hardware now have access to the newest without penalty. In
>> addition, we're encountering resistance to the idea of turning off their
>> hardware and terminating their condos (despite MOUs stating a 5yr life).
>> The pushback is the stated belief that the hardware should run until it
>> dies.
>>
>> What I propose is a new TRES called a Processor Performance Unit (PPU)
>> that would be specified on the Node line in slurm.conf, and used such that
>> GrpTRES=ppu=N was calculated as the number of allocated cores multiplied by
>> their associated PPU numbers.
>>
>> We could then assign a base PPU to the oldest hardware, say, "1" for
>> Sandy/Ivy and increase for later architectures based on performance
>> improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,..., where N is
>> the number of cores of the oldest architecture multiplied by the configured
>> PPU/core, X, and repeat for any newer nodes/cores the investigator has
>> purchased since.
>>
>> The result is that the investigator group gets to run on an approximation
>> of the performance that they've purchased, rather on the raw purchased core
>> count.
>>
>> Thoughts?
>>
>>
>>


[slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
(...and yes, the name is inspired by a certain OEM's software licensing
schemes...)

At Brown we run a ~400 node cluster containing nodes of multiple
architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in
some cases by University funds and in others by investigator funding
(~50:50).  They all appear in the default SLURM partition. We have 3
classes of SLURM users:


   1. Exploratory - no-charge access to up to 16 cores
   2. Priority - $750/quarter for access to up to 192 cores (and with a
   GrpTRESRunMins=cpu limit). Each user has their own QoS
   3. Condo - an investigator group who paid for nodes added to the
   cluster. The group has its own QoS and SLURM Account. The QoS allows use of
   the number of cores purchased and has a much higher priority than the QoS'
   of the "priority" users.

The first problem with this scheme is that condo users who have purchased
the older hardware now have access to the newest without penalty. In
addition, we're encountering resistance to the idea of turning off their
hardware and terminating their condos (despite MOUs stating a 5yr life).
The pushback is the stated belief that the hardware should run until it
dies.

What I propose is a new TRES called a Processor Performance Unit (PPU) that
would be specified on the Node line in slurm.conf, and used such that
GrpTRES=ppu=N was calculated as the number of allocated cores multiplied by
their associated PPU numbers.

We could then assign a base PPU to the oldest hardware, say, "1" for
Sandy/Ivy and increase for later architectures based on performance
improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,..., where N is
the number of cores of the oldest architecture multiplied by the configured
PPU/core, X, and repeat for any newer nodes/cores the investigator has
purchased since.
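
To make that concrete, the configuration might look something like this
(entirely hypothetical; no such TRES exists today, and the names and numbers
are invented):

    # slurm.conf
    NodeName=sandy[001-100] CPUs=16 RealMemory=64000  PPU=1.0
    NodeName=sky[001-050]   CPUs=32 RealMemory=192000 PPU=1.6

    # condo QoS sized as sum(cores purchased * PPU)
    sacctmgr modify qos where name=condo-lab1 set GrpTRES=ppu=1600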

The result is that the investigator group gets to run on an approximation
of the performance that they've purchased, rather on the raw purchased core
count.

Thoughts?


Re: [slurm-users] MaxTRESRunMinsPU not yet enabled - similar options?

2019-05-20 Thread Fulcomer, Samuel
On Mon, May 20, 2019 at 2:59 PM  wrote:

>
>
>
> I did test setting GrpTRESRunMins=cpu=N for each user + account
> association, and that does appear to work. Does anyone know of any other
> solutions to this issue?


No. Your solution is what we currently do. A "...PU" would be a nice, tidy
addition for the QOS entity.
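
For anyone searching the archives later, the per-association form is just
something like (name and number are made up):

    sacctmgr modify user where name=jsmith account=default set GrpTRESRunMins=cpu=1440000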

regards,
s

>
> Thanks,
> Jesse Stroik
>
>


Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Fulcomer, Samuel
We went straight to ESSL. It also has FFTs and selected LAPACK, some with
GPU support (
https://www-01.ibm.com/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_sm/1/872/ENUS5765-L61/index.html=en_locale=en
).

I also try to push people to use MKL on Intel, as it has multi-code-path
execution (we have a mix of architectures in our default patch partition).

On Tue, Apr 16, 2019 at 1:59 PM Prentice Bisbal  wrote:

> Thanks for the info. Did you try building/using any of the open-source
> math libraries for Power9, like OpenBLAS, or did you just use ESSL for
> everything?
>
> Prentice
>
> On 4/16/19 1:12 PM, Fulcomer, Samuel wrote:
>
> We had an AC921 and AC922 for a while as loaners.
>
> We had no problems with SLURM.
>
> Getting POWERAI running correctly (bugs since fixed in newer release) and
> apps properly built and linked to ESSL was the long march.
>
> regards,
> s
>
> On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal  wrote:
>
>> Sergi,
>>
>> I'm working with Bill on this project. Is all the hardware
>> identification/mapping and task affinity working as expected/desired
>> with the Power9? I assume your answer implies "yes", but I just want to
>> make sure.
>>
>> Prentice
>>
>> On 4/16/19 10:37 AM, Sergi More wrote:
>> > Hi,
>> >
>> > We have a Power9 cluster (AC922) working without problems. Now with
>> > 18.08, but have been running as well with 17.11. No extra
>> > steps/problems found during installation because of Power9.
>> >
>> > Thank you,
>> > Sergi.
>> >
>> >
>> > On 16/04/2019 16:05, Bill Wichser wrote:
>> >> Does anyone on this list run Slurm on the Sierra-like machines from
>> >> IBM?  I believe they are the ACC922 nodes.  We are looking to
>> >> purchase a small cluster of these nodes but have concerns about the
>> >> scheduler.
>> >>
>> >> Just looking for a nod that, yes it works fine, as well as any issues
>> >> seen during deployment.  Danny says he has heard of no problems but
>> >> that doesn't mean the folks in the trenches haven't seen issues!
>> >>
>> >> Thanks,
>> >> Bill
>> >>
>>
>>


Re: [slurm-users] Power9 ACC922

2019-04-16 Thread Fulcomer, Samuel
We had an AC921 and AC922 for a while as loaners.

We had no problems with SLURM.

Getting POWERAI running correctly (bugs since fixed in newer release) and
apps properly built and linked to ESSL was the long march.

regards,
s

On Tue, Apr 16, 2019 at 12:59 PM Prentice Bisbal  wrote:

> Sergi,
>
> I'm working with Bill on this project. Is all the hardware
> identification/mapping and task affinity working as expected/desired
> with the Power9? I assume your answer implies "yes", but I just want to
> make sure.
>
> Prentice
>
> On 4/16/19 10:37 AM, Sergi More wrote:
> > Hi,
> >
> > We have a Power9 cluster (AC922) working without problems. Now with
> > 18.08, but have been running as well with 17.11. No extra
> > steps/problems found during installation because of Power9.
> >
> > Thank you,
> > Sergi.
> >
> >
> > On 16/04/2019 16:05, Bill Wichser wrote:
> >> Does anyone on this list run Slurm on the Sierra-like machines from
> >> IBM?  I believe they are the ACC922 nodes.  We are looking to
> >> purchase a small cluster of these nodes but have concerns about the
> >> scheduler.
> >>
> >> Just looking for a nod that, yes it works fine, as well as any issues
> >> seen during deployment.  Danny says he has heard of no problems but
> >> that doesn't mean the folks in the trenches haven't seen issues!
> >>
> >> Thanks,
> >> Bill
> >>
>
>


Re: [slurm-users] Topology configuration questions:

2019-01-17 Thread Fulcomer, Samuel
Yes, well, the trivial cat-skinning method is to use topology.conf to
describe multiple switch topologies confining each architecture to their
meta-fabric. We use GPFS as a parallel filesystem, and all nodes are
connected, but topology.conf keeps jobs on uniform-architecture collectives.

On Thu, Jan 17, 2019 at 8:05 PM Nicholas McCollum  wrote:

> I recommend putting heterogeneous node types each into their own partition
> to keep jobs from spanning multiple node types.  You can also set QoS's for
> different partitions and make jobs in that QoS only able to be scheduled
> on a single node (nodes=1).  You could also accomplish this with a partition
> config in your slurm.conf... or use the job_submit.lua plugin to capture jobs
> submitted to that partition and change max nodes to 1.
> There are a lot of easy ways to skin that cat.
>
> I personally like using the submit all jobs to all partitions plugin and
> having users constrain to specific types of nodes using the
> --constraint=whatever flag.
>
>
> Nicholas McCollum
> Alabama Supercomputer Authority
> ------
> *From:* "Fulcomer, Samuel" 
> *Sent:* Thursday, January 17, 2019 5:58 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Topology configuration questions:
>
> We use topology.conf to segregate architectures (Sandy->Skylake), and also
> to isolate individual nodes with 1Gb/s Ethernet rather than IB (older GPU
> nodes with deprecated IB cards). In the latter case, topology.conf had a
> switch entry for each node.
>
> It used to be the case that SLURM was unhappy with nodes defined in
> slurm.conf not appearing in topology.conf. This may have changed
>
> On Thu, Jan 17, 2019 at 6:37 PM Ryan Novosielski 
> wrote:
>
>> I don’t actually know the answer to this one, but we have it provisioned
>> to all nodes.
>>
>> Note that if you care about node weights (eg. NodeName=whatever001
>> Weight=2, etc. in slurm.conf), using the topology function will disable it.
>> I believe I was promised a warning about that in the future in a
>> conversation with SchedMD.
>>
>> > On Jan 17, 2019, at 4:52 PM, Prentice Bisbal  wrote:
>> >
>> > And a follow-up question: Does topology.conf need to be on all the
>> nodes, or just the slurm controller? It's not clear from that web page. I
>> would assume only the controller needs it.
>> >
>> > Prentice
>> >
>> > On 1/17/19 4:49 PM, Prentice Bisbal wrote:
>> >> From https://slurm.schedmd.com/topology.html:
>> >>
>> >>> Note that compute nodes on switches that lack a common parent switch
>> can be used, but no job will span leaf switches without a common parent
>> (unless the TopologyParam=TopoOptional option is used). For example, it is
>> legal to remove the line "SwitchName=s4 Switches=s[0-3]" from the above
>> topology.conf file. In that case, no job will span more than four compute
>> nodes on any single leaf switch. This configuration can be useful if one
>> wants to schedule multiple physical clusters as a single logical cluster
>> under the control of a single slurmctld daemon.
>> >>
>> >> My current environment falls into the category of multiple physical
>> clusters being treated as a single logical cluster under the control of a
>> single slurmctld daemon. At least, that's my goal.
>> >>
>> >> In my environment, I have 2 "clusters" connected by their own separate
>> IB fabrics, and one "cluster" connected with 10 GbE. I have a fourth
>> cluster connected with only 1GbE. For this 4th cluster, we don't want jobs
>> to span nodes, due to the slow performance of 1 GbE. (This cluster is
>> intended for serial and low-core count parallel jobs) If I just leave those
>> nodes out of the topology.conf file, will that have the desired effect of
>> not allocating multi-node jobs to those nodes, or will it result in an
>> error of some sort?
>> >>
>> >
>>
>>


Re: [slurm-users] Topology configuration questions:

2019-01-17 Thread Fulcomer, Samuel
We use topology.conf to segregate architectures (Sandy->Skylake), and also
to isolate individual nodes with 1Gb/s Ethernet rather than IB (older GPU
nodes with deprecated IB cards). In the latter case, topology.conf had a
switch entry for each node.

It used to be the case that SLURM was unhappy with nodes defined in
slurm.conf not appearing in topology.conf. This may have changed.
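
For illustration, a topology.conf along those lines (a sketch; switch and node
names are made up):

    # leaf switches grouped per architecture, with no common top-level switch
    SwitchName=sandy-s1  Nodes=sandy[001-032]
    SwitchName=sandy-s2  Nodes=sandy[033-064]
    SwitchName=sky-s1    Nodes=sky[001-032]
    # 1GbE-only GPU nodes: one "switch" per node so multi-node jobs never land there
    SwitchName=eth-gpu1  Nodes=gpu001
    SwitchName=eth-gpu2  Nodes=gpu002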

On Thu, Jan 17, 2019 at 6:37 PM Ryan Novosielski 
wrote:

> I don’t actually know the answer to this one, but we have it provisioned
> to all nodes.
>
> Note that if you care about node weights (eg. NodeName=whatever001
> Weight=2, etc. in slurm.conf), using the topology function will disable it.
> I believe I was promised a warning about that in the future in a
> conversation with SchedMD.
>
> > On Jan 17, 2019, at 4:52 PM, Prentice Bisbal  wrote:
> >
> > And a follow-up question: Does topology.conf need to be on all the
> nodes, or just the slurm controller? It's not clear from that web page. I
> would assume only the controller needs it.
> >
> > Prentice
> >
> > On 1/17/19 4:49 PM, Prentice Bisbal wrote:
> >> From https://slurm.schedmd.com/topology.html:
> >>
> >>> Note that compute nodes on switches that lack a common parent switch
> can be used, but no job will span leaf switches without a common parent
> (unless the TopologyParam=TopoOptional option is used). For example, it is
> legal to remove the line "SwitchName=s4 Switches=s[0-3]" from the above
> topology.conf file. In that case, no job will span more than four compute
> nodes on any single leaf switch. This configuration can be useful if one
> wants to schedule multiple physical clusters as a single logical cluster
> under the control of a single slurmctld daemon.
> >>
> >> My current environment falls into the category of multiple physical
> clusters being treated as a single logical cluster under the control of a
> single slurmctld daemon. At least, that's my goal.
> >>
> >> In my environment, I have 2 "clusters" connected by their own separate
> IB fabrics, and one "cluster" connected with 10 GbE. I have a fourth
> cluster connected with only 1GbE. For this 4th cluster, we don't want jobs
> to span nodes, due to the slow performance of 1 GbE. (This cluster is
> intended for serial and low-core count parallel jobs) If I just leave those
> nodes out of the topology.conf file, will that have the desired effect of
> not allocating multi-node jobs to those nodes, or will it result in an
> error of some sort?
> >>
> >
>
>


Re: [slurm-users] How to delete an association

2019-01-03 Thread Fulcomer, Samuel
Great.

Yes, I forgot to mention that running or pending jobs can prevent deletion
of this information. This makes scripting/automating all the sacctmgr
functions somewhat difficult.

regards,
Sam
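
As a sketch of one way to script around the "jobs block the delete" problem
mentioned above (user/partition/account names taken from this thread, with
sacctmgr's -i flag used to skip the confirmation prompt):

    # only drop the association if the user has no running/pending jobs on it
    if [ -z "$(squeue -h -u clschf -p k80 -A acct-clschf)" ]; then
        sacctmgr -i delete user where name=clschf partition=k80 account=acct-clschf
    else
        echo "clschf still has jobs in k80/acct-clschf; try again later" >&2
    fi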

On Thu, Jan 3, 2019 at 10:18 AM Jianwen Wei  wrote:

> Thank you, Samuel. I've successfully delete the association with the
> following command after the users' jobs completes.
>
> # sacctmgr delete user where name=clschf partition=k80 account=acct-clschf
>
> Best,
>
> Jianwen
>
> On Dec 29, 2018, at 11:50, Fulcomer, Samuel 
> wrote:
>
> ...right. An association isn't an "entity". You want to delete a "user"
> where name=clschf partition=k80 account=acct-clschf .
>
> This won't entirely delete the user entity, only the record/association
> matching the name/partition/account spec.
>
> The foundation of SLURM nomenclature has some unfortunate choices.
>
> On Fri, Dec 28, 2018 at 10:02 PM Jianwen Wei 
> wrote:
>
>> Hi,
>>
>> I want to purge resource limit set by an association before, say
>>
>> [root@slurm1]~# sacctmgr show asso partition=k80 account=acct-clschf
>>    Cluster    Account   User Partition Share                  QOS Def QOS
>> ---------- ---------- ------ --------- ----- -------------------- -------
>>     sjtupi acct-clsc+ clschf       k80   100 normal,qoslong,qosp+  normal
>> (all Grp*/Max* limit columns are empty)
>>
>>
>> However, according to  https://slurm.schedmd.com/sacctmgr.html "Add,
>> modify, and delete should be done to a user, account or cluster entity.
>> This will in-turn update the underlying associations." . Individual
>> associations can not be deleted. Am I right?
>>
>> Best,
>>
>> Jianwen
>>
>
>


Re: [slurm-users] How to delete an association

2018-12-28 Thread Fulcomer, Samuel
...right. An association isn't an "entity". You want to delete a "user"
where name=clschf partition=k80 account=acct-clschf .

This won't entirely delete the user entity, only the record/association
matching the name/partition/account spec.

The foundation of SLURM nomenclature has some unfortunate choices.
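
For reference, a narrower way to inspect just the association in question is
to restrict the output columns; a sketch (flag and format names as accepted by
sacctmgr):

    sacctmgr -nP show assoc where user=clschf partition=k80 account=acct-clschf \
        format=cluster,account,user,partition,qos,defaultqos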

On Fri, Dec 28, 2018 at 10:02 PM Jianwen Wei  wrote:

> Hi,
>
> I want to purge resource limit set by an association before, say
>
> [root@slurm1]~# sacctmgr show asso partition=k80 account=acct-clschf
>    Cluster    Account   User Partition Share                  QOS Def QOS
> ---------- ---------- ------ --------- ----- -------------------- -------
>     sjtupi acct-clsc+ clschf       k80   100 normal,qoslong,qosp+  normal
> (all Grp*/Max* limit columns are empty)
>
>
> However, according to  https://slurm.schedmd.com/sacctmgr.html "Add,
> modify, and delete should be done to a user, account or cluster entity.
> This will in-turn update the underlying associations." . Individual
> associations can not be deleted. Am I right?
>
> Best,
>
> Jianwen
>


Re: [slurm-users] Looking for old SLURM versions

2018-10-25 Thread Fulcomer, Samuel
We've got 15.08.8/9.

-s

On Wed, Oct 24, 2018 at 5:51 PM, Bob Healey  wrote:

> I'm in the process of upgrading a system that has been running 2.5.4 for
> the last 5 years with no issues.  I'd like to bring that up to something
> current, but I need a bunch of older versions that do not appear to be
> online any longer to successfully migrate the database from ancient to
> current.  I think the only major version I'm missing is 15.x.x - Can I get
> from where I am to 17.02.11 with what I've got, and if not, does anyone
> have copies of the versions I'm missing to get to where I want to be?
>
> I've got:
>
> slurm-14.11.3.tar.bz2
> slurm-14.11.6.tar.bz2
> slurm-14.11.8.tar.bz2
> slurm-14.11.9.tar.bz2
> slurm-16.05.7.tar.bz2
> slurm-17.02.10.tar.bz2
> slurm-17.02.11.tar.bz2
> slurm-17.02.6.tar.bz2
> slurm-17.02.6.tar.bz2
> slurm-17.02.9.tar.bz2
> slurm-2.6.4.tar.bz2
>
>
> --
> Bob Healey
> Systems Administrator
> Office of Research and
> Scientific Computation Research Center
> hea...@rpi.edu
> (518) 276-6022
>
>
>
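
For what it's worth, the usual pattern for the kind of multi-hop database
migration described above is to build each intermediate version into its own
prefix and run slurmdbd once in the foreground at every hop, so it converts the
accounting database before you move on. A rough sketch, with made-up install
prefixes, the default database name, and a backup taken first:

    # back up the accounting database before touching anything
    mysqldump -u root -p slurm_acct_db > slurm_acct_db.backup.sql

    # at each hop, run slurmdbd in the foreground, wait for the schema
    # conversion to finish, then stop it and move to the next version
    /opt/slurm-2.6.4/sbin/slurmdbd -D -vvv
    /opt/slurm-14.11.9/sbin/slurmdbd -D -vvv
    /opt/slurm-16.05.7/sbin/slurmdbd -D -vvv
    /opt/slurm-17.02.11/sbin/slurmdbd -D -vvv

How many major releases each slurmdbd will convert across varies, so check the
RELEASE_NOTES for every version rather than trusting this sketch.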


Re: [slurm-users] network/communication failure

2018-05-21 Thread Fulcomer, Samuel
Is there a firewall turned on? What does "iptables -L -v" report on the
three hosts?
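
As a quick sketch of what to check, assuming the default SlurmctldPort=6817 and
SlurmdPort=6818 and a firewalld-managed host (adjust for your ports and
tooling):

    # any DROP/REJECT rules touching the Slurm ports?
    iptables -L -v -n | grep -E '6817|6818'
    # can a compute host actually reach slurmctld on the master?
    nc -zv triumph01 6817
    # if firewalld is in use, open the default slurmctld/slurmd ports
    firewall-cmd --permanent --add-port=6817-6818/tcp
    firewall-cmd --reload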

On Mon, May 21, 2018 at 11:05 AM, Turner, Heath  wrote:

> If anyone has advice, I would really appreciate...
>
> I am running (just installed) slurm-17.11.6, with a master + 2 hosts.  It
> works locally on the master (controller + execution).  However, I cannot
> establish communication from master [triumph01] with the 2 hosts
> [triumph02,triumph03].  Here is some more info:
>
> 1. munge is running, and munge verification tests all pass.
> 2. system clocks are in sync on master/hosts.
> 3. identical slurm.conf files are on master/hosts.
> 4. configuration of resources (memory/cpus/etc) are correct and have been
> confirmed on all machines (all hardware is identical).
> 5. I have attached:
> a) slurm.conf
> b) log file from master slurmctld
> c) log file from host slurmd
>
> Any ideas about what to try next?
>
> Heath Turner
>
> Professor
> Graduate Coordinator
> Chemical and Biological Engineering
> http://che.eng.ua.edu
>
> University of Alabama
> 3448 SEC, Box 870203
> Tuscaloosa, AL  35487
> (205) 348-1733 (phone)
> (205) 561-7450 (cell)
> (205) 348-7558 (fax)
> htur...@eng.ua.edu
> http://turnerresearchgroup.ua.edu
>
>


Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread Fulcomer, Samuel
This came up around 12/17, I think, and as I recall the fixes were added to
the src repo then; however, they weren't added to any of the 17.x releases.
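
For reference on the symptom described in the quoted message below: device
confinement of this kind is handled by the cgroup task plugin, so a minimal
cgroup.conf sketch (assuming TaskPlugin=task/cgroup is set in slurm.conf) looks
like:

    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

When ConstrainDevices is not in effect (or is broken, as with the bug noted
above), the only barrier is CUDA_VISIBLE_DEVICES, which a user can simply
unset.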

On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand  wrote:

> I dug into the logs on both the slurmctld side and the slurmd side.
> For the record, I have debug2 set for both and
> DebugFlags=CPU_BIND,Gres.
>
> I cannot see much that is terribly relevant in the logs.  There's a
> known parameter error reported with the memory cgroup specifications,
> but I don't think that is germane.
>
> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
> as:
>
> [2018-05-02T08:47:04.916] [203.0] debug:  Allowing access to device
> /dev/nvidia0 for job
> [2018-05-02T08:47:04.916] [203.0] debug:  Not allowing access to
> device /dev/nvidia1 for job
>
> However, I can still "see" both devices from nvidia-smi, and I can
> still access both if I manually unset CUDA_VISIBLE_DEVICES.
>
> When I do *not* specify --gres at all, there is no reference to gres,
> gpu, nvidia, or anything similar in any log at all.  And, of course, I
> have full access to both GPUs.
>
> I am happy to attach the snippets of the relevant logs, if someone
> more knowledgeable wants to pour through them.  I can also set the
> debug level higher, if you think that would help.
>
>
> Assuming upgrading will solve our problem, in the meantime:  Is there
> a way to ensure that the *default* request always has "--gres=gpu:1"?
> That is, this situation is doubly bad for us: not only is there *a way*
> around the resource management of the device, but the *DEFAULT* behavior,
> when a user issues an srun/sbatch without specifying a Gres, is to bypass
> the resource manager entirely.
>
>
>
> On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel 
> wrote:
> > On 02/05/18 10:15, R. Paul Wiegand wrote:
> >
> >> Yes, I am sure they are all the same.  Typically, I just scontrol
> >> reconfig; however, I have also tried restarting all daemons.
> >
> >
> > Understood. Any diagnostics in the slurmd logs when trying to start
> > a GPU job on the node?
> >
> >> We are moving to 7.4 in a few weeks during our downtime.  We had a
> >> QDR -> OFED version constraint -> Lustre client version constraint
> >> issue that delayed our upgrade.
> >
> >
> > I feel your pain..  BTW RHEL 7.5 is out now so you'll need that if
> > you need current security fixes.
> >
> >> Should I just wait and test after the upgrade?
> >
> >
> > Well, 17.11.6 will be out by then and will include a fix for a deadlock
> > that some sites hit occasionally, so that will be worth throwing
> > into the mix too.   Do read the RELEASE_NOTES carefully though,
> > especially if you're using slurmdbd!
> >
> >
> > All the best,
> > Chris
> > --
> >  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> >
>
>