[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-06 Thread Daniel Letai via slurm-users


  
  
There is a kubeflow offering that might be of interest:
https://www.dkube.io/post/mlops-on-hpc-slurm-with-kubeflow


I have not tried it myself, no idea how well it works.


Regards,
--Dani_L.


On 05/05/2024 0:05, Dan Healy via slurm-users wrote:

Bright Cluster Manager has some verbiage on their marketing site that they
can manage a cluster running both Kubernetes and Slurm. Maybe I
misunderstood it. But nevertheless, I am encountering groups more
frequently that want to run a stack of containers that need private
container networking.


What’s the current state of using the same HPC cluster for both Slurm and
Kube?

Note: I’m aware that I can run Kube on a single node, but we need more
resources. So ultimately we need a way to have Slurm and Kube exist in the
same cluster, both sharing the full amount of resources and both being
fully aware of resource usage.


  Thanks,

Daniel Healy

  
  
  
  


  


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] Usage of particular GPU out of 4 GPUs while submitting jobs to DGX Server

2023-11-20 Thread Daniel Letai


  
  
Hi Ravi,

On 20/11/2023 6:36, Ravi Konila wrote:

Hello Everyone

My question is related to submission of jobs to those GPUs. How does a
student submit a job to a particular GPU out of the 4 GPUs? For example,
studentA should submit the job to GPU ID 1 instead of GPU ID 0.

  

In classical HPC this is counterproductive - you don't want to assign
specific resources to jobs, as this would lead to jobs waiting needlessly
while resources are available, so I think some background for this request
might help understand the need and possible solutions.


That said, it might be possible by assigning a different artificial type
to each gpu, e.g. in gres.conf Name=gpu Type=gpu0 File=/dev/nvidia0 etc...
Then submission would be of the form
sbatch --gpus=gpu0:1


The issue would be with submitting in the general case, where you want any
gpu. For that you might have to fall back to using gres as in
sbatch --gres=gpu:3


This is obviously cumbersome and less convenient, and I'm not sure this is
not an XY problem.
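

A minimal sketch of that per-type approach (hypothetical node name and
device paths, untested):

# gres.conf - one artificial type per physical GPU
Name=gpu Type=gpu0 File=/dev/nvidia0
Name=gpu Type=gpu1 File=/dev/nvidia1
Name=gpu Type=gpu2 File=/dev/nvidia2
Name=gpu Type=gpu3 File=/dev/nvidia3

# slurm.conf - declare the counts per type
NodeName=dgx01 Gres=gpu:gpu0:1,gpu:gpu1:1,gpu:gpu2:1,gpu:gpu3:1 ...

# submission pinned to "GPU 1"
sbatch --gres=gpu:gpu1:1 job.sh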


  

   
Also we are planning for MIG in the server and we would like a few
students to submit their jobs to the 20G partition and non-critical jobs
to the 5G partition.
How should the slurm.conf and gres.conf look in this case?
  

  

Can you elaborate on the use case? It's unclear to me if the
students are expected to decide on their own when to submit to 20G
and when to 5G, if students with access to 20G should also use the
5G together with the rest of the students, or if all students should
have access to both partitions and some other criteria should be
used to determine placement.

  

   
  Currently our configuration is as below:
   
  gres.conf
  Name=gpu    type=A100    file=/dev/nvidia[0-2,4]
   
  
  slurm.conf
  .
  .
  .
  GresTypes=gpu
NodeName=rl-dgxs-r21-l2 Gres=gpu:A100:4 CPUs=128 RealMemory=50 State=UNKNOWN
PartitionName=LocalGPUQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
   
  -
   
  Any suggestions or help in this regard is highly
appreciated. 
   
  With
Warm Regards
Ravi Konila
  

  

Best regards,
--Dani_L.

  




Re: [slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Daniel Letai

  
  
Not sure about automatically canceling a job array, except
  perhaps by submitting 2 consecutive arrays - first of size 20, and
  the other with the rest of the elements and a dependency of
  afterok. That said, a single job in a job array in Slurm
  documentation is referred to as a task. I personally prefer
  element, as in array element.



Consider creating a batch job with:


arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
sbatch --dependency=afterok:$arrayid --array=20-5 array-job.sh


I'm not near a cluster right now, so can't test for correctness.
  The main drawback is of course if 20 jobs takes a long time to
  complete, and there are enough resources to run more than 20 jobs
  in parallel, all those resources will be wasted for the duration.
  Not a big issue in busy clusters, as some other job will run in
  the meantime, but this will impact completion time of the array,
  if 20 jobs use significantly less than the resources available.



It might be possible to depend on afternotok of the first 20
  tasks, to run --wrap="scancel $arrayid"


Maybe something like:


sbatch --array=1-5 array-job.sh
with

cat array-job.sh


  
#!/bin/bash


srun myjob.sh $SLURM_ARRAY_TASK_ID &

[[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...afternotok:${SLURM_ARRAY_JOB_ID}_20 scancel $SLURM_ARRAY_JOB_ID



  


  Will also work. Untested, use at your own risk.



The other OTHER approach might be to use some epilog (or possibly
  epilogslurmctld) to log exit codes for first 20 tasks in each
  array, and cancel the array if non-zero. This is a global approach
  which will affect all job arrays, so might not be appropriate for
  your use case.
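

A very rough sketch of that EpilogSlurmctld idea (untested; assumes the
EpilogSlurmctld environment exposes SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID
and SLURM_JOB_EXIT_CODE - check your version's prolog/epilog docs - and the
state directory is hypothetical):

#!/bin/bash
# epilogslurmctld.sh - cancel an array once its first ~20 tasks have failed
STATEDIR=/var/spool/slurm/array-watch          # hypothetical scratch location

[[ -z "$SLURM_ARRAY_JOB_ID" ]] && exit 0       # not an array task, ignore
[[ "$SLURM_ARRAY_TASK_ID" -gt 20 ]] && exit 0  # only watch the first tasks

mkdir -p "$STATEDIR"
if [[ "$SLURM_JOB_EXIT_CODE" != "0" ]]; then
    # record this failure, then cancel the whole array if too many failed
    touch "$STATEDIR/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}.failed"
    failures=$(ls "$STATEDIR/${SLURM_ARRAY_JOB_ID}".*.failed 2>/dev/null | wc -l)
    if [[ "$failures" -ge 20 ]]; then
        scancel "$SLURM_ARRAY_JOB_ID"
    fi
fi
exit 0

It would be wired in via EpilogSlurmctld= in slurm.conf and, as noted above,
applies globally to every job array on the cluster.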



On 01/08/2023 16:48:47, Josef Dvoracek wrote:

my users found the beauty of job arrays, and they tend to use it every now
and then.
  
  
  Sometimes human factor steps in, and something is wrong in job
  array specification, and cluster "works" on one failed array job
  after another.
  
  
  Isn't there any way how to automatically stop/scancel/? job array
  after, let say, 20 failed array jobs in row?
  
  
  So far my experience is, if first ~20 array jobs go right, there
  is no catastrophic failure in sbatch-file. If they fail, usually
  it's bad and there is no sense to crunch the remaining thousands
  of job array jobs.
  
  
  OT: what is the correct terminology for one item in job array...
  sub-job? job-array-job? :)
  
  
  cheers
  
  
  josef
  
  
  

-- 
Regards,

--Dani_L.
  




Re: [slurm-users] Slurmdbd High Availability

2023-04-15 Thread Daniel Letai

  
  
My go-to solution is setting up a Galera cluster using 2 slurmdbd servers
(each pointing to its local db) and a 3rd quorum server. It's fairly easy
to set up and doesn't rely on block level duplication, HA semantics or
shared storage.
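

A minimal sketch of what that could look like (hypothetical hostnames;
assumes MariaDB with the Galera 4 provider on db1 and db2, and garbd as the
third, data-less quorum member - paths and package layout differ between
distributions):

# /etc/my.cnf.d/galera.cnf on db1 and db2
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name=slurmdb
wsrep_cluster_address=gcomm://db1,db2,arb1
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

# on the quorum-only node (arb1): run the Galera arbitrator instead of mysqld
garbd --group slurmdb --address gcomm://db1,db2 --daemon

# each slurmdbd.conf then simply points at its local instance
StorageHost=localhost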


Just my 2 cents



On 14/04/2023 14:18, Tina Friedrich
  wrote:

Or run
  your database server on something like VMWare ESXi (which is what
  we do). Instant HA and I don't even need multiple servers for it
  :)
  
  
  I don't mean to be flippant, and I realise it's not addressing the
  mysql HA question (but that got answered). However, a lot of us
  will have some sort of failure-and-load-balancing VM estate
  anyway, or not? Using that does - at least in my mind - solve the
  same problem (just via a slightly different route).
  
  
  Other than that I'd agree that HA solutions - of the pacemaker
  & mirrored block devices sort - tend to make things less
  reliable instead of more.
  
  
  Tina
  
  
  On 13/04/2023 16:03, Brian Andrus wrote:
  
I think you mean both slurmctld servers are pointing at the one slurmdbd
server.

Ole is right about the usefulness of HA, especially on slurmdbd, as slurm
will cache the writes to the database if it is down.

To do what you want, you need to look at configuring your database to be
HA. That is a different topic and would be dictated by what database setup
you are using. Understand that the backend database is a tool used by
slurm and not part of slurm. So any HA in that area needs to be done by
the database.

Once that is done, merely have 2 separate slurmdbd servers, each pointing
at the HA database. One would be primary and the other a failover
(AccountingStorageBackupHost). Although, technically, they would both be
able to be active at the same time.
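
On the slurmctld side, that failover pair would look roughly like this in
slurm.conf (hypothetical hostnames):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1          # primary slurmdbd
AccountingStorageBackupHost=dbd2    # failover slurmdbd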


Brian Andrus


On 4/13/2023 2:49 AM, Shaghuf Rahman wrote:

Hi,
  
  
  I am setting up Slurmdb in my system and I need some inputs
  
  
  My current setup is like
  
  server1 : 192.168.123.12(slurmctld)
  
  server2: 192.168.123.13(Slurmctld)
  
  server3: 192.168.123.14(Slurmdbd) which is pointing to both
  Server1 and Server2.
  
  database: MySQL
  
  
  I have 1 more server named as server 4: 192.168.123.15 which I
  need to make it as a secondary database server. I want to
  configure this server4 which will sync the database and make
  it either Active-Active slurmdbd or Active-Passive.
  
  
  Could anyone please help me with the *steps* how to configure
  and also how am i going to *sync* my *database* on both the
  servers simultaneously.
  
  
  Thanks & Regards,
  
  Shaghuf Rahman
  
  

  
  

  




Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-12 Thread Daniel Letai
MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=42 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01


  
  -- 
  

  Cristóbal A. Navarro

  

  

-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




Re: [slurm-users] unused job data fields?

2022-10-04 Thread Daniel Letai

  
  
If you're looking for a free text field, I would posit that the "comment"
field, supplied by the '--comment' flag of srun/sbatch and viewed via the
comment field of sacct, is what you're looking for.
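
For example (a sketch; note that, depending on the Slurm version, storing
the comment in the database may need to be enabled in slurm.conf, e.g. via
AccountingStoreFlags=job_comment on recent releases):

sbatch --comment="project=alpha;ticket=1234" job.sh
sacct -j <jobid> --format=JobID,JobName,Comment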

On 03/10/2022 12:25:37, z1...@arcor.de
  wrote:


  Hello,

are there additional job data fields in slurm besides the job name which
can be used for additional information?

The information should not be used by slurm, only included in the
database for external evaluation.

Thanks
Mike




-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




Re: [slurm-users] srun using infiniband

2022-09-03 Thread Daniel Letai

  
  
Hello Anne,


On 01/09/2022 02:01:53, Anne Hammond
  wrote:


  
  We have a 
  CentOS 8.5 cluster 
  slurm 20.11
  Mellanox ConnectX 6 HDR IB and Mellanox 32 port switch


Our application is not scaling.  I discovered the process
  communications are going over ethernet, not ib.  I used the
  ifconfig count for the eno2 (ethernet) and ib0 (infiniband)
  interfaces at end of a job, and subtracted the count at the
  beginning.   We are using sbatch and
srun {application}


If I interactively login to a node and use the command
mpiexec -iface ib0 -n 32 -machinefile machinefile
  {application}
  

Is your application using IPoIB or RDMA?

  


where machinefile contains 32 lines with the ib hostname:
ne08-ib
ne08-ib
...
ne09-ib
ne09-ib


the application runs over ib and scales.  


/etc/slurm/slurm.conf uses the ethernet interface for
  administrative communications and allocation:



  NodeName=ne[01-09]
  CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1
  State=UNKNOWN
  

  

  PartitionName=neon-noSMT
  Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00
  DefaultTime=4:00:00 State=UP OverSubscribe=YES
  

  I've
  read this is the recommended configuration.



I looked for srun parameters that would instruct srun to
  run over the ib interface when the job is run through the
  slurm queue.  


I found the --network parameter:

  srun
  --network=DEVNAME=mlx5_ib,DEVTYPE=IB

  

What is the output of 

srun --mpi=list ?
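
(As an aside, a sketch of how IB is usually selected when OpenMPI is built
with UCX and launched through Slurm with PMIx support; the device name is
an example only, check 'ucx_info -d' for yours:)

# in the sbatch script
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_0:1
srun --mpi=pmix ./application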


  

  

  but
  there is not much documentation on this and I haven't been
  able to run a job yet.
  

  Is this
  the way we should be directing srun to run the executable
  over infiniband?
  

  Thanks
  in advance,
  Anne
  Hammond





  

-- 
Regards,
--Dani_L.
  




Re: [slurm-users] do oversubscription with algorithm other than least-loaded?

2022-03-03 Thread Daniel Letai

  
  
I could be missing something here, but if you are referring to
SelectTypeParameters=CR_LLN, you could just try CR_Pack_Nodes.
https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Pack_Nodes



If you want it on a per-partition configuration, I'm not sure
  that's possible, you might need to set a distribution (-m) in your
  job submit script/wrapper (E.g., -m block:*:*,pack)
https://slurm.schedmd.com/sbatch.html#OPT_distribution
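
A sketch of the cluster-wide setting (assuming a cons_tres/cons_res
selector is already in use; merge with your existing SelectTypeParameters):

# slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes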



If you're referring to something else entirely, could you
  elaborate on the least-loaded configuration in your setup?



  
On 24/02/2022 23:35:30, Herc
  Silverstein wrote:


  
  Hi,
  We would like to do over-subscription on a cluster that's
running in the cloud.  The cluster dynamically spins up and down
cpu nodes as needed.  What we see is that the least-loaded
algorithm causes the maximum number of nodes specified in the
partition to be spun up and each loaded with N jobs for the N
cpu's in a node before it "doubles back" and starts
over-subscribing.
  What we actually want is for the minimum number of
nodes to be used and for it to fully load (to the limit of the
oversubscription setting) one node before starting up another. 
That is, we really want a "most-loaded" algorithm.  This would
allow us to reduce the number of nodes we need to run and reduce
costs.
  Is there a way to get this behavior somehow?
  Herc
  
  
  
  
    
-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




Re: [slurm-users] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?

2021-02-18 Thread Daniel Letai

  
  
I don't have access to a cluster right now so can't test this, but
possibly tres_alloc


squeue -O JobID,Partition,Name,tres_alloc,NodeList -j <jobid>


might give some more info.




On 04/02/2021 17:01, Thomas Zeiser
  wrote:


  Dear All,

we are running Slurm-20.02.6 and using
"SelectType=select/cons_tres" with
"SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
and "ProctrackType=proctrack/cgroup". Nodes can be shared between
multiple jobs with the partition defaults "ExclusiveUser=no
OverSubscribe=No"

For monitoring purpose, we'd like to know on the ControlMachine
which cores of a batch node are assigned to a specific job. Is
there any way (except looking on each batch node itself into
/sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or
GPU IDs?

E.g. from Torque we are used that qstat tells the assigned cores.
However, with Slurm, even "scontrol show job JOBID" does not seem
to have any information in that direction.

Knowing which GPU is allocated (in case of gres/gpu) of course
also would be interested to know on the ControlMachine.


Here's the output we get from scontrol show job; it has the node
name and the number of cores assigned but not the "core IDs" (e.g.
32-63)

JobId=886 JobName=br-14
   UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
   Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
   AccrueTime=2021-02-04T07:26:51
   StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
   PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
   Partition=a100 AllocNode:Sid=gpu001:1743663
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu001
   BatchHost=gpu001
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=12M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/var/tmp/slurmd_spool/job00877/slurm_script
   WorkDir=/home/hpc114/run2
   StdErr=/home/hpc114//run2/br-14.o886
   StdIn=/dev/null
   StdOut=/home/hpc114/run2/br-14.o886
   Power=
   TresPerNode=gpu:a100:1
   MailUser=(null) MailType=NONE

Also "scontrol show node" is not helpful

NodeName=gpu001 Arch=x86_64 CoresPerSocket=64 
   CPUAlloc=128 CPUTot=128 CPULoad=4.09
   AvailableFeatures=hwperf
   ActiveFeatures=hwperf
   Gres=gpu:a100:4(S:0-1)
   NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
   OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 
   RealMemory=51 AllocMem=48 FreeMem=495922 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A
   Partitions=a100 
   BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
   CfgTRES=cpu=128,mem=51M,billing=128,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=cpu=128,mem=48M,gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

There is no information on the currently running four jobs
included; neither which share of the allocated node is assigned to
the individual jobs.


I'd like to see somehow that job 886 got cores 32-63,160-191 assigned, as
seen on the node from /sys/fs/cgroup

%cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
32-63,160-191


Thanks for any ideas!

Thomas Zeiser



  




Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

2020-10-21 Thread Daniel Letai

  
  
Just a quick addendum - rsmi_dev_drm_render_minor_get
used in the plugin references the ROCM-SMI lib from https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/2e8dc4f2a91bfa7661f4ea289736b12153ce23c2/src/rocm_smi.cc#L1689
  So the library (as an .so file) should be installed for this to
  work.




On 20/10/2020 23:58, Mgr. Martin Pecka
  wrote:

Pinging
  this topic again. Nobody has an idea how to define multiple files
  to be treated as a single gres?
  
  
  Thank you for help,
  
  
  Martin Pecka
  
  
  Dne 4.9.2020 v 21:29 Martin Pecka napsal(a):
  
  
  Hello, we want to use EGL backend for
accessing OpenGL without the need for Xorg. This approach
requires access to devices /dev/dri/card* and /dev/dri/renderD*
. Is there a way to give access to these devices along with
/dev/nvidia* which we use for CUDA? Ideally as a single generic
resource that would give permissions to all three files at once.


Thank you for any tips.


  
  
  

  




Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

2020-10-21 Thread Daniel Letai

  
  
Take a look at https://github.com/SchedMD/slurm/search?q=dri%2F
If the ROCM-SMI API is present, using AutoDetect=rsmi in
  gres.conf might be enough, if I'm reading this right.


Of course, this assumes the cards in question are AMD and not
  NVIDIA.
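

A sketch of what that would mean in practice (assuming AMD GPUs, a Slurm
build linked against rocm_smi_lib, and a hypothetical node name):

# gres.conf
AutoDetect=rsmi

# slurm.conf still declares the gres itself, e.g.:
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2 ...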


On 20/10/2020 23:58, Mgr. Martin Pecka
  wrote:

Pinging
  this topic again. Nobody has an idea how to define multiple files
  to be treated as a single gres?
  
  
  Thank you for help,
  
  
  Martin Pecka
  
  
  Dne 4.9.2020 v 21:29 Martin Pecka napsal(a):
  
  
  Hello, we want to use EGL backend for
accessing OpenGL without the need for Xorg. This approach
requires access to devices /dev/dri/card* and /dev/dri/renderD*
. Is there a way to give access to these devices along with
/dev/nvidia* which we use for CUDA? Ideally as a single generic
resource that would give permissions to all three files at once.


Thank you for any tips.


  
  
  

  




Re: [slurm-users] how to restrict jobs

2020-05-07 Thread Daniel Letai

  
  
On 06/05/2020 20:44, Mark Hahn wrote:

  Is there no way to set or define a custom
variable like at node level and

  
  
  you could use a per-node Feature for this, but a partition would
  also work.
  
  

A bit of an ugly hack, but you could use QoS (requires accounting) to
enforce this:
1. Create a QoS (using sacctmgr) with GrpTRES=Node=4
2. Create a new partition identical to the current one, but with the new
QoS
3. Instruct users to submit to the new partition any job requiring the
license.
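
Roughly (untested, hypothetical names):

sacctmgr add qos name=lic4 GrpTRES=node=4
# slurm.conf - a clone of the existing partition, tied to the new QoS
PartitionName=licensed Nodes=node[01-32] QOS=lic4 State=UP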


This will not solve the issue of fragmentation due to
  non-licensed jobs - for that you should enable a packing scheduler
  like 

SelectTypeParameters=CR_Pack_Nodes
  (https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Pack_Nodes).

  




Re: [slurm-users] not allocating jobs even resources are free

2020-05-03 Thread Daniel Letai

  
  

On 29/04/2020 12:00:13, navin
  srivastava wrote:


  
  Thanks Daniel.
 
All jobs went into run state so unable to provide the
  details but definitely will reach out later if we see similar
  issue.


I am more interested to understand FIFO with Fair Tree. It will be good if
anybody can provide some insight on this combination, and also how the
behaviour will change if we enable backfilling here.


What is the role of the Fair Tree here?

  

Fair tree is the algorithm used to calculate the interim
  priority, before applying weight, but I think after the halflife
  decay.


To make it simple - fifo without fairshare would assign priority
  based only on submission time. With faishare, that naive priority
  is adjusted based on prior usage by the applicable entities
  (users/departments - accounts).


Backfill will let you utilize your resources better, since it
  will allow "inserting" low priority jobs before higher priority
  jobs, provided all jobs have defined wall times, and any inserted
  job doesn't affect in any way the start time of a higher priority
  job, thus allowing utilization of "holes" when the scheduler waits
  for resources to free up, in order to insert some large job.


Suppose the system is at 60% utilization of cores, and the next fifo job
requires 42% - it will wait until 2% are free so it can begin, meanwhile
not allowing any job to start, even if it would take only 30% of the
resources (which are currently free) and would finish before the 2% are
free anyway.
Backfill would allow such a job to start, as long as its wall time ensures
it would finish before the 42% job would've started.


Fairtree in either case (fifo or backfill) calculates the
  priority for each job the same - if the account had used more
  resources recently (the halflife decay factor) it would get a
  lower priority even though it was submitted earlier than a job
  from an account that didn't use any resources recently.


As can be expected, backfill has to loop over all jobs in the queue, in
order to see if any job can fit out of order. In very busy/active systems,
that can lead to poor response times, unless tuned correctly in slurm.conf
- look at SchedulerParameters, all params starting with bf_ and in
particular bf_max_job_test=, bf_max_time= and bf_continue (but bf_window=
can also have some impact if set too high).

see the man page at
https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
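
For illustration only - the values below are arbitrary starting points,
not recommendations:

# slurm.conf
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=500,bf_continue,bf_interval=60,bf_window=2880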


  


PriorityType=priority/multifactor

PriorityDecayHalfLife=2
  PriorityUsageResetPeriod=DAILY
  PriorityWeightFairshare=50
  PriorityFlags=FAIR_TREE



Regards

Navin.




  
  
  
On Mon, Apr 27, 2020 at 9:37
  PM Daniel Letai <d...@letai.org.il> wrote:


  
Are you sure there are enough resources available? The
  node is in mixed state, so it's configured for both
  partitions - it's possible that earlier lower priority
  jobs are already running thus blocking the later jobs,
  especially since it's fifo.


It would really help if you pasted the results of:
squeue
sinfo


As well as the exact sbatch line, so we can see how many
  resources per node are requested.



On 26/04/2020 12:00:06, navin srivastava wrote:


  Thanks Brian,


As suggested i gone through document and what i
  understood  that the fair tree leads to the Fairshare
  mechanism and based on that the job should be
  scheduling.


so it mean job scheduling will be based on FIFO but
  priority will be decided on the Fairshare. i am not
  sure if both conflicts here.if i see the normal jobs
  priority is lower than the GPUsmall priority. so
  resources are available with gpusmall partition then
  it should go. there is no job pend due to gpu
  resources. the gpu resources itself not asked with the
  job.


is there any article where i can see how the
  fairshare works an

Re: [slurm-users] not allocating jobs even resources are free

2020-04-27 Thread Daniel Letai
   ActiveFeatures=K2200
   Gres=gpu:2
   NodeAddr=node18 NodeHostName=node18 Version=17.11
   OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018 (0b375e4)
   RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=GPUsmall,pm_shared
   BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
   CfgTRES=cpu=36,mem=1M,billing=36
   AllocTRES=cpu=6
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


node19:-

NodeName=node19 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
   AvailableFeatures=K2200
   ActiveFeatures=K2200
   Gres=gpu:2
   NodeAddr=node19 NodeHostName=node19 Version=17.11
   OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
   RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=GPUsmall,pm_shared
   BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
   CfgTRES=cpu=36,mem=1M,billing=36
   AllocTRES=cpu=16
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s



could you please help me to understand what could be the reason?


-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




[slurm-users] Assigning gpu freq values manually

2020-04-21 Thread Daniel Letai

  
  
Is it possible to assign gpu freq values without use of
  specialized plugin?


Currently gpu freqs can be assigned by use of
AutoDetect=nvml
Or
AutoDetect=rsmi


In gres.conf, but I can't find any reference to assigning freq
  values manually via direct input in gres.conf.
Is it possible to populate gpu freqs in gres.conf, or must I use
  autodetect if I want such functionality?


Thanks in advance,
--Dani_L.

  




Re: [slurm-users] Alternative to munge for use with slurm?

2020-04-18 Thread Daniel Letai

  
  
in v20.02 you can use jwt, as per
  https://slurm.schedmd.com/jwt.html


The only issue is getting libjwt for most rpm-based distros.
The current libjwt 'configure; make dist-all' doesn't work.
I had to cd into dist and run 'make rpm' to create the spec file, then
rpmbuild -ba after placing the tar.gz file in the SOURCES dir of the
rpmbuild tree.


Possibly just installing libjwt manually is easier for image
  based clusters.
HTH.
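
As a rough sketch of the two pieces involved (the libjwt steps paraphrase
the above and may differ between libjwt releases; the slurm.conf line
follows the jwt.html page):

# build libjwt rpms (approximate, as described above)
cd libjwt/dist && make rpm                  # generates the spec file
cp ../libjwt-*.tar.gz ~/rpmbuild/SOURCES/
rpmbuild -ba libjwt.spec

# then enable the alternative auth in slurm.conf / slurmdbd.conf
AuthAltTypes=auth/jwt
# and place a jwt_hs256.key under StateSaveLocation as per the jwt.html page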





On 17/04/2020 22:42, Dean Schulze
  wrote:


  
  Is there an alternative to munge when running
slurm?  Munge issues are a common problem in slurm, and munge
doesn't give any useful information when a problem occurs.  An
alternative that at least gave some useful information when a
problem occurs would be a big improvement.


Thanks.
  

  




Re: [slurm-users] Need to execute a binary with arguments on a node

2019-12-18 Thread Daniel Letai

  
  
Use sbatch's wrapper command:


sbatch --wrap='ls -l /tmp'


Note that the output will be in the directory on the execution node, by
default with the name slurm-<jobid>.out


On 12/18/19 8:40 PM, William Brown
  wrote:


  
  Sometimes the way is to make the shell the binary,
e.g. bash -c 'ls -lsh'




  
  
  
On Wed, 18 Dec 2019, 18:25
  Dean Schulze, 
  wrote:


  This is a rookie question.  I can use the srun
command to execute a simple command like "ls" or "hostname"
on a node.   But I haven't found a way to add arguments like
"ls -lart".


What I need to do is execute a binary that takes
  arguments (like "a.out arg1 arg2 arg3) that exists on the
  node.


Is srun the right way to do this or do I need a script
  or something else?


Thanks.
  

  

  




Re: [slurm-users] Limiting the number of CPU

2019-11-14 Thread Daniel Letai

  
  
3 possible issues, inline below


On 14/11/2019 14:58:29, Sukman wrote:


  Hi Brian,

thank you for the suggestion.

It appears that my node is in drain state.
I rebooted the node and everything became fine.

However, the QOS still cannot be applied properly.
Do you have any opinion regarding this issue?


$ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
  Name   Priority MaxWall MaxTRESPU
-- -- --- -
normal_co+ 1000:01:00  cpu=2,mem=1G


when I run the following script:

#!/bin/bash
#SBATCH --job-name=hostname
#sbatch --time=00:50
#sbatch --mem=1M

I believe those should be uppercase #SBATCH

  
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname


It turns out that the QOSMaxMemoryPerUser has been met

$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
88  defq hostname   sukman PD   0:00  1 (QOSMaxMemoryPerUser)


$ scontrol show job 88
JobId=88 JobName=hostname
   UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
   Priority=4294901753 Nice=0 Account=user QOS=normal_compute
   JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-11-14T19:55:50
   Partition=defq AllocNode:Sid=itbhn02:51072
   ReqNodeList=cn110 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0

MinMemoryNode seems to require more than FreeMem in Node below

  
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/sukman/script/test_hostname.sh
   WorkDir=/home/sukman/script
   StdErr=/home/sukman/script/slurm-88.out
   StdIn=/dev/null
   StdOut=/home/sukman/script/slurm-88.out
   Power=


$ scontrol show node cn110
NodeName=cn110 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn110 NodeHostName=cn110 Version=17.11
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1

This would appear to be wrong - 56 sockets?
How did you configure the node in slurm.conf?
FreeMem lower than MinMemoryNode - not sure if that is relevant.


  
   State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
   CfgTRES=cpu=56,mem=257758M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


---

Sukman
ITB Indonesia




- Original Message -
From: "Brian Andrus" 
To: slurm-users@lists.schedmd.com
Sent: Tuesday, November 12, 2019 10:41:42 AM
Subject: Re: [slurm-users] Limiting the number of CPU

You are trying to specifically run on node cn110, so you may want to 
check that out with sinfo

A quick "sinfo -R" can list any down machines and the reasons.

Brian Andrus



    
-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Daniel Letai

  
  

On 11/12/19 9:34 AM, Ole Holm Nielsen
  wrote:

On
  11/11/19 10:14 PM, Daniel Letai wrote:
  
  Why would you need galera-4 as a build
require?

  
  
  This is the MariaDB recommendation in
  https://mariadb.com/kb/en/library/yum/, see the section
  "Installing MariaDB Packages with YUM".  I have no clue why this
  would be needed.
  

Yes, it's required for mariadb multimaster cluster. This has
  nothing to do with the mariadb api required for linking against
  mariadb libs.


You don't even need the mariadb-server pkg for build purposes -
  it's only required for deployment of slurmdbd.
On a build machine, you should only require the client section:
https://mariadb.com/kb/en/library/yum/#installing-mariadb-clients-and-client-libraries-with-yum,
  as well as the devel pkg.
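
So on a dedicated build host something like this should be enough (package
names as used by the MariaDB.org 10.4 repo on CentOS 7; MariaDB-shared is
the piece William found to be missing):

yum install MariaDB-client MariaDB-devel MariaDB-shared
rpmbuild -ta slurm-19.05.3-2.tar.bz2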


  
  /Ole
  
  
  
  If it's required by any of the mariadb
packages, it'll get pulled automatically. If not, you don't need
it on the build system.



On 11/11/19 10:56 PM, Ole Holm Nielsen wrote:

Hi William,
  
  
  Interesting experiences with MariaDB 10.4!  I tried to collect
  the instructions from the MariaDB page, but I'm unsure about
  how to get the galera-4 RPM.
  
  
  Could you kindly review and correct my updated instructions?
  
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms
  
  
  That said, what are the main reasons for installing MariaDB 10
  in stead of the 5.5 delivered by RedHat?  I'm not sure how
  well SchedMD has tested MariaDB 10 with Slurm?
  
  
  /Ole
  
  
  
  On 11-11-2019 21:23, William Brown wrote:
  
  I have in fact found the answer by
looking harder.


The config.log clearly showed that the build of the test
MySQL program failed, which is why it was set to be
excluded.


It failed to link against '-lmariadb'.  It turns out that
library is no longer in MariaDB or MariaDB-devel, it is
separately packaged in MariaDB-shared.  That may of course
be because I have built MariaDB 10.4 from the mariadb.org
site, because CentOS 7 only ships with the extremely old
version 5.5.


Once I installed the missing package it built the RPMs just fine.  However
it would be easier to use it linked to static MariaDB libraries, as I now
have to install MariaDB-shared on every server that will run slurmd, i.e.
all compute nodes.  I expect that if I looked harder at the build options
there may be a way to do this, perhaps with linker flags.


For now, I can progress.


Thanks


William


-Original Message-

From: slurm-users
 On Behalf Of
Ole Holm Nielsen

Sent: 11 November 2019 20:02

To: slurm-users@lists.schedmd.com

Subject: Re: [slurm-users] RPM build error -
accounting_storage_mysql.so


Hi,


Maybe my Slurm Wiki can help you build SLurm on CentOS/RHEL
7? See
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms


Note in particular:

Important: Install the MariaDB (a
  replacement for MySQL) packages before you build Slurm
  RPMs (otherwise some libraries will be missing):
  
  
  yum install mariadb-server mariadb-devel
  


/Ole



On 11-11-2019 15:22, William Brown wrote:

Fabio
  
  
  Did you ever resolve the problem building
  accounting_storage_mysql.so?
  
  
  I have the exact same problem with CentOS 7.6, building
  Slurm 19.05.03.
  
  My command:
  
  
  rpmbuild -ta slurm-19.05.3-2.tar.bz2 | tee
  /var/tmp/slurm-build.log
  
  
  The directory with the plugin source is all there:
  
/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Daniel Letai

  
  
Why would you need galera-4 as a build require?


If it's required by any of the mariadb packages, it'll get pulled
  automatically. If not, you don't need it on the build system.


On 11/11/19 10:56 PM, Ole Holm Nielsen
  wrote:

Hi
  William,
  
  
  Interesting experiences with MariaDB 10.4!  I tried to collect the
  instructions from the MariaDB page, but I'm unsure about how to
  get the galera-4 RPM.
  
  
  Could you kindly review and correct my updated instructions?
  
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms
  
  
  That said, what are the main reasons for installing MariaDB 10 in
  stead of the 5.5 delivered by RedHat?  I'm not sure how well
  SchedMD has tested MariaDB 10 with Slurm?
  
  
  /Ole
  
  
  
  On 11-11-2019 21:23, William Brown wrote:
  
  I have in fact found the answer by looking
harder.


The config.log clearly showed that the build of the test MySQL
program failed, which is why it was set to be excluded.


It failed to link against '-lmariadb'.  It turns out that
library is no longer in MariaDB or MariaDB-devel, it is
separately packaged in MariaDB-shared.  That may of course be
because I have built MariaDB 10.4 from the mariadb.org site,
because CentOS 7 only ships with the extremely old version 5.5.


Once I installed the missing package it built the RPMs just fine.  However
it would be easier to use it linked to static MariaDB libraries, as I now
have to install MariaDB-shared on every server that will run slurmd, i.e.
all compute nodes.  I expect that if I looked harder at the build options
there may be a way to do this, perhaps with linker flags.


For now, I can progress.


Thanks


William


-Original Message-

From: slurm-users 
On Behalf Of Ole Holm Nielsen

Sent: 11 November 2019 20:02

To: slurm-users@lists.schedmd.com

Subject: Re: [slurm-users] RPM build error -
accounting_storage_mysql.so


Hi,


Maybe my Slurm Wiki can help you build SLurm on CentOS/RHEL 7? 
See
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms


Note in particular:

Important: Install the MariaDB (a
  replacement for MySQL) packages before you build Slurm RPMs
  (otherwise some libraries will be missing):
  
  
  yum install mariadb-server mariadb-devel
  


/Ole



On 11-11-2019 15:22, William Brown wrote:

Fabio
  
  
  Did you ever resolve the problem building
  accounting_storage_mysql.so?
  
  
  I have the exact same problem with CentOS 7.6, building Slurm
  19.05.03.
  
  My command:
  
  
  rpmbuild -ta slurm-19.05.3-2.tar.bz2 | tee
  /var/tmp/slurm-build.log
  
  
  The directory with the plugin source is all there:
  
/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accountin
  
  g_storage/mysql, with a Makefile that is the same date/time as
  the
  
  other accounting_storage alternatives.
  
  
  In the log I can see:
  
  
  checking for mysql_config... /usr/bin/mysql_config
  
  
  Looking at the process of building the RPMs it looks as if it
  has
  
  skipped trying to create the missing library file, but then
  expects to
  
  find it in the RPM.
  
  
  This is what I see when it is building, it builds the
  
  accounting_storage .so files for _fileext, _none and
  _slurmdbd, but
  
  not for _mysql.  I do have MariaDB-devel 10.4.10 installed
  
  
  .
  
  
  .
  
  
  Making all in mysql
  
  
  make[5]: Entering directory
  
`/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accounting_storage/mysql'
  
  
  make[5]: Nothing to be done for `all'.
  
  
  make[5]: Leaving directory
  
`/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accounting_storage/mysql'
  
  
  .
  
  
  .
  
  
  Making 

Re: [slurm-users] How to find core count per job per node

2019-10-21 Thread Daniel Letai

  
  
I can't test this right now, but possibly


squeue -j  -O 'name,nodes,tres-per-node,sct'


From squeue man page https://slurm.schedmd.com/squeue.html:
sct
      Number of requested sockets, cores, and threads (S:C:T) per
  node for the job. When (S:C:T) has not been set, "*" is displayed.
  (Valid for jobs only)
tres-per-node
      Print the trackable resources per node requested by the job or
  job step.


Again, can't test just now, so no idea if applicable to your use
  case.



On 10/18/19 9:51 PM, Mark Hahn wrote:


  $ scontrol --details show job 1653838

JobId=1653838 JobName=v1.20

  
  ...
  
      Nodes=r00g01 CPU_IDs=31-35 Mem=5120
GRES_IDX=

    Nodes=r00n16 CPU_IDs=34-35 Mem=2048 GRES_IDX=

    Nodes=r00n20 CPU_IDs=12-17,30-35 Mem=12288 GRES_IDX=

    Nodes=r01n16 CPU_IDs=15 Mem=1024 GRES_IDX=

  
  
  thanks for sharing this!  we've had a lot of discussion
  
  on how to collect this information as well, even whether it would
  be worth doing in a prolog script...
  
  
  regards,
  
  --
  
  Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca |
  http://www.sharcnet.ca
  
    | McMaster RHPCS    | h...@mcmaster.ca | 905 525 9140
  x24687
  
    | Compute/Calcul Canada    |
  http://www.computecanada.ca
  
  

  




[slurm-users] Using swap for gang mode suspended jobs only

2019-10-13 Thread Daniel Letai

  
  
Hi,


I'd like to allow job suspension in my cluster, without the
  "penalty" of RAM utilization. The jobs are sometimes very big and
  can require ~100GB mem on each node. Suspending such a job would
  usually mean almost nothing else can run on the same node, except
  for very small memory jobs.
Currently the solution is requeue preemption  with or without
  checkpointing.
I don't want to use swap for running jobs, ever - I'd rather get
  OOM killed than use swap while the job is running.



Is there a way to tell Slurm to allocate swap and use it only for
  suspending, to allow preemption without terminating the jobs?


The nodes have  ~TB of disk space each, and most jobs never
  utilize any of that (relying on shared storage instead), so local
  disk space is usually not a concern.


Using swap to store suspended jobs, while slow to freeze and thaw, seems to
me to be a better localized solution than checkpointing and requeuing,
allowing the job to resume "immediately" (sans disk io times) after the
high priority job finishes, but if I'm mistaken, please enlighten me.


I was wondering if simply setting a large swap in linux, while setting
AllowedSwapSpace=0 in cgroup.conf, would work, but I suspect the following:
1. Even suspended, the job still remains within its cgroup limits, and
2. Which process gets swapped is non-deterministic from my point of view -
I'm not sure the kernel will swap out the suspended job rather than the
new job, at least in its early stages.

Thanks in advance,
--Dani_L.

  




Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-14 Thread Daniel Letai

  
  
Make tmpfs a TRES, and have NHC update that as in:
scontrol update nodename=... gres=tmpfree:$(stat -f /tmp -c "%f*%S" | bc)
Replace /tmp with your tmpfs mount.


You'll have to define that TRES in slurm.conf and gres.conf as usual (start
with count=1 and have NHC update it)


Do note that this is a simplistic example - updating like that
  will overwrite any other gres defined for the node, so you might
  wish to create an 'updategres' function that first reads in the
  node's current gres, only modifies the count of the fields you
  wish to modify, and returns a complete gres string.
 


In sbatch do:
sbatch --gres=tmpfree:20G
and based on the last update from NHC, Slurm should only consider nodes
with enough tmpfree for the job.
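
The static declarations might look roughly like this (a sketch; names are
examples only):

# slurm.conf
GresTypes=tmpfree
NodeName=node01 Gres=tmpfree:1 ...

# gres.conf on each node (NHC then overwrites the count at runtime)
Name=tmpfree Count=1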



HTH
--Dani_L.



On 9/10/19 10:15 PM, Ole Holm Nielsen
  wrote:

Hi
  Michael,
  
  
  Thanks for the suggestion!  We have user requests for certain
  types of jobs (quantum chemistry) that require fairly large local
  scratch space. Our jobs normally do not have this requirement.  So
  unfortunately the per-node NHC check doesn't seem to do the
  trick.  (We already have an NHC check "check_fs_used /scratch
  90%").
  
  
  Best regards,
  
  Ole
  
  
  
  On 10-09-2019 20:41, Michael Jennings wrote:
  
  On Monday, 02 September 2019, at 20:02:57
(+0200),

Ole Holm Nielsen wrote:


We have some users requesting that a
  certain minimum size of the
  
  *Available* (i.e., free) TmpFS disk space should be present on
  nodes
  
  before a job should be considered by the scheduler for a set
  of
  
  nodes.
  
  
  I believe that the "sbatch --tmp=size" option merely refers to
  the
  
  TmpFS file system *Size* as configured in slurm.conf, and this
  is
  
  *not* what users need.
  
  
  For example, a job might require 50 GB of *Available disk
  space* on
  
  the TmpFS file system, which may however have only 20 GB out
  of 100
  
  GB *Available* as shown by the df command, the rest having
  been
  
  consumed by other jobs (present or past).
  
  
  However, when we do "scontrol show node ",
  only the TmpFS
  
  file system *Size* is displayed as a "TmpDisk" number, but not
  the
  
  *Available* number.
  
  
  Question: How can we get slurmd to report back to the
  scheduler the
  
  amount of *Available* disk space?  And how can users specify
  the
  
  minimum *Available* disk space required by their jobs
  submitted by
  
  "sbatch"?
  
  
  If this is not feasible, are there other techniques that
  achieve the
  
  same goal?  We're currently still at Slurm 18.08.
  


Hi, Ole!


I'm assuming you are wanting a per-job resolution on this rather
than

per-node?  If per-node is good enough, you can of course use NHC
to

check this, e.g.:

   * || check_fs_free /tmp 50GB


That doesn't work per-job, though, obviously.  Something that
might

work, however, as a temporary work-around for this might be to
have

the user run a single NHC command, like this:

   srun --prolog='nhc -e "check_fs_free /tmp 50GB"'


There might be some tweaks/caveats to this since NHC normally
runs as

root, but just an idea  :-)  An even crazier idea would be
to set

NHC_LOAD_ONLY=1 in the environment, source /usr/sbin/nhc, and
then

execute the shell function `check_fs_free` directly!  :-D

  
  

  




Re: [slurm-users] Different Memory Nodes

2019-09-08 Thread Daniel Letai

  
  
Just a quick FYI - using gang mode preemption would mean the
  available memory would be lower, so if the preempting job requires
  the entire node memory, this will be an issue.




On 9/4/19 8:51 PM, Tina Fora wrote:


  Thanks Brian! I'll take a look at weights.

I want others to be able to use them and take advantage of the large
memory when free. We have a preemptable partiton below that works great.

PartitionName=scavenge
   AllowGroups=ALL AllowAccounts=ALL AllowQos=scavenge,abc
   AllocNodes=ALL Default=NO QoS=scavenge
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=4-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=...a[01-05,11-15],b[01-10],c[01-20]
   PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO
OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
...



  
(Added a subject)

Tina,

If you want group xxx to be the only ones to access them, you need to
either put them in their own partition or add info to the node
definitions to only allow certain users/groups.

If you want them to be used last, so they are available until all the
other nodes are in use, you can add weights to the node definitions.
This would mean users could request >192GB memory, so it has to go to
one of the updated nodes, which will only be taken if the other nodes
are used up, or a job needing > 192GB is running on them.

Brian Andrus

On 9/4/2019 9:53 AM, Tina Fora wrote:


  Hi,

I'm adding a bunch of memory on two of our nodes that are part of a
blade
chassis. So two computes will be upgraded to 1TB RAM and the rest have
192GB. All of the nodes belog to several partitons and can be used by
our
paid members given the partition below. I'm looking for ways to figure
out
how only group xxx (the ones paying for the memory upgrade) can get to
them.

PartitionName=member
AllowGroups=ALL AllowAccounts=ALL AllowQos=xxx,yyy,zzz
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
MaxNodes=1 MaxTime=5-00:00:00 MinNodes=1 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=a[01-05,11-15],b[01-20]
PriorityJobFactor=500 PriorityTier=500 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP ...

Say compute a01 and a02 will have 1TB memory and I want group xxx to be
able to get to them quickly using the partition above.

Thanks,
Tina







  
  




  




Re: [slurm-users] Usage splitting

2019-09-01 Thread Daniel Letai

  
  
Wouldn't fairshare with a 90/10 split achieve this?


This will require accounting is set in your cluster, with the
  following parameters:



In slurm.conf set
AccountingStorageEnforce=associations # And possibly
  '...,limits,qos,safe' as required - so perhaps just use '=all'

PriorityType=priority/multifactor # Required by other parameters

PriorityDecayHalfLife=14-0 # Every 14 days (two weeks) reset
  usage
PriorityWeightFairshare=1 # With all other weights defaulting to
  0, ensures only fairshare influences priority.
TRESBillingWeights="Node=1" # According to docs, "Node" should be
  a TRES. I've never tested this.



And from cmdline add the fair share split via:



sacctmgr create account name=A fairshare=10
sacctmgr create account name=B fairshare=90


Then simply associate users with each account, and use something like
'sbatch --account=A ...' to charge jobs to accounts.
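
For example (hypothetical user names):

sacctmgr add user name=alice account=A
sacctmgr add user name=bob account=B
sbatch --account=A job.sh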


This won't do exactly what you want - it might allow 'A' to utilize more
than 10% if the cluster is underutilized.


I'm not aware of a scheme where 'A' might be preempted only if it has been
awarded more than its fair share due to underutilization.
If the 10% hard limit is a concern, it might be worth to
  investigate reservations, and allocate to 'A' only from a 10%
  reservation, while somehow allowing 'B' to utilize that
  reservation too if required.





On 30/08/2019 14:14:16, Stefan
  Staeglich wrote:


  Hi,

we have some compute nodes paid by different project owners. 10% are owned by 
project A and 90% are owned by project B.

We want to implement the following policy such that every certain time period 
(e.g. two weeks):
- Project A doesn't use more than 10% of the cluster in this time period
- But project B is allowed to use more than 90%

What's the best way to enforce this?

Best,
Stefan


-- 
HTH,

--Dani_L.

  




Re: [slurm-users] sacctmgr dump question - how can I dump entities other than cluster?

2019-08-13 Thread Daniel Letai

  
  
The cluster config doesn't contain qos rules definitions. It only
  contains mappings of qos to users/accounts.
Currently it is impossible to dump and edit qos rules, although
  it is possible to add and remove defined qos from/to users.


Regards,
--Dani_L.



On 8/12/19 8:44 AM, Barbara Krašovec
  wrote:


  
  Yes, afaik you can only dump the whole cluster config, not a
specific entity. If you dump the cluster config, also qos rules
are included, so you can modify the rules in the cluster config
and load it.
  
  
  If you don't want to do that,  then just use the sacctmgr
modify option.
  
  
  Cheers,
  Barbara
  
  
  On 8/5/19 12:02 PM, Daniel Letai
wrote:
  
  


The documentation clearly states:

dump <ENTITY> <File=FILENAME>
 Dump cluster data to the specified file. If the filename is not specified
it uses the clustername.cfg filename by default.

However, the only entity sacctmgr dump seems to accept is <cluster>.
Glancing over the code at https://github.com/SchedMD/slurm/blob/master/src/sacctmgr/cluster_functions.c#L1006

it doesn't seem like sacctmgr will accept anything other than
  a cluster name either.


How can I easily dump to file qos rules, in a way that would
  allow me to modify and upload new qos as required?


BTW, just noticed "archive" is not in the 'commands' section
  of sacctmgr man, but is treated as a command in later sections
  of the man page.

  

  




[slurm-users] sacctmgr dump question - how can I dump entities other than cluster?

2019-08-05 Thread Daniel Letai

  
  
The documentation clearly states:

dump <ENTITY> <File=FILENAME>
 Dump cluster data to the specified file. If the filename is not specified
it uses the clustername.cfg filename by default.

However, the only entity sacctmgr dump seems to accept is <cluster>.
Glancing over the code at
https://github.com/SchedMD/slurm/blob/master/src/sacctmgr/cluster_functions.c#L1006

it doesn't seem like sacctmgr will accept anything other than a
  cluster name either.


How can I easily dump to file qos rules, in a way that would
  allow me to modify and upload new qos as required?


BTW, just noticed "archive" is not in the 'commands' section of
  sacctmgr man, but is treated as a command in later sections of the
  man page.

  




Re: [slurm-users] Slurm configuration

2019-08-05 Thread Daniel Letai

  
  
Hi.


On 8/3/19 12:37 AM, Sistemas NLHPC
  wrote:


  
  Hi all,

Currently we have two types of nodes, one with 192GB and another
with 768GB of RAM, it is required that in nodes of 768 GB it is
not allowed to execute tasks with less than 192GB, to avoid
underutilization of resources.

This, because we have nodes that can fulfill the condition of
executing tasks with 192GB or less.

Is it possible to use some slurm configuration to solve this
problem?
  

Easiest would be to use features/constraints. In slurm.conf add
NodeName=DEFAULT RealMemory=196608 Features=192GB Weight=1

NodeName=... (list all nodes with 192GB)
NodeName=DEFAULT RealMemory=786432 Features=768GB Weight=2
NodeName=... (list all nodes with 768GB)


And to run jobs only on node with 192GB in sbatch do
sbatch -C 192GB ...


To run jobs on all nodes, simply don't add the constraint to the
  sbatch line, and due to lower weight jobs should prefer to start
  on the 192GB nodes.


  
PD: All users can submit jobs on all nodes

Thanks in advance 

Regards.

  

  




Re: [slurm-users] Unexpected MPI process distribution with the --exclusive flag

2019-07-31 Thread Daniel Letai

  
  

On 7/30/19 6:03 PM, Brian Andrus wrote:


  
  I think this may be more on how you are calling mpirun and the
mapping of processes.
  With the "--exclusive" option, the processes are given access
to all the cores on each box, so mpirun has a choice. IIRC, the
default is to pack them by slot, so fill one node, then move to
the next. Whereas you want to map by node (one process per node
cycling by node)
  From the man for mpirun (openmpi):
  
  
--map-by 
Map to the specified object, defaults to socket.
  Supported options include slot, hwthread, core, L1cache,
  L2cache, L3cache, socket, numa, board, node, sequential,
  distance, and ppr. Any object can include modifiers by adding
  a : and any combination of PE=n (bind n processing elements to
  each proc), SPAN (load balance the processes across the
  allocation), OVERSUBSCRIBE (allow more processes on a node
  than processing elements), and NOOVERSUBSCRIBE. This includes
  PPR, where the pattern would be terminated by another colon to
  separate it from the modifiers.
  
  
  
  so adding "--map-by node" would give
you what you are looking for.
  Of course, this syntax is for
Openmpi's mpirun command, so YMMV
  

If using srun (as recommended) instead of invoking mpirun directly, you can
still achieve the same functionality using exported environment variables
as per the mpirun man page, like this:
OMPI_MCA_rmaps_base_mapping_policy=node srun --export OMPI_MCA_rmaps_base_mapping_policy ...
in your sbatch script.
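
Put together, a batch script for the case above might look like this (a
sketch assuming OpenMPI with Slurm/PMI support; untested, application name
is hypothetical):

#!/bin/bash
#SBATCH -n 980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

# map one rank per node, cycling across nodes, instead of filling each node
export OMPI_MCA_rmaps_base_mapping_policy=node
srun ./my_mpi_app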


   
  Brian Andrus
  
  
  
  
  
  On 7/30/2019 5:14 AM, CB wrote:
  
  

Hi Everyone,
  
  
  I've recently discovered that when an MPI job is
submitted with the --exclusive flag, Slurm fills up each
node even if the --ntasks-per-node flag is used to set how
many MPI processes is scheduled on each node.   Without the
--exclusive flag, Slurm works fine as expected.
  
  
  Our system is running with Slurm 17.11.7.
  
  
  The following options works that each node has 16 MPI
processes until all 980 MPI processes are scheduled.with
total of 62 compute nodes.  Each of the 61 nodes has 16 MPI
processes and the last one has 4 MPI processes, which is 980
MPI processes in total.
  #SBATCH -n 980                                 
#SBATCH --ntasks-per-node=16
  
  
  
  However, if the --exclusive option is added, Slurm fills
up each node with 28 MPI processes (the compute node has 28
cores).  Interestingly, Slurm still allocates  62 compute
nodes although  only 35 nodes of them are actually used to
distribute 980 MPI processes.
  
  
  
#SBATCH -n 980                                 
  #SBATCH --ntasks-per-node=16

#SBATCH --exclusive

  
  
  
  Has anyone seen this behavior?
  
  
  Thanks,
  - Chansup

  

  




Re: [slurm-users] Can I use the manager as compute node

2019-07-30 Thread Daniel Letai

  
  
Yes, just add it to the Nodes= list of the partition.
You will have to install slurm-slurmd on it as well, and enable
  and start as on any compute node, or it will be DOWN.
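
For example (hypothetical host names and sizes):

# slurm.conf
NodeName=master CPUs=8 RealMemory=32000 State=UNKNOWN
PartitionName=extended Nodes=master,node[01-10] Default=NO State=UP

# on the master, same as on any compute node:
yum install slurm-slurmd
systemctl enable --now slurmd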


HTH,
--Dani_L.


On 7/30/19 3:45 PM, wodel youchi wrote:


  
  Hi,


I am newbie in Slurm,


All examples I saw when they declare the Partition, only
  compute nodes are used.
My question is : can I use the manager or the slurmctldhost
  (the master host) as a compute node in and extended partition
  for example?
if yes how?


Regards.
  

  




Re: [slurm-users] Weekend Partition

2019-07-23 Thread Daniel Letai

  
  
I would use a partition with very low priority and preemption.

General cluster conf:

PreemptType=preempt/partition_prio
PreemptMode=Cancel   # anything except 'Off'

Partition definition:

PartitionName=weekend PreemptMode=Cancel MaxTime=Unlimited PriorityTier=1 State=Down
  
  
  
Use cron to 'scontrol update PartitionName=weekend state=up' when desired, and 'scontrol update PartitionName=weekend state=down' on Sunday (a crontab sketch follows below).

This will not cancel the jobs on its own, but will prevent new ones from starting.
The preemption will kill jobs as required to allow regular jobs to run - the added value is that as long as they don't prevent other jobs from starting, the weekend jobs can continue and won't be killed needlessly.
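For reference, the crontab entries could be as simple as this (the exact times are just an example, and paths may differ on your system):

# open the weekend partition Friday 18:00, close it again Sunday 23:59
0 18 * * 5   /usr/bin/scontrol update PartitionName=weekend State=UP
59 23 * * 0  /usr/bin/scontrol update PartitionName=weekend State=DOWN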


Just my 2 cents.


The other option is to use a recurring reservation with a start and stop time frame, and force jobs to use that reservation (possibly via a QOS).
This solution might look something like:

scontrol create reservation StartTime=00:00:01 Duration=<duration> Flags=<flags>

For Flags you have a couple of options:
WEEKEND
    Repeat the reservation at the same time on every weekend day (Saturday and Sunday).
WEEKLY
    Repeat the reservation at the same time every week.

So I would guess Duration=1-0 Flags=WEEKEND, or Duration=2-0 Flags=WEEKLY.
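A concrete (untested) sketch, assuming the weekend users are in a known list and the start date is a Saturday:

scontrol create reservation ReservationName=weekend Users=<weekend_users> Nodes=ALL StartTime=2019-07-27T00:00:01 Duration=2-00:00:00 Flags=WEEKLY

The weekend jobs would then have to specify --reservation=weekend (or have it enforced via a QOS or submit plugin).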


You will have to test to see what works best for you.


  
HTH
--Dani_L.
  

  
On 7/23/19 7:36 PM, Matthew BETTINGER
  wrote:


  Hello,

We run lsf and slurm here.  For LSF we have a weekend queue with no limit and jobs get killed after Sunday.  What is the best way to do something similar for slurm?  Reservation?  We would like to have any running jobs killed after Sunday if possible too.  Thanks. 



  




Re: [slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2019-07-10 Thread Daniel Letai

  
  
Thank you Artem,


I've made a mistake while typing the mail, in all cases it was
  'OMPI_MCA_pml=ucx' and not as written. When I went over the mail
  before sending, I must have erroneously 'fixed' it for some
  reason. 



Best regards,
--Dani_L.



On 7/9/19 9:06 PM, Artem Polyakov
  wrote:


  
  
  
  
  

  Hello, Daniel 
  
  
  Let me try to reproduce locally
and get back to you.




  
  Best regards, 
  Artem Y. Polyakov, PhD
  Senior Architect, SW
  Mellanox Technologies

  
  
From: p...@googlegroups.com on behalf of Daniel Letai
Sent: Tuesday, July 9, 2019 3:25:22 AM
To: Slurm User Community List; p...@googlegroups.com; ucx-gr...@elist.ornl.gov
Subject: [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
 
  
  
Cross posting to Slurm, PMIx and UCX lists.



Trying to execute a simple openmpi (4.0.1) mpi-hello-world
  via Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX
  (1.5.0) results in:


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun
  --export
SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS
  --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello



slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668
  [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed:
  Input/output error
  slurmstepd: error: n1 [0] pmixp_dconn.h:243
  [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct
  connection to n2 (1)
  slurmstepd: error: n1 [0] pmixp_server.c:731
  [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to
  1
  srun: Job step aborted: Waiting up to 32 seconds for job step
  to finish.
  slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect]
  mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
  slurmstepd: error: n2 [1] pmixp_dconn.h:243
  [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct
  connection to n1 (0)
  slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT
  2019-07-01T13:20:36 ***
  slurmstepd: error: n2 [1] pmixp_server.c:731
  [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to
  0
  srun: error: n2: task 1: Killed
  srun: error: n1: task 0: Killed


However, the following works:


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun
  --export
SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS
  --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello

  n2: Process 1 out of 2
  n1: Process 0 out of 2


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export
SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS
  --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello


n2: Process 1 out of 2
  n1: Process 0 out of 2


Executing mpirun directly (same env vars, without the slurm
  vars) works, so UCX appears to function correctly.



If both SLURM_PMIX_DIRECT_CONN_EARLY=true and
  SLURM_PMIX_DIRECT_CONN_UCX=true then I get collective timeout
  errors from mellanox/hcoll and glibc detected
  /data/mpihello/mpihello: malloc(): memory corruption (fast) 



Can anyone help using PMIx direct connection with UCX in
  Slurm?







Some info about my setup:


UCX version

[root@n1 ~]# ucx_info -v
# UCT version=1.5.0 revision 02078b9

[slurm-users] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2019-07-09 Thread Daniel Letai

  
  
Cross posting to Slurm, PMIx and UCX lists.



Trying to execute a simple openmpi (4.0.1) mpi-hello-world via
  Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0)
  results in:


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export
  SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N
  2 -n 2 /data/mpihello/mpihello



slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect]
  mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
  slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect]
  mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
  slurmstepd: error: n1 [0] pmixp_server.c:731
  [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
  srun: Job step aborted: Waiting up to 32 seconds for job step to
  finish.
  slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect]
  mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
  slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect]
  mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
  slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT
  2019-07-01T13:20:36 ***
  slurmstepd: error: n2 [1] pmixp_server.c:731
  [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 0
  srun: error: n2: task 1: Killed
  srun: error: n1: task 0: Killed


However, the following works:


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=false
  UCX_TLS=rc,shm srun --export
SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N
  2 -n 2 /data/mpihello/mpihello

  n2: Process 1 out of 2
  n1: Process 0 out of 2


[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false
  SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true
  OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' 
  SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export
SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,
  UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N
  2 -n 2 /data/mpihello/mpihello


n2: Process 1 out of 2
  n1: Process 0 out of 2


Executing mpirun directly (same env vars, without the slurm vars)
  works, so UCX appears to function correctly.



If both SLURM_PMIX_DIRECT_CONN_EARLY=true and
  SLURM_PMIX_DIRECT_CONN_UCX=true then I get collective timeout
  errors from mellanox/hcoll and glibc detected
  /data/mpihello/mpihello: malloc(): memory corruption (fast) 



Can anyone help using PMIx direct connection with UCX in Slurm?







Some info about my setup:


UCX version

[root@n1 ~]# ucx_info -v
# UCT version=1.5.0 revision 02078b9
  # configured with: --build=x86_64-redhat-linux-gnu
  --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu
  --program-prefix= --prefix=/usr --exec-prefix=/usr
  --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
  --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
  --libexecdir=/usr/libexec --localstatedir=/var
  --sharedstatedir=/var/lib --mandir=/usr/share/man
  --infodir=/usr/share/info --disable-optimizations
  --disable-logging --disable-debug --disable-assertions --enable-mt
  --disable-params-check


Mellanox OFED version:

[root@n1 ~]# ofed_info -s
  OFED-internal-4.5-1.0.1:


Slurm:

slurm was built with:
  rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx
  --define '_with_pmix --with-pmix=/usr'


PMIx:

[root@n1 ~]# pmix_info -c --parsable
  config:user:root
  config:timestamp:"Mon Mar 25 09:51:04 IST 2019"
  config:host:slurm-test
  config:cli: '--host=x86_64-redhat-linux-gnu'
  '--build=x86_64-redhat-linux-gnu' '--program-prefix='
  '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin'
  '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share'
  '--includedir=/usr/include' '--libdir=/usr/lib64'
  '--libexecdir=/usr/libexec' '--localstatedir=/var'
  '--sharedstatedir=/var/lib' '--mandir=/usr/share/man'
  '--infodir=/usr/share/info'


Thanks,
--Dani_L.

  




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai

  
  
I had similar problems in the past.
The 2 most common issues were:
1. Controller load - if the slurmctld was in heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit.
2. Topology and msg forwarding and aggregation.


For 2 - it would seem the nodes designated for forwarding are
  statically assigned based on topology. I could be wrong, but
  that's my observation, as I would get the socket timeout error
  when they had issues, even though other nodes in the same topology
  'zone' were ok and could be used instead.


It took debug3 to observe this in the logs, I think.
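If it helps, the log level can be raised temporarily without a restart, for example:

scontrol setdebug debug3
# ...reproduce the failing sbatch...
scontrol setdebug info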


HTH
--Dani_L.



On 6/11/19 5:27 PM, Steffen Grunewald
  wrote:


  On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:

  
Hi 

Since mid-March 2019 we are having a strange problem with slurm. Sometimes, the command "sbatch" fails:

+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

  
  
I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...


  
Ecflow runs preprocessing on the script which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1". 

The problem we have is that sometimes the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of slurm, to
# scontrol show config
(...)
SlurmctldDebug  = info

But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?

Thanks for your attention.

Best Regards

mg.


  
  
- S




  




Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Daniel Letai

  
  
Hi Loris,


On 3/21/19 6:21 PM, Loris Bennett
  wrote:

Chris, maybe
  you should look at EasyBuild
  (https://easybuild.readthedocs.io/en/latest/).  That way you can install
all the dependencies (such as zlib) as modules and be pretty much
independent of the ancient packages your distro may provide (other

Cheers,

Loris




Do you have experience with spack or flatpak too? They all seem
  to solve the same problem, and I'd be interested in any comparison
  based on experience.
https://spack.readthedocs.io/en/latest/
http://docs.flatpak.org/en/latest/

  




Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-21 Thread Daniel Letai

  
  
Hi Peter,


On 3/20/19 11:19 AM, Peter Steinbach
  wrote:

[root@ernie
  /]# scontrol show node -dd g1
  
  NodeName=g1 CoresPerSocket=4
  
     CPUAlloc=3 CPUTot=4 CPULoad=N/A
  
     AvailableFeatures=(null)
  
     ActiveFeatures=(null)
  
     Gres=gpu:titanxp:2
  
     GresDrain=N/A
  
     GresUsed=gpu:titanxp:0(IDX:N/A)
  
     NodeAddr=127.0.0.1 NodeHostName=localhost Port=0
  

If the following is true:

RealMemory=4000 AllocMem=4000 FreeMem=N/A Sockets=1 Boards=1

- that is, all of the node's memory is allocated to the job - then I don't think any new job will start on that node, regardless of gres.
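If the intent is to leave room for a later gres job, one option (a sketch, with values made up for this 4-core / 4000 MB node) is a per-CPU default memory limit in slurm.conf, so a 3-CPU job does not claim all the memory by default:

# each allocated CPU gets 1000 MB unless the job asks for more,
# leaving 1000 MB free for a fourth (e.g. GPU) job on this node
DefMemPerCPU=1000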
  
  State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
  MCS_label=N/A
  
     Partitions=gpu
  
     BootTime=2019-03-18T10:14:18
  SlurmdStartTime=2019-03-20T09:07:45
  
     CfgTRES=cpu=4,mem=4000M,billing=4
  
     AllocTRES=cpu=3,mem=4000M
  
     CapWatts=n/a
  
     CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
  
     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
  
  
  I now filled the 'cluster' with non-gres jobs and  I submitted a
  GPU job:
  



  




Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Daniel Letai

  
  
Hi.
On 12/03/2019 22:53:36, Riccardo
  Veraldi wrote:


  
  

  

  

Hello,
after trying hard for over 10 days I am forced to write to the list.
I am not able to have SLURM work with openmpi. Openmpi-compiled binaries won't run on slurm, while all non-openmpi progs run just fine under "srun". I am using SLURM 18.08.5, building the rpm from the tarball: rpmbuild -ta slurm-18.08.5-2.tar.bz2

Prior to building SLURM I installed openmpi 4.0.0, which has built-in pmix support. The pmix libraries are in /usr/lib64/pmix/, which is the default installation path.

The problem is that hellompi is not working if I launch it from srun. Of course it runs outside slurm.
  
  
  [psanagpu105:10995] OPAL ERROR: Not initialized
in file pmix3x_client.c at line 113
--
The application appears to have been direct launched
using "srun",
but OMPI was not built with SLURM's PMI support and
therefore cannot
execute. There are several options for building PMI
support under
  

  

  

  

I would guess (but having the config.log files would verify it) that you should rebuild Slurm --with-pmix, and then rebuild OpenMPI --with-slurm.
Currently there might be a bug in Slurm's configure file building PMIx support without a path, so you might either modify the spec before building (add --with-pmix=/usr to the configure section) or, for testing purposes, run ./configure --with-pmix=/usr; make; make install.

It seems your current configuration has a built-in mismatch - Slurm only supports pmi2, while OpenMPI only supports PMIx. You should build with at least one common PMI: either external PMIx when building Slurm, or Slurm's PMI2 when building OpenMPI.
However, I would have expected the non-PMI option (srun --mpi=openmpi) to work even in your env, and Slurm should have built PMIx support automatically since it's in the default search path.
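Roughly, and assuming the default /usr prefix from your description (untested, paths may differ), the two rebuilds would look like:

# rebuild Slurm with external PMIx support
rpmbuild -ta slurm-18.08.5-2.tar.bz2 --define '_with_pmix --with-pmix=/usr'

# rebuild Open MPI against Slurm and the same PMIx
./configure --with-slurm --with-pmix=/usr && make && make install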




  

  

  

  SLURM, depending upon the SLURM version you are
using:

  version 16.05 or later: you can use SLURM's PMIx
support. This
  requires that you configure and build SLURM
--with-pmix.

  Versions earlier than 16.05: you must use either
SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or
you can manually
  install PMI-2. You must then build Open MPI using
--with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
***    and potentially your MPI job)
[psanagpu105:10995] Local abort before MPI_INIT
completed completed successfully, but am not able to
aggregate error messages, and not able to guarantee
that all other processes were killed!
srun: error: psanagpu105: task 0: Exited with exit
code 1
  
  
  
I really have no clue. I even reinstalled openmpi on a specific different path, /opt/openmpi/4.0.0; anyway it seems like slurm does not know how to find the MPI libraries even though they are there, and right now in the default path /usr/lib64.
  
  
  even using --mpi=pmi2 or --mpi=openmpi does not
fix the problem and the same error message is given
to me.
  srun --mpi=list
srun: MPI types 

Re: [slurm-users] Visualisation -- Slurm and (Turbo)VNC

2019-01-03 Thread Daniel Letai

  
  
I haven't done this in a long time, but this blog entry might be
  of some use (I believe I did something similar when required in
  the past) :
https://summerofhpc.prace-ri.eu/remote-accelerated-graphics-with-virtualgl-and-turbovnc/
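For what it's worth, a very rough sbatch template (binary path, partition/gres names and geometry are all assumptions, untested) could serve as a starting point for such a script:

#!/bin/bash
#SBATCH --job-name=turbovnc
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=08:00:00

# start a TurboVNC server on the allocated node and keep the allocation alive
/opt/TurboVNC/bin/vncserver -geometry 1920x1080
echo "TurboVNC started on $(hostname); see ~/.vnc for the display number and log"
sleep infinity

The user would then point a VNC viewer (possibly through an SSH tunnel) at the node and display reported in ~/.vnc.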


On 03/01/2019 12:14:52, Baker D.J.
  wrote:


  
  
  
Hello,


We have set up our
  NICE/DCV cluster and that is proving to be very popular. There
  are, however, users who would benefit from using the resources
  offered by our nodes with multiple GPU cards. This potentially
  means setting up TurboVNC, for example. I would, if possible,
  like to be able to make the process of starting a VNC server
  as painless as possible. I wondered if anyone had written a
  slurm script that users could modify/submit to reserve
  resources and start the VNC server. 


If you have such a template script and/or any advice on using VNC via slurm then I would be interested to hear from you, please. Many of our visualization users are not "expert users" and so, as I note above, it would be useful to try to make the process as painless as possible. If you would be happy to share your script with us then that would be appreciated.


Best regards,
David


  

-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




Re: [slurm-users] Can frequent hold-release adversely affect slurm?

2018-10-19 Thread Daniel Letai

  
  


On 18/10/2018 20:34, Eli V wrote:


  On Thu, Oct 18, 2018 at 1:03 PM Daniel Letai  wrote:

  


Hello all,


To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with same priority should be executed with minimal starvation of any array - we don't want to wait for each array to complete before starting the next one - we wish to implement "interleaving" between arrays, we came up with the following scheme:


Start all arrays in this partition in a "Hold" state.

Release a predefined number of elements (E.g., 200)

from this point a slurmctld prolog takes over:

On the 200th job run squeue, note the next job array (array id following the currently executing array id)

Release a predefined number of elements (E.g., 200)

and repeat


This might produce a very large number of release requests to the scheduler in a short time frame, and one concern is the scheduler loop getting too many requests.

Can you think of other issues that might come up with this approach?


Do you have any recommendations, or might suggest a better approach to solve this problem?

  
  
I can't comment on the scalability issues but if possible using %200
on the array submission seems like the simplest solution. From the
sbatch man page:
For  example "--array=0-15%4" will limit the number of simultaneously
running tasks from this job array to 4.

That won't achieve the same result - we want a round-robin solution; your proposal would hard-limit each array to 200 running jobs, which would leave most of the cluster underutilized.
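For context, the release step in our scheme would look roughly like the fragment below (a simplified, hypothetical sketch of the slurmctld prolog logic; the count of 200 and the availability of SLURM_ARRAY_JOB_ID in that context are assumptions on my part):

# find the next pending array after the currently running one
next=$(squeue -h -t PD -o %F | sort -n | uniq | awk -v cur="$SLURM_ARRAY_JOB_ID" '$1 > cur {print $1; exit}')
# release its first 200 elements, one task at a time
for i in $(seq 0 199); do
    scontrol release "${next}_${i}"
done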

  


  

We have considered fairshare, but all arrays are from same account and user. We have considered creating accounts on the fly (1 for each array) but get an error ("This should never happen") after creating a few thousand accounts.

To my understanding fairshare is only viable between accounts.

      
      



-- 
Regards,

Daniel Letai
+972 (0)505 870 456
  




[slurm-users] Can frequent hold-release adversely affect slurm?

2018-10-18 Thread Daniel Letai

  
  

Hello all,



To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with the same priority should be executed with minimal starvation of any array - we don't want to wait for each array to complete before starting the next one - we wish to implement "interleaving" between arrays. We came up with the following scheme:


Start all arrays in this partition in a "Hold" state.
Release a predefined number of elements (E.g., 200)
from this point a slurmctld prolog takes over:

On the 200th job, run squeue and note the next job array (the array id following the currently executing array id)
Release a predefined number of elements (E.g., 200)
and repeat



This might produce a very large number of release requests to the
  scheduler in a short time frame, and one concern is the scheduler
  loop getting too many requests.
Can you think of other issues that might come up with this
  approach?



Do you have any recommendations, or might suggest a better
  approach to solve this problem?


We have considered fairshare, but all arrays are from the same account and user. We have considered creating accounts on the fly (one for each array) but get an error ("This should never happen") after creating a few thousand accounts.
To my understanding, fairshare is only viable between accounts.

  




Re: [slurm-users] Is it possible to select the BatchHost for a job through some sort of prolog script?

2018-07-09 Thread Daniel Letai




On 06/07/2018 10:22, Steffen Grunewald wrote:

On Fri, 2018-07-06 at 07:47:16 +0200, Loris Bennett wrote:

Hi Tim,

Tim Lin  writes:


As the title suggests, I'm searching for a way to have tighter control over which node the batch script gets executed on. In my case it's very hard to know which node is best for this until after all the nodes are allocated, right before the batch job starts. I've looked through all the documentation I can get my hands on but I haven't found any mention of any control over the batch host for admins. Am I missing something?

As the documentation of 'sbatch' says:

   "When the job allocation is finally granted for the batch script,
   Slurm runs a single copy of the batch script on the first node in the
   set of allocated nodes. "
   
I am not aware of any way of changing this.


Perhaps you can explain why you feel it is necessary for you do this.

For me, the above reads like the user has an idea of a metric for how to select
the node for rank-0 (and perhaps the code is sufficiently asymmetric to justify
such a selection), but no way to tell Slurm about it.
What about making the batch script a wrapper around the real payload, on the
"outer first node" take the list of assigned nodes and possibly reorder it, then
run the payload (via passphrase-less ssh?) on the selected, "new first" node?
Why not just use salloc instead? Allocate all the nodes for the job, 
then use the script to select (ssh?) the master and start the actual job 
there.


I'm still not sure why that would be necessary, though. Could you give a 
clear example of the master selection process? What metric/constraint is 
involved, and why can it only be obtained after node selection?

This may require changing some more environment variables, and may harm 
signalling.
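As a rough illustration of the salloc idea (the node count, the "metric" and the payload path are placeholders):

salloc -N 4 bash -c '
  # hypothetical: rank the allocated nodes by some site-specific metric and pick the "best" one
  best=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sort | head -n1)
  ssh "$best" /path/to/payload
'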

Okay, my suggestion reads like a terrible kludge (which it certainly is), but
AFAIK there's no way to tell Slurm about "preferred first nodes".

- S