Hi, I'm having some trouble with resource allocation: based on my understanding of the documentation and the way I applied it in the config file, I expect behavior that does not happen.
Here is the relevant excerpt from the config file:

    SchedulerType=sched/backfill
    SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory
    FastSchedule=1
    ...
    NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
    PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP

According to the above, I have the backfill scheduler enabled, with CPUs and memory configured as consumable resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would expect the backfill scheduler to try to allocate resources so as to fill as many of the cores as possible when multiple jobs ask for more resources than are available. In my case I have the following queue:

    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
     2361 main_comp training mcetatea PD   0:00      1 (Resources)
     2356 main_comp skrf_ori   jhanca  R  58:41      1 cn_burebista
     2357 main_comp skrf_ori   jhanca  R  44:13      1 cn_burebista

Jobs 2356 and 2357 are asking for 16 CPUs each, and job 2361 is asking for 20 CPUs, i.e. 52 CPUs in total. As seen above, job 2361 (which was started by a different user) is marked as pending due to lack of resources, although there are plenty of CPUs and plenty of memory available.
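To spell out the arithmetic behind my expectation, here is a back-of-the-envelope check (my own calculation in plain bash, not Slurm output). It covers both the raw CPU totals and, as I understand Slurm's behavior with ThreadsPerCore=2, the rounding of each CPU request up to whole cores:

```shell
# Back-of-the-envelope check (plain bash, not Slurm output).
# Assumption: with ThreadsPerCore=2, Slurm hands out whole cores, so a
# CPU request is rounded up to a multiple of the threads per core.
threads_per_core=2
total_cpus=56                       # 2 sockets * 14 cores * 2 threads
total_cores=$(( total_cpus / threads_per_core ))

cores_needed() {                    # round a CPU request up to whole cores
  echo $(( ($1 + threads_per_core - 1) / threads_per_core ))
}

requested=$(( 16 + 16 + 20 ))       # jobs 2356, 2357, 2361
cores=$(( $(cores_needed 16) + $(cores_needed 16) + $(cores_needed 20) ))

echo "CPUs:  $requested of $total_cpus requested"
echo "Cores: $cores of $total_cores needed"
```

Either way I count it, the three jobs should fit on the node (52 of 56 CPUs, or 26 of 28 cores), which is why the pending state surprises me.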
"scontrol show nodes cn_burebista" gives me the following: NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14 CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65 AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05 OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1 State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s I'm going through the documentation again and again but I cannot figure out what am I doing wrong ... Why do I have the above situation? What should I change to my config to make this work? scontrol show -dd job <jobid> shows me the following: JobId=2361 JobName=training_carlib UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A Priority=4294901726 Nice=0 Account=(null) QOS=(null) JobState=PENDING Reason=Resources Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38 StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=main_compute AllocNode:Sid=zalmoxis:23690 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) SchedNodeList=cn_burebista NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:* TRES=cpu=20,node=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier 
       StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
       StdIn=/dev/null
       StdOut=/home/mcetateanu/workspace/CarLib/src/_out

I also changed my config to specify the number of CPUs exactly, rather than letting Slurm compute the CPUs from Sockets, CoresPerSocket, and ThreadsPerCore. The two jobs that I am trying to run show the following output from "scontrol show -dd job <jobid>", but the one asking for 20 CPUs is still pending due to lack of resources:

       NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
       TRES=cpu=16,mem=32000M,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
         Nodes=cn_burebista CPU_IDs=0-15 Mem=32000
       MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0

       NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
       TRES=cpu=20,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

Thank you

-------------------------------------------------------------------------------------------
Marius Cetateanu
Senior Embedded Software Engineer
Engineering Department 1, Driver & Embedded
Sony Depthsensing Solutions
Tel: +32 (0)28992171
email: marius.cetate...@sony.com

Sony Depthsensing Solutions
11 Boulevard de la Plaine, 1050 Brussels, Belgium
________________________________________
From: slurm-users [slurm-users-boun...@lists.schedmd.com] on behalf of slurm-users-requ...@lists.schedmd.com [slurm-users-requ...@lists.schedmd.com]
Sent: Sunday, April 15, 2018 9:02 PM
To: slurm-users@lists.schedmd.com
Subject: slurm-users Digest, Vol 6, Issue 21

Today's Topics:

   1. Re: ulimit in sbatch script (Mahmood Naderan)
   2. Re: ulimit in sbatch script (Bill Barth)
   3. Re: ulimit in sbatch script (Mahmood Naderan)
   4. Re: ulimit in sbatch script (Mahmood Naderan)
   5. Re: ulimit in sbatch script (Bill Barth)

----------------------------------------------------------------------

Message: 1
Date: Sun, 15 Apr 2018 22:56:01 +0430
From: Mahmood Naderan <mahmood...@gmail.com>
To: ole.h.niel...@fysik.dtu.dk, Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script

I actually have disabled the swap partition (!)
since the system goes really bad, and in my experience I then have to enter the room and reset the affected machine (!). Otherwise I have to wait a long time for it to get back to normal.

When I ssh to the node as root, "ulimit -a" says virtual memory is unlimited. So it seems that root has an unlimited value while users have a limited one.

Regards,
Mahmood

On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
> Hi Mahmood,
>
> It seems your compute node is configured with this limit:
>
>   virtual memory          (kbytes, -v) 72089600
>
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
>
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes have a very small RAM + swap space size, you might encounter
> Out Of Memory errors...
>
> /Ole

------------------------------

Message: 2
Date: Sun, 15 Apr 2018 18:31:08 +0000
From: Bill Barth <bba...@tacc.utexas.edu>
To: Slurm User Community List <slurm-users@lists.schedmd.com>, "ole.h.niel...@fysik.dtu.dk" <ole.h.niel...@fysik.dtu.dk>
Subject: Re: [slurm-users] ulimit in sbatch script

Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That would be enforcing /etc/security/limits.conf for all users; those limits are usually unlimited for root, who is almost always allowed to do things bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that's probably where the unlimited setting for root is coming from.

Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

------------------------------

Message: 3
Date: Sun, 15 Apr 2018 23:01:32 +0430
From: Mahmood Naderan <mahmood...@gmail.com>
To: ole.h.niel...@fysik.dtu.dk, Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script

BTW, the memory size of the node is 64GB.

Regards,
Mahmood
------------------------------

Message: 4
Date: Sun, 15 Apr 2018 23:11:20 +0430
From: Mahmood Naderan <mahmood...@gmail.com>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script

Excuse me... I think the problem is not pam.d. How do you interpret the following output?
    [hamid@rocks7 case1_source2]$ sbatch slurm_script.sh
    Submitted batch job 53
    [hamid@rocks7 case1_source2]$ tail -f hvacSteadyFoam.log
    max memory size         (kbytes, -m) 65536000
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 4096
    virtual memory          (kbytes, -v) 72089600
    file locks                      (-x) unlimited
    ^C
    [hamid@rocks7 case1_source2]$ squeue
      JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
         53   CLUSTER hvacStea hamid  R  0:27     1 compute-0-3
    [hamid@rocks7 case1_source2]$ ssh compute-0-3
    Warning: untrusted X11 forwarding setup failed: xauth key data not generated
    Last login: Sun Apr 15 23:03:29 2018 from rocks7.local
    Rocks Compute Node
    Rocks 7.0 (Manzanita)
    Profile built 19:21 11-Apr-2018
    Kickstarted 19:37 11-Apr-2018
    [hamid@compute-0-3 ~]$ ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 256712
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 4096
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
    [hamid@compute-0-3 ~]$

As you can see, the log file where I put "ulimit -a" before the main command says virtual memory is limited. However, when I log in to the node, it says unlimited!

Regards,
Mahmood
------------------------------

Message: 5
Date: Sun, 15 Apr 2018 19:02:48 +0000
From: Bill Barth <bba...@tacc.utexas.edu>
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script

Mahmood, sorry to presume. I meant to address the root user and your ssh to the node in your example. At our site we use UsePAM=1 in our slurm.conf, and our /etc/pam.d/slurm and slurm.pam files both contain pam_limits.so, so it could be that way for you too. I.e. Slurm could be setting the limits for job scripts for your users, while for root SSHes the limits are being set by PAM through another config file. Also, root's limits are potentially set differently by PAM (in /etc/security/limits.conf) or by the kernel at boot time.

Finally, users should be careful using ulimit in their job scripts, because it can only change the limits of that shell script's process, not limits across nodes. Your job script appears to apply to only one node, but if users want different limits for jobs that span nodes, they may need other Slurm features to apply them across all the nodes their job uses (cgroups, perhaps?).

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
End of slurm-users Digest, Vol 6, Issue 21
******************************************