Hi,

see the man page for slurm.conf:

TmpFS
    Fully qualified pathname of the file system available to user jobs for
    temporary storage. This parameter is used in establishing a node's
    TmpDisk space. The default value is "/tmp".

So slurmd is using /tmp. You need to change that parameter to /local/scratch
and then it should work.
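For illustration, a minimal sketch of that change (the service names and
restart commands are assumptions, adjust them to your installation):

    # slurm.conf -- the same file on the controller and on the compute nodes
    TmpFS=/local/scratch

    # pick up the change, e.g. with systemd (or your init scripts)
    systemctl restart slurmd        # on tars-XXX
    systemctl restart slurmctld     # on tars-master

    # slurmd should then report a TmpDisk value taken from /local/scratch
    # (roughly 204000 MB instead of the 500 MB of /tmp)
    slurmd -C

    # once slurmctld accepts the registration, clear the drain
    scontrol update NodeName=tars-XXX State=RESUME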
Regards,

Uwe


Am 10.10.2017 um 14:09 schrieb Véronique LEGRAND:
> Hello Pierre-Marie,
>
> First, thank you for your hint. I just tried:
>
> >slurmd -C
> NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=0-20:50:54
>
> The value for TmpDisk is erroneous. I do not know what the cause of this
> could be, since the operating system's df command gives the right values:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.6G  1.9G  47% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M   84K  500M   1% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> Could slurmd be mixing up tmpfs and /local/scratch?
>
> I tried the same thing on another, similar node (tars-XXX-1) and got:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.7G  1.8G  49% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M  5.7M  495M   2% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> and
>
> slurmd -C
> NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=101-21:34:14
>
> So slurmd -C gives exactly the same answer, but this node does not go into
> DRAIN state; it works perfectly.
>
> Thank you again for your help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03
>
>
> *From:* "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
> *Reply-To:* slurm-dev <slurm-dev@schedmd.com>
> *Date:* Tuesday, 10 October 2017 at 13:53
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Hi Véronique,
>
> Did you check the result of slurmd -C on tars-XXX ?
>
> Regards,
>
> Pierre-Marie Le Biot
>
>
> *From:* Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
> *Sent:* Tuesday, October 10, 2017 12:02 PM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk
>
> Hello,
>
> I have a problem with one node in our cluster. It is exactly like all the
> other nodes (200 GB of temporary storage).
>
> Here is what I have in slurm.conf:
>
> # COMPUTES
> TmpFS=/local/scratch
>
> # NODES
> GresTypes=disk,gpu
> ReturnToService=2
> NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
> NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20
>
> The node that has the trouble is tars-XXX.
>
> Here is what I have in gres.conf:
>
> # Local disk space in MB (/local/scratch)
> NodeName=tars-[ZZZ-UUU] Name=disk Count=204000
>
> XXX is in the range [ZZZ,UUU].
> If I ssh to tars-XXX, here is what I get:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.6G  1.9G  47% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M   84K  500M   1% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> /local/scratch is the directory for temporary storage.
>
> The problem is that when I do
>
> scontrol show node tars-XXX
>
> I get:
>
> NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
>    AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
>    ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
>    Gres=disk:204000,gpu:0
>    NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
>    OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
>    BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]
>
> And in the slurmctld logs, I get the error messages:
>
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument
>
> I tried to reboot tars-XXX yesterday, but the problem is still there.
>
> I also tried:
>
> scontrol update NodeName=ClusterNode0 State=Resume
>
> but the state went back to DRAIN after a while…
>
> Does anyone have an idea of what could cause the problem? My configuration
> files seem correct and there really are 200 GB free in /local/scratch on
> tars-XXX…
>
> I thank you in advance for any help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03
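For completeness: as the man page excerpt above says, slurmd establishes the
TmpDisk value it registers from the file system mounted at the TmpFS path, so
a quick way to see what it will report on tars-XXX is to compare the sizes
directly (a sketch; the slurm.conf path below is an assumption and may differ
on your systems):

    df -m /local/scratch                      # total size in MB, roughly 204000
    df -m /tmp                                # about 500 MB, matching the slurmd -C output above
    grep -i '^TmpFS' /etc/slurm/slurm.conf    # the file this node actually reads
    scontrol show config | grep -i TmpFS      # what the running daemons use

Since both nodes print the same slurmd -C output but only tars-XXX is drained,
comparing the configuration that the two slurmd daemons actually load would be
a reasonable next check.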