Hello,

I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).

Here is what I have in slurm.conf:

# COMPUTES
TmpFS=/local/scratch

# NODES
GresTypes=disk,gpu
ReturnToService=2
NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20

The node that is having trouble is tars-XXX.

Here is what I have in gres.conf:

# Local disk space in MB (/local/scratch)
NodeName=tars-[ZZZ-UUU] Name=disk Count=204000

XXX is in the range [ZZZ,UUU].

If I ssh to tars-XXX, here is what I get:

-sh-4.1$ df -hl
Filesystem            Size  Used Avail Use% Mounted on
slash_root            3.5G  1.6G  1.9G  47% /
tmpfs                 127G     0  127G   0% /dev/shm
tmpfs                 500M   84K  500M   1% /tmp
/dev/sda1             200G   33M  200G   1% /local/scratch

/local/scratch is the directory for temporary storage.

The problem is that when I run "scontrol show node tars-XXX", I get:

NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
   ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
   Gres=disk:204000,gpu:0
   NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
   OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
   BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]

And in the slurmctld logs, I get these error messages:

2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument

I tried rebooting tars-XXX yesterday, but the problem is still there. I also tried:

scontrol update NodeName=ClusterNode0 State=Resume

but the state went back to DRAIN after a while.

Does anyone have an idea of what could cause the problem? The configuration files seem correct and there really are 200 GB free in /local/scratch on tars-XXX. In case it helps, I have put the basic checks I am running on the node in a short P.S. below.

Thank you in advance for any help.

Regards,

Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
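P.S. For completeness, here is the check-list I am going through on tars-XXX. These are just standard commands; I am not certain that "slurmd -C" still reports TmpDisk on 16.05, and the restart line depends on how slurmd is managed on our nodes, so take this as a rough sketch rather than a recipe.

# What slurmd itself detects for the node (should include TmpDisk, if this version reports it)
slurmd -C

# What the controller thinks TmpFS points to
scontrol show config | grep -i TmpFS

# Size of the scratch filesystem in MB, which is what gets compared against TmpDisk=204000
df -m /local/scratch

# Restart slurmd after confirming /local/scratch is mounted, so that the node re-registers
# (exact service name depends on how slurmd is managed on the node)
service slurm restart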