Véronique,

So that's the culprit:

2017-10-09T17:09:57.957336+02:00 tars-XXX slurmd[18640]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
For a reason you have to determine, when slurmd starts on tars-XXX it finds that the size of /local/scratch (assuming that this is the value of TmpFS in slurm.conf for this node) is 129186 MB, and it sends this value to slurmctld, which compares it with the value recorded in slurm.conf, namely 204000 for that node. By the way, 129186 MB is very close to the size of /dev/shm…

About the value returned by slurmd -C (500), it could be that /tmp is hardcoded somewhere.

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 4:33 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Pierre-Marie,

Here is what I have in slurmd.log on tars-XXX:

-sh-4.1$ sudo cat slurmd.log
2017-10-09T17:09:57.538636+02:00 tars-XXX slurmd[18597]: Message aggregation enabled: WindowMsgs=24, WindowTime=200
2017-10-09T17:09:57.647486+02:00 tars-XXX slurmd[18597]: CPU frequency setting not configured for this node
2017-10-09T17:09:57.647499+02:00 tars-XXX slurmd[18597]: Resource spec: Reserved system memory limit not configured for this node
2017-10-09T17:09:57.808352+02:00 tars-XXX slurmd[18597]: cgroup namespace 'freezer' is now mounted
2017-10-09T17:09:57.844400+02:00 tars-XXX slurmd[18597]: cgroup namespace 'cpuset' is now mounted
2017-10-09T17:09:57.902418+02:00 tars-XXX slurmd[18640]: slurmd version 16.05.9 started
2017-10-09T17:09:57.957030+02:00 tars-XXX slurmd[18640]: slurmd started on Mon, 09 Oct 2017 17:09:57 +0200
2017-10-09T17:09:57.957336+02:00 tars-XXX slurmd[18640]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03

From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, 10 October 2017 at 15:20
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Véronique,

This is not what I expected; I was thinking slurmd -C would return TmpDisk=204000 or, more probably, 129186 as seen in the slurmctld log. I suppose that you already checked the slurmd logs on tars-XXX?

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 2:09 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Hello Pierre-Marie,

First, thank you for your hint. I just tried:

>slurmd -C
NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=0-20:50:54

The value for TmpDisk is erroneous. I do not know what the cause of this can be, since the operating system's df command gives the right values:

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.6G  1.9G  47% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M   84K  500M   1% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch

Could slurmd be confusing tmpfs with /local/scratch?
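A quick sanity check here, assuming (as the figures above suggest) that slurmd derives TmpDisk from a statfs() of whatever path it takes as TmpFS: redo the arithmetic by hand on tars-XXX for each candidate mount and compare it with what the daemons are configured to use. This is only a sketch; the slurm.conf location below is an assumption, so adjust it to wherever the file actually lives on the node.

# Recompute the MB figure a statfs() of each candidate path would yield
# (paths taken from the df -hl output above).
for p in /local/scratch /dev/shm /tmp; do
    blocks=$(stat -f -c %b "$p")
    bsize=$(stat -f -c %S "$p")
    echo "$p -> $(( blocks * bsize / 1024 / 1024 )) MB"
done
# Per the df output, this should print roughly 204000, 129186 and 500.

# Which TmpFS value the daemons are actually working with
# (the config file path is an assumption):
grep -i '^TmpFS' /etc/slurm/slurm.conf
scontrol show config | grep -i TmpFS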
I tried the same thing on another similar node (tars-XXX-1). I got:

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.7G  1.8G  49% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M  5.7M  495M   2% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch

and slurmd -C:

NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=101-21:34:14

So slurmd -C gives exactly the same answer, but this node doesn't go into DRAIN state; it works perfectly.

Thank you again for your help.

Regards,
Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03

From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, 10 October 2017 at 13:53
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Hi Véronique,

Did you check the result of slurmd -C on tars-XXX?

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 12:02 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk

Hello,

I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).

Here is what I have in slurm.conf:

# COMPUTES
TmpFS=/local/scratch
# NODES
GresTypes=disk,gpu
ReturnToService=2
NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20

The node that has the trouble is tars-XXX.

Here is what I have in gres.conf:

# Local disk space in MB (/local/scratch)
NodeName=tars-[ZZZ-UUU] Name=disk Count=204000

XXX is in the range [ZZZ,UUU].

If I ssh to tars-XXX, here is what I get:

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.6G  1.9G  47% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M   84K  500M   1% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch

/local/scratch is the directory for temporary storage.

The problem is that when I do scontrol show node tars-XXX, I get:

NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
   ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
   Gres=disk:204000,gpu:0
   NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
   OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
   BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]

And in the slurmctld logs, I get the error messages:

2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument

I tried to reboot tars-XXX yesterday but the problem is still there.
I also tried:

scontrol update NodeName=ClusterNode0 State=Resume

but the state went back to DRAIN after a while…

Does anyone have an idea of what could be causing the problem? My configuration files seem correct, and there really are 200 GB free in /local/scratch on tars-XXX…

Thank you in advance for any help.

Regards,
Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
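For completeness, a rough sketch of resuming the node once the underlying TmpDisk misreport is fixed; how slurmd gets restarted depends on the site setup, so that step is left as a placeholder, and this is not meant as a definitive procedure:

# On tars-XXX: fix whatever makes slurmd read the wrong mount, then restart slurmd
# (via whichever init mechanism manages it on the compute nodes).
# From the controller, clear the drain and check that it sticks:
scontrol update NodeName=tars-XXX State=RESUME
scontrol show node tars-XXX
# The node should stay out of DRAIN, and no new "Low TmpDisk" reason or
# "has low tmp_disk size" error should reappear in the slurmctld log
# once the node registers TmpDisk at or above the configured 204000.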