Hello,

I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).

Here is what I have in slurm.conf:

# COMPUTES
TmpFS=/local/scratch

# NODES
GresTypes=disk,gpu
ReturnToService=2
NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20

The node that is having trouble is tars-XXX.

Here is what I have in gres.conf:

# Local disk space in MB (/local/scratch)
NodeName=tars-[ZZZ-UUU] Name=disk Count=204000

XXX is in the range [ZZZ,UUU].

If I ssh to tars-XXX, here is what I get:

-sh-4.1$ df -hl
Filesystem            Size  Used Avail Use% Mounted on
slash_root            3.5G  1.6G  1.9G  47% /
tmpfs                 127G     0  127G   0% /dev/shm
tmpfs                 500M   84K  500M   1% /tmp
/dev/sda1             200G   33M  200G   1% /local/scratch

/local/scratch is the directory for temporary storage.

The problem is that when I run "scontrol show node tars-XXX", I get:

NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
   ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
   Gres=disk:204000,gpu:0
   NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
   OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
   BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]

And in the slurmctld logs, I get these error messages:

2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument

I tried rebooting tars-XXX yesterday, but the problem is still there. I also tried:

scontrol update NodeName=ClusterNode0 State=Resume

but the state went back to DRAIN after a while.

Does anyone have an idea of what could cause the problem? The configuration files seem correct and there really are 200 GB free in /local/scratch on tars-XXX. In case it helps, I have put the basic checks I am running on the node in a short P.S. below.

Thank you in advance for any help.

Regards,

Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
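P.S. For completeness, here is the check-list I am going through on tars-XXX. These are just standard commands; I am not certain that "slurmd -C" still reports TmpDisk on 16.05, and the restart line depends on how slurmd is managed on our nodes, so take this as a rough sketch rather than a recipe.

# What slurmd itself detects for the node (should include TmpDisk, if this version reports it)
slurmd -C

# What the controller thinks TmpFS points to
scontrol show config | grep -i TmpFS

# Size of the scratch filesystem in MB, which is what gets compared against TmpDisk=204000
df -m /local/scratch

# Restart slurmd after confirming /local/scratch is mounted, so that the node re-registers
# (exact service name depends on how slurmd is managed on the node)
service slurm restart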