Hi,

see the man page for slurm.conf:

TmpFS
    Fully qualified pathname of the file system available to user jobs for
    temporary storage. This parameter is used in establishing a node's
    TmpDisk space. The default value is "/tmp".

So slurmd is using /tmp. You need to change that parameter to /local/scratch
and then it should work.
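For illustration, a minimal sketch of that change (the service names and
restart commands are assumptions, adjust them to your installation):

    # slurm.conf -- the same file on the controller and on the compute nodes
    TmpFS=/local/scratch

    # pick up the change, e.g. with systemd (or your init scripts)
    systemctl restart slurmd        # on tars-XXX
    systemctl restart slurmctld     # on tars-master

    # slurmd should then report a TmpDisk value taken from /local/scratch
    # (roughly 204000 MB instead of the 500 MB of /tmp)
    slurmd -C

    # once slurmctld accepts the registration, clear the drain
    scontrol update NodeName=tars-XXX State=RESUME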
Regards,

Uwe


Am 10.10.2017 um 14:09 schrieb Véronique LEGRAND:
> Hello Pierre-Marie,
>
> First, thank you for your hint. I just tried:
>
> >slurmd -C
> NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=0-20:50:54
>
> The value for TmpDisk is erroneous. I do not know what the cause of this
> could be, since the operating system's df command gives the right values:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.6G  1.9G  47% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M   84K  500M   1% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> Could slurmd be mixing up tmpfs and /local/scratch?
>
> I tried the same thing on another, similar node (tars-XXX-1) and got:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.7G  1.8G  49% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M  5.7M  495M   2% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> and
>
> slurmd -C
> NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=101-21:34:14
>
> So slurmd -C gives exactly the same answer, but this node does not go into
> DRAIN state; it works perfectly.
>
> Thank you again for your help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03
>
>
> *From:* "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
> *Reply-To:* slurm-dev <slurm-dev@schedmd.com>
> *Date:* Tuesday, 10 October 2017 at 13:53
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Hi Véronique,
>
> Did you check the result of slurmd -C on tars-XXX ?
>
> Regards,
>
> Pierre-Marie Le Biot
>
>
> *From:* Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
> *Sent:* Tuesday, October 10, 2017 12:02 PM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk
>
> Hello,
>
> I have a problem with one node in our cluster. It is exactly like all the
> other nodes (200 GB of temporary storage).
>
> Here is what I have in slurm.conf:
>
> # COMPUTES
> TmpFS=/local/scratch
>
> # NODES
> GresTypes=disk,gpu
> ReturnToService=2
> NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
> NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20
>
> The node that has the trouble is tars-XXX.
>
> Here is what I have in gres.conf:
>
> # Local disk space in MB (/local/scratch)
> NodeName=tars-[ZZZ-UUU] Name=disk Count=204000
>
> XXX is in the range [ZZZ,UUU].
> If I ssh to tars-XXX, here is what I get:
>
> -sh-4.1$ df -hl
> Filesystem   Size  Used Avail Use% Mounted on
> slash_root   3.5G  1.6G  1.9G  47% /
> tmpfs        127G     0  127G   0% /dev/shm
> tmpfs        500M   84K  500M   1% /tmp
> /dev/sda1    200G   33M  200G   1% /local/scratch
>
> /local/scratch is the directory for temporary storage.
>
> The problem is that when I do
>
> scontrol show node tars-XXX
>
> I get:
>
> NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
>    AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
>    ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
>    Gres=disk:204000,gpu:0
>    NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
>    OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
>    BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]
>
> And in the slurmctld logs, I get the error messages:
>
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument
>
> I tried to reboot tars-XXX yesterday, but the problem is still there.
>
> I also tried:
>
> scontrol update NodeName=ClusterNode0 State=Resume
>
> but the state went back to DRAIN after a while…
>
> Does anyone have an idea of what could cause the problem? My configuration
> files seem correct and there really are 200 GB free in /local/scratch on
> tars-XXX…
>
> I thank you in advance for any help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03
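For completeness: as the man page excerpt above says, slurmd establishes the
TmpDisk value it registers from the file system mounted at the TmpFS path, so
a quick way to see what it will report on tars-XXX is to compare the sizes
directly (a sketch; the slurm.conf path below is an assumption and may differ
on your systems):

    df -m /local/scratch                      # total size in MB, roughly 204000
    df -m /tmp                                # about 500 MB, matching the slurmd -C output above
    grep -i '^TmpFS' /etc/slurm/slurm.conf    # the file this node actually reads
    scontrol show config | grep -i TmpFS      # what the running daemons use

Since both nodes print the same slurmd -C output but only tars-XXX is drained,
comparing the configuration that the two slurmd daemons actually load would be
a reasonable next check.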