Thank you very much to you and Uwe. We will run the tests very soon (luckily this happens on only one machine).
As for slurmd -C, this explains why it says "500" for TmpFS's size.

Regards,
Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03

On 12/10/2017, 11:21, "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com> wrote:

Hello Véronique,

slurmd uses statvfs or statfs (the choice is made at build time) to get the TmpFS size (with a default of /tmp if TmpFS is null). I think Uwe is right: it could be a filesystem mount timing problem. You can test it easily:

- starting from a correct state, stop slurmd
- unmount /local/scratch
- start slurmd
- check slurmd.log

Regarding slurmd -C, I checked the source (17.02.6): the size of /tmp is returned (hardcoded). Not sure if it should be considered a bug.

Regards,
Pierre-Marie Le Biot

-----Original Message-----
From: Uwe Sauter [mailto:uwe.sauter...@gmail.com]
Sent: Wednesday, October 11, 2017 4:18 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

What distribution are you using? If it is using systemd, then it is possible that slurmd gets started before /local/scratch is mounted. You would need to add a dependency to the slurmd service so that it waits until /local/scratch is mounted before the service is started.

On 11.10.2017 at 15:38, Véronique LEGRAND wrote:
> Hello Pierre-Marie,
>
> I stopped the slurmd daemon on tars-XXX, then restarted it in the foreground with:
>
> sudo /my/path/to/slurmd -vvvvvvvv -D -d /opt/slurm/sbin/slurmstepd
>
> and got:
>
> slurmd: Gres Name=disk Type=(null) Count=204000
>
> and also:
>
> slurmd: debug3: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=204699 Uptime=162294 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> in the output.
>
> So, the value this time was correct.
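Uwe's suggested dependency could be expressed, on a systemd-based node, as a drop-in for the slurmd unit. This is a sketch under the assumption that the service is named slurmd.service; the /etc/init.d/slurm commands later in this thread suggest the node may actually use SysV init, in which case the ordering would have to be fixed in the init script instead.

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
# Delays slurmd startup until /local/scratch is mounted, so slurmd measures
# the real scratch filesystem rather than the not-yet-mounted mount point.
[Unit]
RequiresMountsFor=/local/scratch
```

After adding the drop-in, `systemctl daemon-reload` picks it up; RequiresMountsFor both orders the unit after the mount unit and pulls the mount in as a requirement.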
> In slurmctld.log, I have:
>
> 2017-10-11T12:24:01+02:00 tars-master slurmctld[120352]: Node tars-113 now responding
> 2017-10-11T12:24:01+02:00 tars-master slurmctld[120352]: node tars-113 returned to service
>
> I waited 2 hours and did not get any "error: Node tars-XXX has low tmp_disk size (129186 < 204000)".
>
> So, I stopped it and started it again in the usual way:
>
> sudo /etc/init.d/slurm start (at 2:38 pm)
>
> I got no error message in slurmctld.log and no erroneous value in slurmd.log.
>
> At 2:49 pm, I rebooted the machine, and here is what I got in slurmd.log:
>
> -sh-4.1$ sudo cat slurmd.log
> 2017-10-11T14:50:30.742049+02:00 tars-113 slurmd[18621]: Message aggregation enabled: WindowMsgs=24, WindowTime=200
> 2017-10-11T14:50:30.797696+02:00 tars-113 slurmd[18621]: CPU frequency setting not configured for this node
> 2017-10-11T14:50:30.797706+02:00 tars-113 slurmd[18621]: Resource spec: Reserved system memory limit not configured for this node
> 2017-10-11T14:50:30.986903+02:00 tars-113 slurmd[18621]: cgroup namespace 'freezer' is now mounted
> 2017-10-11T14:50:31.023900+02:00 tars-113 slurmd[18621]: cgroup namespace 'cpuset' is now mounted
> 2017-10-11T14:50:31.066430+02:00 tars-113 slurmd[18633]: slurmd version 16.05.9 started
> 2017-10-11T14:50:31.123213+02:00 tars-113 slurmd[18633]: slurmd started on Wed, 11 Oct 2017 14:50:31 +0200
> 2017-10-11T14:50:31.123493+02:00 tars-113 slurmd[18633]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> The erroneous value was back again!
> So, I did it again:
>
> sudo /etc/init.d/slurm stop
> sudo /etc/init.d/slurm start
>
> and the following lines were added to the log:
>
> 2017-10-11T14:51:29.707556+02:00 tars-113 slurmd[18633]: Slurmd shutdown completing
> 2017-10-11T14:51:51.496552+02:00 tars-113 slurmd[19047]: Message aggregation enabled: WindowMsgs=24, WindowTime=200
> 2017-10-11T14:51:51.555792+02:00 tars-113 slurmd[19047]: CPU frequency setting not configured for this node
> 2017-10-11T14:51:51.555803+02:00 tars-113 slurmd[19047]: Resource spec: Reserved system memory limit not configured for this node
> 2017-10-11T14:51:51.567003+02:00 tars-113 slurmd[19049]: slurmd version 16.05.9 started
> 2017-10-11T14:51:51.569174+02:00 tars-113 slurmd[19049]: slurmd started on Wed, 11 Oct 2017 14:51:51 +0200
> 2017-10-11T14:51:51.569533+02:00 tars-113 slurmd[19049]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=204699 Uptime=155 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> The value for TmpDisk was correct again.
>
> So, my question is: where does slurmd read the value for the size of /local/scratch? Does it use "df" or another command? It seems that on startup, slurmd reads a value that is not yet set correctly…
>
> Thank you in advance for any help.
> Regards,
>
> Véronique
>
> From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
> Reply-To: slurm-dev <slurm-dev@schedmd.com>
> Date: Tuesday, 10 October 2017 at 17:00
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Véronique,
>
> So that's the culprit:
>
> 2017-10-09T17:09:57.957336+02:00 tars-XXX slurmd[18640]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> For a reason you have to determine, when slurmd starts on tars-XXX it finds that the size of /local/scratch (assuming that this is the value of TmpFS in slurm.conf for this node) is 129186 MB, and it sends this value to slurmctld, which compares it with the value recorded in slurm.conf, that is, 204000 for that node.
>
> By the way, 129186 MB is very close to the size of /dev/shm…
>
> About the value returned by slurmd -C (500), it could be that /tmp is hardcoded somewhere.
> Regards,
>
> Pierre-Marie Le Biot
>
> From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
> Sent: Tuesday, October 10, 2017 4:33 PM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Pierre-Marie,
>
> Here is what I have in slurmd.log on tars-XXX:
>
> -sh-4.1$ sudo cat slurmd.log
> 2017-10-09T17:09:57.538636+02:00 tars-XXX slurmd[18597]: Message aggregation enabled: WindowMsgs=24, WindowTime=200
> 2017-10-09T17:09:57.647486+02:00 tars-XXX slurmd[18597]: CPU frequency setting not configured for this node
> 2017-10-09T17:09:57.647499+02:00 tars-XXX slurmd[18597]: Resource spec: Reserved system memory limit not configured for this node
> 2017-10-09T17:09:57.808352+02:00 tars-XXX slurmd[18597]: cgroup namespace 'freezer' is now mounted
> 2017-10-09T17:09:57.844400+02:00 tars-XXX slurmd[18597]: cgroup namespace 'cpuset' is now mounted
> 2017-10-09T17:09:57.902418+02:00 tars-XXX slurmd[18640]: slurmd version 16.05.9 started
> 2017-10-09T17:09:57.957030+02:00 tars-XXX slurmd[18640]: slurmd started on Mon, 09 Oct 2017 17:09:57 +0200
> 2017-10-09T17:09:57.957336+02:00 tars-XXX slurmd[18640]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
> Reply-To: slurm-dev <slurm-dev@schedmd.com>
> Date: Tuesday, 10 October 2017 at 15:20
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Véronique,
>
> This is not what I expected; I was thinking slurmd -C would return TmpDisk=204000, or more probably 129186, as seen in the slurmctld log.
>
> I suppose that you already checked the slurmd logs on tars-XXX?
>
> Regards,
>
> Pierre-Marie Le Biot
>
> From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
> Sent: Tuesday, October 10, 2017 2:09 PM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Hello Pierre-Marie,
>
> First, thank you for your hint. I just tried:
>
> >slurmd -C
> NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=0-20:50:54
>
> The value for TmpDisk is erroneous. I do not know what the cause of this can be, since the operating system's df command gives the right values:
>
> -sh-4.1$ df -hl
> Filesystem  Size  Used  Avail  Use%  Mounted on
> slash_root  3.5G  1.6G  1.9G   47%   /
> tmpfs       127G  0     127G   0%    /dev/shm
> tmpfs       500M  84K   500M   1%    /tmp
> /dev/sda1   200G  33M   200G   1%    /local/scratch
>
> Could slurmd be confusing a tmpfs with /local/scratch?
>
> I tried the same thing on another similar node (tars-XXX-1) and got:
>
> -sh-4.1$ df -hl
> Filesystem  Size  Used  Avail  Use%  Mounted on
> slash_root  3.5G  1.7G  1.8G   49%   /
> tmpfs       127G  0     127G   0%    /dev/shm
> tmpfs       500M  5.7M  495M   2%    /tmp
> /dev/sda1   200G  33M   200G   1%    /local/scratch
>
> and
>
> slurmd -C
> NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=101-21:34:14
>
> So, slurmd -C gives exactly the same answer, but this node does not go into the DRAIN state; it works perfectly.
>
> Thank you again for your help.
> Regards,
>
> Véronique
>
> From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
> Reply-To: slurm-dev <slurm-dev@schedmd.com>
> Date: Tuesday, 10 October 2017 at 13:53
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk
>
> Hi Véronique,
>
> Did you check the result of slurmd -C on tars-XXX?
>
> Regards,
>
> Pierre-Marie Le Biot
>
> From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
> Sent: Tuesday, October 10, 2017 12:02 PM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk
>
> Hello,
>
> I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).
>
> Here is what I have in slurm.conf:
>
> # COMPUTES
> TmpFS=/local/scratch
>
> # NODES
> GresTypes=disk,gpu
> ReturnToService=2
> NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
> NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20
>
> The node that has the trouble is tars-XXX.
>
> Here is what I have in gres.conf:
>
> # Local disk space in MB (/local/scratch)
> NodeName=tars-[ZZZ-UUU] Name=disk Count=204000
>
> XXX is in the range [ZZZ,UUU].
> If I ssh to tars-XXX, here is what I get:
>
> -sh-4.1$ df -hl
> Filesystem  Size  Used  Avail  Use%  Mounted on
> slash_root  3.5G  1.6G  1.9G   47%   /
> tmpfs       127G  0     127G   0%    /dev/shm
> tmpfs       500M  84K   500M   1%    /tmp
> /dev/sda1   200G  33M   200G   1%    /local/scratch
>
> /local/scratch is the directory for temporary storage.
>
> The problem is that when I do scontrol show node tars-XXX, I get:
>
> NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
>    AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
>    ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
>    Gres=disk:204000,gpu:0
>    NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
>    OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
>    BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]
>
> And in the slurmctld logs, I get the error message:
>
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument
>
> I tried rebooting tars-XXX yesterday, but the problem is still here. I also tried:
>
> scontrol update NodeName=ClusterNode0 State=Resume
>
> but the state went back to DRAIN after a while…
>
> Does anyone have an idea of what could cause the problem? My configuration files seem correct, and there really are 200G free in /local/scratch on tars-XXX…
>
> Thank you in advance for any help.
> Regards,
>
> Véronique