Hi Brian,

Sorry for the very late reply, but we had some trouble with our email for the past few weeks, so I was unable to reply sooner. I also hope I'm replying properly, as I'm quite new to mailing lists and to replying through digests.
Your suggestion might have been right on the money. I tried getting a reading of the used inodes with 'df -i', but it kept returning an IUse% of 0, which was particularly odd. In any case, I ended up zipping some big folders anyway, and that seems to have solved the problem.

Thank you so much for your help.

Best,

Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt

-----Original Message-----
From: Brian Andrus <[email protected]>
To: [email protected]
Subject: Re: [slurm-users] Slurm Crashing - File has zero size
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

You may have space, but do you have enough inodes? Those are two different things to look at when trying to see why you cannot write to a disk.

Also verify that it is writable by SlurmUser. If something happened and the filesystem automatically remounted itself as read-only, that can do it too.

Brian Andrus

On 10/28/2021 11:57 AM, Pedro Luiz de Castro wrote:
>
> Hello all
>
> Since yesterday we've been having some trouble with Slurm, where it
> crashes and isn't able to recover.
> I've managed to track the fault to a zero-sized file by launching
> slurmctld -Dvvvv:
>
> slurmctld: File
> /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero
> size
>
> That's the StateSaveLocation, so the environment file for this
> particular job is not getting created correctly.
> I don't believe it's a space issue, as there's about 2TB of free space
> on this mountpoint.
>
> It shouldn't be permissions either, as other jobs run fine and get completed.
>
> For now I've been launching slurmctld -i to work around this issue,
> killing the job in question.
> This way Slurm can still keep running for our users.
>
> Any ideas where I should look next to try and troubleshoot this issue?
>
> Thanks for all the help in advance.
>
> Best regards,
>
> *Pedro Luiz de Castro*
>
> IT Support & System Administrator
> Information Systems
>
> Faculdade de Medicina, Universidade de Lisboa
> Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
> iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
>
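
For anyone finding this thread later, the checks Brian suggests (free inodes, read-only remount, writability) can be sketched roughly as below. This is a minimal illustration, not anything from the thread itself: the `check_fs` helper name and the `/tmp` sample path are my own, and the thread's actual mount point was /mnt/nfs/lobo/IMM-NFS.

```shell
#!/bin/sh
# Hypothetical helper: run the three checks discussed above against a
# given mount point.

check_fs() {
    fs=$1
    echo "== block usage =="
    df -h "$fs"
    echo "== inode usage (IUse%) =="
    # Note: on some network filesystems, df -i reports 0 total inodes
    # on the client side, which may explain the odd IUse of 0% above.
    df -i "$fs"
    echo "== mount options (look for 'ro') =="
    findmnt -no OPTIONS --target "$fs"
    # Writability check for the *current* user; to test as SlurmUser,
    # run this script under that account.
    if [ -w "$fs" ]; then
        echo "writable by $(id -un)"
    else
        echo "NOT writable by $(id -un)"
    fi
}

check_fs /tmp
```

In practice one would point `check_fs` at the StateSaveLocation and run it as SlurmUser, since that is the account slurmctld writes state files with.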
