This is the first report of any issue like this. Even if you don't see anything useful in the logs, they are usually useful for reproducing or diagnosing the problem. If you could send your slurmctld.log and your slurm.conf, someone might be able to look through them to see if they can get a reproducer.
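In case it helps while going through the logs, here is a minimal Python sketch that lists the last slurmctld entry before each start-up message and the gap between them. It assumes syslog-style lines like the /var/log/messages excerpt in this thread and an English locale for month names. The function name find_restarts and the default year are made up for illustration (syslog timestamps carry no year); none of this is part of Slurm.

```python
import re
from datetime import datetime

# Matches syslog-style slurmctld lines, e.g.
#   Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started on cluster t2estonia
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"slurmctld\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
START_RE = re.compile(r"slurmctld version \S+ started")


def find_restarts(lines, year=2013):
    """Return (last_line_before, start_line, gap_seconds) for every
    slurmctld start-up message that follows earlier slurmctld output."""
    restarts = []
    prev = None  # (timestamp, stripped line) of the previous slurmctld entry
    for raw in lines:
        line = raw.strip()
        m = LINE_RE.match(line)
        if not m:
            continue  # not a slurmctld syslog line
        # The year is an assumption: syslog lines do not record it.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if prev is not None and START_RE.match(m["msg"]):
            gap = (ts - prev[0]).total_seconds()
            restarts.append((prev[1], line, gap))
        prev = (ts, line)
    return restarts
```

Calling find_restarts(open("/var/log/messages")) and printing each tuple shows what the daemon last logged before it went away, which is the part of the log worth sending along with slurm.conf.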
If you have an idea of how to reproduce it or anything that was happening on the system at the time of the crash that would be helpful as well. Keep in mind there will not be any future versions of 2.4. Any fix discovered would need to be backported manually.

Danny

Barbara Krasovec <[email protected]> wrote:
>
> I have the same problems, nothing useful in the logs.
> I'm using slurm 2.4.3 on SL6.
>
> Barbara
>
> On 02/01/2013 01:28 PM, Mario Kadastik wrote:
>> Hi,
>>
>> I just had slurmctld disappear out of the blue. A user reported that he cannot get job info and upon inspection slurmctld wasn't running. I started it again and looking at logs I see:
>>
>> /var/log/messages:
>> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649 NodeList=wn-v-3944 #CPUs=1
>> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650 NodeList=wn-v-3944 #CPUs=1
>> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651 NodeList=wn-v-3944 #CPUs=1
>> Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started on cluster t2estonia
>> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
>> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
>> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>>
>> job-completion log doesn't show anything odd either:
>> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880 StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51 NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
>> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>>
>> and as far as I know there's no other log.
>>
>> I haven't configured the backup controller yet, but might do now. Any ideas how to debug further why it died? Or is that something that is a known issue in 2.4.4?
>>
>> Thanks in advance,
>>
>> Mario Kadastik, PhD
>> Researcher
>>
>> ---
>> "Physics is like sex, sure it may have practical reasons, but that's not why we do it"
>> -- Richard P. Feynman
