I have the same problem, and there is nothing useful in the logs either. I'm using Slurm 2.4.3 on SL6.

Barbara

On 02/01/2013 01:28 PM, Mario Kadastik wrote:
> Hi,
>
> I just had slurmctld disappear out of the blue. A user reported that he
> cannot get job info and upon inspection slurmctld wasn't running. I started
> it again and looking at logs I see:
>
> /var/log/messages:
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started on
> cluster t2estonia
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>
> job-completion log doesn't show anything odd either:
> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2
> JobState=NODE_FAIL Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51 NodeList=wn-v-3944
> NodeCnt=0 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2
> JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53 NodeList=wn-v-7023
> NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2
> JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7001
> NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2
> JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7023
> NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>
> and as far as I know there's no other log.
>
> I haven't configured the backup controller yet, but might do now. Any ideas
> how to debug further why it died? Or is that something that is a known issue
> in 2.4.4?
>
> Thanks in advance,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it"
> -- Richard P. Feynman
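
In the /var/log/messages excerpt quoted above, the only visible trace of the crash is that the slurmctld PID changes from 21150 to 29606, with nothing logged in between. As a rough way to pin down when the daemon went away on a longer log, the following Python sketch scans syslog lines and reports every point where the slurmctld PID changes. It is only an illustration: the function names, the regular expression, and the `year` parameter (syslog timestamps carry no year) are assumptions of this sketch, not anything Slurm provides, and it assumes English month abbreviations in the timestamps.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649
# Group 1 is the timestamp, group 2 the slurmctld PID.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+slurmctld\[(\d+)\]:"
)


def parse_stamp(stamp, year):
    """Turn a syslog timestamp into a datetime; the year must be supplied."""
    normalized = " ".join(stamp.split())  # collapse the padding in "Feb  1"
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def find_restarts(lines, year=2013):
    """Return (old_pid, last_seen, new_pid, first_seen) for each PID change."""
    restarts = []
    last_pid = None
    last_seen = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a slurmctld line
        seen = parse_stamp(match.group(1), year)
        pid = int(match.group(2))
        if last_pid is not None and pid != last_pid:
            restarts.append((last_pid, last_seen, pid, seen))
        last_pid, last_seen = pid, seen
    return restarts


# Two lines taken from the log excerpt in this thread (unwrapped).
SAMPLE = [
    "Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651 "
    "NodeList=wn-v-3944 #CPUs=1",
    "Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 "
    "started on cluster t2estonia",
]

if __name__ == "__main__":
    for old_pid, last_seen, new_pid, first_seen in find_restarts(SAMPLE):
        print(f"PID {old_pid} last logged at {last_seen}; "
              f"PID {new_pid} first logged at {first_seen} "
              f"(silent for {first_seen - last_seen})")
```

The reported window only bounds when the daemon died (some time after the old PID's last message); it does not say why. It may still help to narrow down which other system logs to check for that period.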
