Hi,

upon startup slurmctld changes its working directory to the directory that holds its log file. For example, if the log file is

SlurmctldLogFile=/var/tmp/slurm/slurmctld.log

the working directory is /var/tmp/slurm. Assuming your slurmctld dumped core for whatever reason, the core file should be there. The directory should be writable by the SlurmUser, since slurmctld changes its real user ID to that user.
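
If it helps, the check described above can be sketched in a few lines of Python. This is only an illustration, not anything shipped with SLURM: the helper names (parse_slurm_conf, core_dump_dir, writable_by, report) are made up for this example, the permission test looks only at the owner/group/other mode bits (no ACLs, no supplementary groups), and the fallback to root when SlurmUser is unset is an assumption based on the documented default.

```python
import os
import pwd
import stat


def parse_slurm_conf(text):
    """Collect Key=Value settings from slurm.conf-style text.

    Keys are lowercased so lookups do not depend on capitalisation.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def core_dump_dir(settings):
    """Directory of SlurmctldLogFile, or None if the option is unset."""
    log_file = settings.get("slurmctldlogfile")
    return os.path.dirname(log_file) if log_file else None


def writable_by(path, user):
    """Rough check that `user` can create files in directory `path`.

    Simplification: only the classic mode bits are inspected.
    """
    account = pwd.getpwnam(user)
    if account.pw_uid == 0:
        return True  # root is not limited by the mode bits
    info = os.stat(path)
    if info.st_uid == account.pw_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif info.st_gid == account.pw_gid:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (info.st_mode & need) == need


def report(conf_path):
    """Print where a core file would be expected and whether it could be written."""
    with open(conf_path) as handle:
        settings = parse_slurm_conf(handle.read())
    directory = core_dump_dir(settings)
    if directory is None:
        print("SlurmctldLogFile is not set; cannot infer the working directory")
        return
    user = settings.get("slurmuser", "root")  # assumed default when unset
    print(f"{directory}: writable by {user}: {writable_by(directory, user)}")
```

Run report() on the controller host with the path to your slurm.conf (the location varies by installation). With the SlurmctldLogFile setting from the example above, core_dump_dir returns /var/tmp/slurm.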

/David

On Fri, Feb 1, 2013 at 1:32 PM, Mario Kadastik <[email protected]> wrote:

>
> Hi,
>
> I just had slurmctld disappear out of the blue. A user reported that he
> cannot get job info and upon inspection slurmctld wasn't running. I started
> it again and looking at logs I see:
>
> /var/log/messages:
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651
> NodeList=wn-v-3944 #CPUs=1
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started
> on cluster t2estonia
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>
> job-completion log doesn't show anything odd either:
> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51
> NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53
> NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54
> NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54
> NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>
> and as far as I know there's no other log.
>
> I haven't configured the backup controller yet, but might do now. Any
> ideas how to debug further why it died? Or is that something that is a
> known issue in 2.4.4?
>
> Thanks in advance,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it"
> -- Richard P. Feynman
>
