Hi, upon startup slurmctld changes its working directory to where the log
file is.
If the log file is:
SlurmctldLogFile=/var/tmp/slurm/slurmctld.log
the working directory is /var/tmp/slurm. If your slurmctld dumped core
for whatever reason, the core file should be there.
The directory must be writable by the SlurmUser, since slurmctld changes
its real user ID to that user.
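
To find that directory without reading the config by hand, something like
the following rough sketch might help. It only pulls SlurmctldLogFile out of
the text of a slurm.conf and takes its directory part. The function names
are made up for illustration, the key is matched case-insensitively as an
assumption, and the writability check only tells you about the user running
the script, so it would have to be run as the SlurmUser to mean anything.

```python
import os
import re


def core_dump_dir(conf_text):
    """Return the directory part of SlurmctldLogFile, or None if it is not set.

    Hypothetical helper: slurmctld itself does not expose this function.
    """
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"(?i)SlurmctldLogFile\s*=\s*(\S+)", line)
        if match:
            return os.path.dirname(match.group(1))
    return None


def writable_by_current_user(path):
    """True if the user running this script can write to an existing directory."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


if __name__ == "__main__":
    conf = "SlurmctldLogFile=/var/tmp/slurm/slurmctld.log\n"
    directory = core_dump_dir(conf)
    print(directory)
    print(writable_by_current_user(directory))
```

Whether a core file actually gets written also depends on the core size
limit (ulimit -c) in effect for the daemon, so an empty directory does not
by itself prove that slurmctld exited cleanly.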

/David


On Fri, Feb 1, 2013 at 1:32 PM, Mario Kadastik <[email protected]> wrote:

>
> Hi,
>
> I just had slurmctld disappear out of the blue. A user reported that he
> cannot get job info and upon inspection slurmctld wasn't running. I started
> it again and looking at logs I see:
>
> /var/log/messages:
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649
> NodeList=wn-v-3944 #CPUs=1
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650
> NodeList=wn-v-3944 #CPUs=1
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651
> NodeList=wn-v-3944 #CPUs=1
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started
> on cluster t2estonia
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>
> job-completion log doesn't show anything odd either:
> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51
> NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53
> NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54
> NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
> Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
> StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54
> NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
> WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>
> and as far as I know there's no other log.
>
> I haven't configured the backup controller yet, but might do now. Any
> ideas how to debug further why it died? Or is that something that is a
> known issue in 2.4.4?
>
> Thanks in advance,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
>   "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it"
>      -- Richard P. Feynman
>
