I have the same problem, and there is nothing useful in the logs.
I'm using Slurm 2.4.3 on SL6.

Barbara

On 02/01/2013 01:28 PM, Mario Kadastik wrote:
> Hi,
>
> I just had slurmctld disappear out of the blue. A user reported that he
> could not get job info, and upon inspection slurmctld wasn't running. I
> started it again, and looking at the logs I see:
>
> /var/log/messages:
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649 NodeList=wn-v-3944 #CPUs=1
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650 NodeList=wn-v-3944 #CPUs=1
> Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651 NodeList=wn-v-3944 #CPUs=1
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started on cluster t2estonia
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
> Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>
> The job-completion log doesn't show anything odd either:
> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880 StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51 NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>
> and as far as I know there's no other log.
>
> I haven't configured a backup controller yet, but might do so now. Any ideas
> on how to debug further why it died? Or is this a known issue in 2.4.4?
>
> Thanks in advance,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
>    "Physics is like sex, sure it may have practical reasons, but that's not 
> why we do it"
>       -- Richard P. Feynman
