On 2/1/13 5:57 PM, Marcin Stolarek wrote:
Re: [slurm-dev] Re: slurmctld crash, no reason
2013/2/1 Barbara Krasovec <[email protected]>
I have the same problems, nothing useful in the logs.
I'm using slurm 2.4.3 on SL6.
Barbara
On 02/01/2013 01:28 PM, Mario Kadastik wrote:
> Hi,
>
> I just had slurmctld disappear out of the blue. A user reported
> that he cannot get job info and upon inspection slurmctld wasn't
> running. I started it again and looking at logs I see:
>
> /var/log/messages:
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105649 NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105650 NodeList=wn-v-3944 #CPUs=1
> Feb 1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate JobId=105651 NodeList=wn-v-3944 #CPUs=1
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version 2.4.4 started on cluster t2estonia
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64 nodes
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
> Feb 1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
>
> job-completion log doesn't show anything odd either:
> JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880 StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51 NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
> JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
> JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002) Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880 StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54 NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1 WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
>
> and as far as I know there's no other log.
>
> I haven't configured the backup controller yet, but might do
> now. Any ideas how to debug further why it died? Or is that
> something that is a known issue in 2.4.4?
>
> Thanks in advance,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but
> that's not why we do it"
> -- Richard P. Feynman
Are you monitoring memory usage on your hosts? For me the same thing
happened when we had used 100% of swap.
Maybe you should also consider increasing the log level?
cheers,
marcin
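
A minimal sketch of the swap check suggested above, assuming a Linux
host that exposes /proc/meminfo. The function names are hypothetical
and nothing here is part of Slurm; it only reports how much of the
configured swap is in use, so that a full-swap condition on the
controller host can be spotted before slurmctld goes away.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; the unit suffix is ignored.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_fraction(fields):
    """Return the fraction of swap in use (0.0 when no swap is configured)."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total


def read_swap_used_fraction(path="/proc/meminfo"):
    """Read the given meminfo file and return the swap fraction in use."""
    with open(path) as fh:
        return swap_used_fraction(parse_meminfo(fh.read()))
```

Calling read_swap_used_fraction() periodically (for example from cron)
and logging the result with a timestamp would show whether swap was
exhausted around the time the daemon disappeared.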
I actually see some errors about the clocks being out of sync in
slurmctld.log from the last time slurmctld crashed. It also crashed twice
before that, but I don't know the dates, and as far as I remember I
couldn't find any error messages in the logs at the time. I will send the
logs if it happens again.
I thought that slurmctld puts a node in the drain state if something is
wrong with it (e.g. the machine is swapping).
Cheers,
Barbara