On 2/1/13 5:57 PM, Marcin Stolarek wrote:
Re: [slurm-dev] Re: slurmctld crash, no reason

2013/2/1 Barbara Krasovec <[email protected]>


    I have the same problems, nothing useful in the logs.
    I'm using slurm 2.4.3 on SL6.

    Barbara

    On 02/01/2013 01:28 PM, Mario Kadastik wrote:
    > Hi,
    >
    > I just had slurmctld disappear out of the blue. A user reported
    that he cannot get job info and upon inspection slurmctld wasn't
    running. I started it again and looking at logs I see:
    >
    > /var/log/messages:
    > Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate
    JobId=105649 NodeList=wn-v-3944 #CPUs=1
    > Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate
    JobId=105650 NodeList=wn-v-3944 #CPUs=1
    > Feb  1 12:32:55 slurm-1 slurmctld[21150]: sched: Allocate
    JobId=105651 NodeList=wn-v-3944 #CPUs=1
    > Feb  1 14:10:49 slurm-1 slurmctld[29606]: slurmctld version
    2.4.4 started on cluster t2estonia
    > Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered state of 64
    nodes
    > Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98440 9
    > Feb  1 14:10:49 slurm-1 slurmctld[29606]: Recovered job 98555 9
    >
    > job-completion log doesn't show anything odd either:
    > JobId=105651 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
    Name=CMS_CRAB2 JobState=NODE_FAIL Partition=main TimeLimit=2880
    StartTime=2013-02-01T11:51:59 EndTime=2013-02-01T12:28:51
    NodeList=wn-v-3944 NodeCnt=0 ProcCnt=1
    WorkDir=/home/joosep/singletop/stpol/crabs/step1_MC_Jan29
    > JobId=100499 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
    Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
    StartTime=2013-02-01T09:19:52 EndTime=2013-02-01T12:28:53
    NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
    WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
    > JobId=104537 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
    Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
    StartTime=2013-02-01T11:42:03 EndTime=2013-02-01T14:10:54
    NodeList=wn-v-7001 NodeCnt=1 ProcCnt=1
    WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
    > JobId=100505 UserId=joosep(1000006) GroupId=HEPUsers(1000002)
    Name=CMS_CRAB2 JobState=COMPLETED Partition=main TimeLimit=2880
    StartTime=2013-02-01T09:19:53 EndTime=2013-02-01T14:10:54
    NodeList=wn-v-7023 NodeCnt=1 ProcCnt=1
    WorkDir=/home/joosep/singletop/stpol/crabs/step1_Data_all
    >
    > and as far as I know there's no other log.
    >
    > I haven't configured the backup controller yet, but might do
    now. Any ideas how to debug further why it died? Or is that
    something that is a known issue in 2.4.4?
    >
    > Thanks in advance,
    >
    > Mario Kadastik, PhD
    > Researcher
    >
    > ---
    >    "Physics is like sex, sure it may have practical reasons, but
    that's not why we do it"
    >       -- Richard P. Feynman


Are you monitoring memory usage on your hosts? For me, this kind of situation happened when we had used 100% of swap.
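
A minimal sketch of such a swap check, assuming a Linux host that exposes /proc/meminfo. The function names and the 90% threshold are illustrative only and are not part of Slurm:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(info):
    """Return the fraction of swap in use, or 0.0 if no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


def check_swap(path="/proc/meminfo", threshold=0.9):
    """Read the given meminfo file and report whether swap use exceeds threshold.

    The 0.9 default is an arbitrary, hypothetical threshold.
    """
    with open(path) as f:
        frac = swap_used_fraction(parse_meminfo(f.read()))
    return frac, frac > threshold
```

Calling check_swap() periodically on the controller host (from cron, for example) and logging the result would show whether swap was close to exhausted around the time slurmctld disappeared.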

Maybe you should consider increasing the log level?

cheers,
marcin



I actually see some errors about the clocks being out of sync in slurmctld.log from the last time slurmctld crashed. It also crashed twice before, but I don't know the dates, and as far as I remember I couldn't find any error messages in the logs at the time. I will send the logs if it happens again.

I thought that slurmctld puts a node in the drain state if something is wrong (e.g. the machine is swapping).
Cheers,
Barbara


