Hi,

The purpose of this mail is to warn others planning to upgrade an old
Slurm installation about a bug we ran into when upgrading one cluster
from 2.3 to 2.4. The bug affects installations using
JobAcctGatherType=jobacct_gather/linux if you have jobs running during
the upgrade.

We will not do any work to fix this bug in 2.4. Our remaining systems at
NSC that run Slurm 2.3 use JobAcctGatherType=jobacct_gather/none, so I
believe they will not hit this bug once we upgrade them. If anybody would
like to fix this for 2.4, they should look at Slurm 2.6.1 commit
e804c9bb8f3bd1f57c49f74d685328937df6654c, where the code in question has
already been made a lot more robust.


Several hours after our upgrade to 2.4, nodes that had jobs running
during the upgrade were suddenly set DOWN. The reason was that their
slurmds stopped responding, as they could no longer create any more
threads.

Example slurmd.log:
    [2013-10-29T20:48:55] active_threads == MAX_THREADS(130)

Debugging slurmd shows that 129 of the threads were waiting for a lock:

    Thread 9 (Thread 0x2ba77d642940 (LWP 32268)):
    #0  0x0000003e7da0d654 in __lll_lock_wait () from /lib64/libpthread.so.0
    #1  0x0000003e7da08f4a in _L_lock_1034 () from /lib64/libpthread.so.0
    #2  0x0000003e7da08e0c in pthread_mutex_lock () from /lib64/libpthread.so.0
    #3  0x00000000004acccd in jobacct_gather_g_create (jobacct_id=0x0) at slurm_jobacct_gather.c:325
    #4  0x00000000004c2194 in stepd_stat_jobacct (fd=253, sent=<value optimized out>, resp=0x11755e38) at stepd_api.c:989
    #5  0x000000000042a28d in _enforce_job_mem_limit () at req.c:1753
    #6  0x000000000042a721 in _rpc_ping (msg=0x117648f8) at req.c:1847
    #7  0x000000000042bd35 in slurmd_req (msg=0x721b40) at req.c:336
    #8  0x0000000000422250 in _service_connection (arg=<value optimized out>) at slurmd.c:525
    #9  0x0000003e7da0683d in start_thread () from /lib64/libpthread.so.0
    #10 0x0000003e7cad4f8d in clone () from /lib64/libc.so.6

The lock g_jobacct_gather_context_lock was held by the second oldest
thread (the oldest one was of course the slurmd "master" thread):

    Thread 131 (Thread 0x2ba775bc8940 (LWP 16744)):
    #0  0x0000003e7da0daab in read () from /lib64/libpthread.so.0
    #1  0x00000000004a9772 in read (jobacct=0x11727368, type=<value optimized out>, data=0x2ba775bc7d6c) at /usr/include/bits/unistd.h:35
    #2  jobacct_common_getinfo (jobacct=0x11727368, type=<value optimized out>, data=0x2ba775bc7d6c) at jobacct_common.c:216
    #3  0x00000000004aca70 in jobacct_gather_g_getinfo (jobacct=0x11727368, type=JOBACCT_DATA_PIPE, data=0x2ba775bc7d6c) at slurm_jobacct_gather.c:373
    #4  0x00000000004c22e7 in stepd_stat_jobacct (fd=9, sent=<value optimized out>, resp=0x11734da8) at stepd_api.c:996
    #5  0x000000000042a28d in _enforce_job_mem_limit () at req.c:1753
    #6  0x000000000042a721 in _rpc_ping (msg=0x11725d38) at req.c:1847
    #7  0x000000000042bd35 in slurmd_req (msg=0x9) at req.c:336
    #8  0x0000000000422250 in _service_connection (arg=<value optimized out>) at slurmd.c:525
    #9  0x0000003e7da0683d in start_thread () from /lib64/libpthread.so.0
    #10 0x0000003e7cad4f8d in clone () from /lib64/libc.so.6

So everything is waiting for a read(). Here is the code in question from
jobacct_common.c:jobacct_common_getinfo():

215:    case JOBACCT_DATA_PIPE:
216:            safe_read(*fd, jobacct, sizeof(struct jobacctinfo));
217:            break;

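The problem is that the jobacctinfo struct has changed (commit
468326c4d0afdc331320d69c02b612016ab4123e) and is bigger in 2.4 than it
was in 2.3. The 2.4 slurmd therefore asks for more bytes than a 2.3
slurmstepd ever writes, so the read never finishes. Since the thread
doing the read holds g_jobacct_gather_context_lock, every later thread
queues up behind it until MAX_THREADS is reached.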
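
To illustrate the size mismatch, here is a minimal sketch in C. It is
not Slurm code: the two structs are hypothetical stand-ins for the 2.3
and 2.4 layouts of struct jobacctinfo, and read_full_timeout() is a
helper I made up so that the demonstration gives up instead of hanging.
The writer sends the smaller struct over a pipe and keeps its end open,
as a running slurmstepd would, while the reader asks for the larger one.

```c
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-ins for struct jobacctinfo; NOT the real Slurm layout. */
struct acct_v23 { int pid; int max_rss; };                /* older, smaller */
struct acct_v24 { int pid; int max_rss; int new_field; }; /* newer, larger  */

/*
 * Try to read exactly len bytes from fd, but give up if no data arrives
 * within timeout_ms. Returns the number of bytes actually read (possibly
 * fewer than len on timeout or EOF), or -1 on error.
 */
static ssize_t read_full_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    size_t got = 0;

    while (got < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);
        ssize_t n;

        if (rc < 0)
            return -1;
        if (rc == 0)
            break;              /* timeout: the peer sent nothing more */
        n = read(fd, (char *)buf + got, len - got);
        if (n < 0)
            return -1;
        if (n == 0)
            break;              /* EOF: the peer closed its end */
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/*
 * Write the old-size struct into a pipe, then try to read the new-size
 * struct back. Returns the number of bytes the reader received, or -1.
 */
static ssize_t demo_size_mismatch(void)
{
    int fds[2];
    struct acct_v23 sent = { .pid = 1234, .max_rss = 42 };
    struct acct_v24 received;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    memset(&received, 0, sizeof(received));

    /* The write end stays open afterwards, like a still-running stepd. */
    if (write(fds[1], &sent, sizeof(sent)) != (ssize_t)sizeof(sent)) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Ask for the larger struct; only the smaller one will ever arrive. */
    n = read_full_timeout(fds[0], &received, sizeof(received), 100);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling demo_size_mismatch() should return sizeof(struct acct_v23), that
is, fewer bytes than the reader asked for. A read loop without the
timeout would instead sit in read() for as long as the writer keeps its
end of the pipe open, which matches what the backtrace of thread 131
shows.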

Regards,
Pär Lindfors, NSC
