On 01/06/2015 05:01 PM, Andy Riebs wrote:
We are seeing occasional SEGV's from squeue in 14.11.2 that we
hadn't seen previously. As near as we can tell, it might happen
when the reason for not scheduling jobs is longer than 32
characters, due to the particular request, such as
JobState=PENDING
Reason=ReqNodeNotAvail(Unavailable:noden[0692,0777,1788,1836])
Does this ring a bell?
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
Hi,
Today I tried to upgrade from version 14.03.7 to version 14.11.2 and seem
to have hit the same problem. A plain "squeue" command without parameters gives:
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
*** buffer overflow detected ***: squeue terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x2b5d8eb1e697]
/lib64/libc.so.6(+0x100580)[0x2b5d8eb1c580]
/lib64/libc.so.6(+0xffc7b)[0x2b5d8eb1bc7b]
/lib64/libc.so.6(__snprintf_chk+0x7a)[0x2b5d8eb1bb4a]
squeue(_print_job_reason_list+0x9d)[0x428c3d]
squeue[0x427735]
squeue(print_job_from_format+0x128)[0x428558]
squeue(slurm_list_for_each+0x4e)[0x4436fe]
squeue(print_jobs_array+0x566)[0x429856]
squeue[0x425142]
squeue(main+0x1f8)[0x4256a8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b5d8ea3ad5d]
squeue[0x424f79]
======= Memory map: ========
00400000-00549000 r-xp 00000000 fd:00 1321798
/usr/bin/squeue
00748000-0074f000 rw-p 00148000 fd:00 1321798
/usr/bin/squeue
0074f000-00753000 rw-p 00000000 00:00 0
017f2000-019de000 rw-p 00000000 00:00 0 [heap]
2b5d8e3d8000-2b5d8e3f8000 r-xp 00000000 fd:00 1179685
/lib64/ld-2.12.so
2b5d8e3f8000-2b5d8e3f9000 rw-p 00000000 00:00 0
2b5d8e5f7000-2b5d8e5f8000 r--p 0001f000 fd:00 1179685
/lib64/ld-2.12.so
2b5d8e5f8000-2b5d8e5f9000 rw-p 00020000 fd:00 1179685
/lib64/ld-2.12.so
2b5d8e5f9000-2b5d8e5fa000 rw-p 00000000 00:00 0
2b5d8e5fa000-2b5d8e5fc000 r-xp 00000000 fd:00 1179731
/lib64/libdl-2.12.so
2b5d8e5fc000-2b5d8e7fc000 ---p 00002000 fd:00 1179731
/lib64/libdl-2.12.so
2b5d8e7fc000-2b5d8e7fd000 r--p 00002000 fd:00 1179731
/lib64/libdl-2.12.so
2b5d8e7fd000-2b5d8e7fe000 rw-p 00003000 fd:00 1179731
/lib64/libdl-2.12.so
2b5d8e7fe000-2b5d8e815000 r-xp 00000000 fd:00 1179698
/lib64/libpthread-2.12.so
2b5d8e815000-2b5d8ea15000 ---p 00017000 fd:00 1179698
/lib64/libpthread-2.12.so
2b5d8ea15000-2b5d8ea16000 r--p 00017000 fd:00 1179698
/lib64/libpthread-2.12.so
2b5d8ea16000-2b5d8ea17000 rw-p 00018000 fd:00 1179698
/lib64/libpthread-2.12.so
2b5d8ea17000-2b5d8ea1c000 rw-p 00000000 00:00 0
2b5d8ea1c000-2b5d8eba6000 r-xp 00000000 fd:00 1179674
/lib64/libc-2.12.so
2b5d8eba6000-2b5d8eda6000 ---p 0018a000 fd:00 1179674
/lib64/libc-2.12.so
2b5d8eda6000-2b5d8edaa000 r--p 0018a000 fd:00 1179674
/lib64/libc-2.12.so
2b5d8edaa000-2b5d8edab000 rw-p 0018e000 fd:00 1179674
/lib64/libc-2.12.so
2b5d8edab000-2b5d8edb2000 rw-p 00000000 00:00 0
2b5d8edb2000-2b5d8edbe000 r-xp 00000000 fd:00 1181056
/lib64/libnss_files-2.12.so
2b5d8edbe000-2b5d8efbe000 ---p 0000c000 fd:00 1181056
/lib64/libnss_files-2.12.so
2b5d8efbe000-2b5d8efbf000 r--p 0000c000 fd:00 1181056
/lib64/libnss_files-2.12.so
2b5d8efbf000-2b5d8efc0000 rw-p 0000d000 fd:00 1181056
/lib64/libnss_files-2.12.so
2b5d8efc0000-2b5d8f083000 r-xp 00000000 fd:00 1183698
/lib64/libnss_db-2.2.3.so
2b5d8f083000-2b5d8f283000 ---p 000c3000 fd:00 1183698
/lib64/libnss_db-2.2.3.so
2b5d8f283000-2b5d8f285000 rw-p 000c3000 fd:00 1183698
/lib64/libnss_db-2.2.3.so
2b5d8f285000-2b5d8f28a000 r-xp 00000000 fd:00 1179688
/lib64/libnss_dns-2.12.so
2b5d8f28a000-2b5d8f489000 ---p 00005000 fd:00 1179688
/lib64/libnss_dns-2.12.so
2b5d8f489000-2b5d8f48a000 r--p 00004000 fd:00 1179688
/lib64/libnss_dns-2.12.so
2b5d8f48a000-2b5d8f48b000 rw-p 00005000 fd:00 1179688
/lib64/libnss_dns-2.12.so
2b5d8f48b000-2b5d8f4a1000 r-xp 00000000 fd:00 1181062
/lib64/libresolv-2.12.so
2b5d8f4a1000-2b5d8f6a1000 ---p 00016000 fd:00 1181062
/lib64/libresolv-2.12.so
2b5d8f6a1000-2b5d8f6a2000 r--p 00016000 fd:00 1181062
/lib64/libresolv-2.12.so
2b5d8f6a2000-2b5d8f6a3000 rw-p 00017000 fd:00 1181062
/lib64/libresolv-2.12.so
2b5d8f6a3000-2b5d8f6a5000 rw-p 00000000 00:00 0
2b5d8f6a5000-2b5d8f6a8000 r-xp 00000000 fd:00 1329386
/usr/lib64/slurm/auth_munge.so
2b5d8f6a8000-2b5d8f8a7000 ---p 00003000 fd:00 1329386
/usr/lib64/slurm/auth_munge.so
2b5d8f8a7000-2b5d8f8a8000 rw-p 00002000 fd:00 1329386
/usr/lib64/slurm/auth_munge.so
2b5d8f8a8000-2b5d8f8b0000 r-xp 00000000 fd:00 1321711
/usr/lib64/libmunge.so.2.0.0
2b5d8f8b0000-2b5d8fab0000 ---p 00008000 fd:00 1321711
/usr/lib64/libmunge.so.2.0.0
2b5d8fab0000-2b5d8fab1000 rw-p 00008000 fd:00 1321711
/usr/lib64/libmunge.so.2.0.0
2b5d8fab1000-2b5d8fab2000 rw-p 00000000 00:00 0
2b5d8fbd4000-2b5d8fcdc000 rw-p 00000000 00:00 0
2b5d8fcdc000-2b5d8fce7000 r-xp 00000000 fd:00 1329261
/usr/lib64/slurm/select_cray.so
2b5d8fce7000-2b5d8fee7000 ---p 0000b000 fd:00 1329261
/usr/lib64/slurm/select_cray.so
2b5d8fee7000-2b5d8fee8000 rw-p 0000b000 fd:00 1329261
/usr/lib64/slurm/select_cray.so
2b5d8fee8000-2b5d8fef1000 r-xp 00000000 fd:00 1321772
/usr/lib64/slurm/select_alps.so
2b5d8fef1000-2b5d900f0000 ---p 00009000 fd:00 1321772
/usr/lib64/slurm/select_alps.so
2b5d900f0000-2b5d900f1000 rw-p 00008000 fd:00 1321772
/usr/lib64/slurm/select_alps.so
2b5d900f1000-2b5d900f2000 rw-p 00000000 00:00 0
2b5d900f2000-2b5d90101000 r-xp 00000000 fd:00 1321773
/usr/lib64/slurm/select_bluegene.so
2b5d90101000-2b5d90301000 ---p 0000f000 fd:00 1321773
/usr/lib64/slurm/select_bluegene.so
2b5d90301000-2b5d90302000 rw-p 0000f000 fd:00 1321773
/usr/lib64/slurm/select_bluegene.so
2b5d90319000-2b5d90401000 r-xp 00000000 fd:00 1312091
/usr/lib64/libstdc++.so.6.0.13
2b5d90401000-2b5d90601000 ---p 000e8000 fd:00 1312091
/usr/lib64/libstdc++.so.6.0.13
2b5d90601000-2b5d90608000 r--p 000e8000 fd:00 1312091
/usr/lib64/libstdc++.so.6.0.13
2b5d90608000-2b5d9060a000 rw-p 000ef000 fd:00 1312091
/usr/lib64/libstdc++.so.6.0.13
2b5d9060a000-2b5d9061f000 rw-p 00000000 00:00 0
2b5d9061f000-2b5d906a2000 r-xp 00000000 fd:00 1181043
/lib64/libm-2.12.so
2b5d906a2000-2b5d908a1000 ---p 00083000 fd:00 1181043
/lib64/libm-2.12.so
2b5d908a1000-2b5d908a2000 r--p 00082000 fd:00 1181043
/lib64/libm-2.12.so
4208962 node q_timing lka PD 0:00 1
Aborted (core dumped)
And, yes, we do have long reasons with the new Slurm version. The examples below
come from a "scontrol show job|grep Reason=" command:
JobState=PENDING
Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200])
Dependency=(null)
[... the same three lines repeat for 13 more pending jobs ...]
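For anyone who wants to check whether their own queue holds jobs with such long reasons, here is a minimal sketch in Python. It only scans text in the "scontrol show job" format shown above for Reason= values longer than 32 characters. The 32-character limit is the guess made earlier in this thread, not a value confirmed from the Slurm source, and the function name long_reasons is made up for this example.

```python
import re
import sys

# Matches the value after "Reason=" up to the next whitespace.
REASON_RE = re.compile(r"Reason=(\S+)")

# Threshold suggested earlier in the thread; an assumption, not a
# value taken from the Slurm source code.
SUSPECT_LEN = 32


def long_reasons(scontrol_text, limit=SUSPECT_LEN):
    """Return the distinct Reason= values longer than `limit` characters,
    in the order they first appear in the text."""
    found = []
    for match in REASON_RE.finditer(scontrol_text):
        reason = match.group(1)
        if len(reason) > limit and reason not in found:
            found.append(reason)
    return found


if __name__ == "__main__":
    # Read "scontrol show job" output from standard input and print
    # each long reason together with its length.
    for reason in long_reasons(sys.stdin.read()):
        print(len(reason), reason)
```

Saved as, say, long_reasons.py (a hypothetical file name), it could be run as "scontrol show job | python3 long_reasons.py". An empty result would suggest that no pending job currently has a reason long enough to match the suspected trigger.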
Too bad, I will probably have to wait for a future version. Let me see if I can keep
the new slurmdbd version...
Best regards,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
http://uppmax.uu.se