We have the same problem with 14.11.2. Version 14.11.1 does not have that problem at all.
Fred

----- Original Message -----
From: Lennart Karlsson <[email protected]>
To: slurm-dev <[email protected]>
Cc:
Sent: Wednesday, January 7, 2015 1:15 AM
Subject: [slurm-dev] Re: squeue SEGV error in 14.11.2

On 01/06/2015 05:01 PM, Andy Riebs wrote:
> We are seeing occasional SEGV's from squeue in 14.11.2 that we
> hadn't seen previously. As near as we can tell, it might happen
> when the reason for not scheduling jobs is longer than 32
> characters, due to the particular request, such as
> JobState=PENDING
> Reason=ReqNodeNotAvail(Unavailable:noden[0692,0777,1788,1836])
>
> Does this ring a bell?
>
> Andy
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP

Hi,

Today I tried to upgrade from version 14.03.7 to version 14.11.2 and
seem to get the same problem. A simple "squeue" command without
parameters gives:

# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
*** buffer overflow detected ***: squeue terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x2b5d8eb1e697]
/lib64/libc.so.6(+0x100580)[0x2b5d8eb1c580]
/lib64/libc.so.6(+0xffc7b)[0x2b5d8eb1bc7b]
/lib64/libc.so.6(__snprintf_chk+0x7a)[0x2b5d8eb1bb4a]
squeue(_print_job_reason_list+0x9d)[0x428c3d]
squeue[0x427735]
squeue(print_job_from_format+0x128)[0x428558]
squeue(slurm_list_for_each+0x4e)[0x4436fe]
squeue(print_jobs_array+0x566)[0x429856]
squeue[0x425142]
squeue(main+0x1f8)[0x4256a8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b5d8ea3ad5d]
squeue[0x424f79]
======= Memory map: ========
00400000-00549000 r-xp 00000000 fd:00 1321798 /usr/bin/squeue
00748000-0074f000 rw-p 00148000 fd:00 1321798 /usr/bin/squeue
0074f000-00753000 rw-p 00000000 00:00 0
017f2000-019de000 rw-p 00000000 00:00 0 [heap]
2b5d8e3d8000-2b5d8e3f8000 r-xp 00000000 fd:00 1179685 /lib64/ld-2.12.so
2b5d8e3f8000-2b5d8e3f9000 rw-p 00000000 00:00 0
2b5d8e5f7000-2b5d8e5f8000 r--p 0001f000 fd:00 1179685 /lib64/ld-2.12.so
2b5d8e5f8000-2b5d8e5f9000 rw-p 00020000 fd:00 1179685 /lib64/ld-2.12.so
2b5d8e5f9000-2b5d8e5fa000 rw-p 00000000 00:00 0
2b5d8e5fa000-2b5d8e5fc000 r-xp 00000000 fd:00 1179731 /lib64/libdl-2.12.so
2b5d8e5fc000-2b5d8e7fc000 ---p 00002000 fd:00 1179731 /lib64/libdl-2.12.so
2b5d8e7fc000-2b5d8e7fd000 r--p 00002000 fd:00 1179731 /lib64/libdl-2.12.so
2b5d8e7fd000-2b5d8e7fe000 rw-p 00003000 fd:00 1179731 /lib64/libdl-2.12.so
2b5d8e7fe000-2b5d8e815000 r-xp 00000000 fd:00 1179698 /lib64/libpthread-2.12.so
2b5d8e815000-2b5d8ea15000 ---p 00017000 fd:00 1179698 /lib64/libpthread-2.12.so
2b5d8ea15000-2b5d8ea16000 r--p 00017000 fd:00 1179698 /lib64/libpthread-2.12.so
2b5d8ea16000-2b5d8ea17000 rw-p 00018000 fd:00 1179698 /lib64/libpthread-2.12.so
2b5d8ea17000-2b5d8ea1c000 rw-p 00000000 00:00 0
2b5d8ea1c000-2b5d8eba6000 r-xp 00000000 fd:00 1179674 /lib64/libc-2.12.so
2b5d8eba6000-2b5d8eda6000 ---p 0018a000 fd:00 1179674 /lib64/libc-2.12.so
2b5d8eda6000-2b5d8edaa000 r--p 0018a000 fd:00 1179674 /lib64/libc-2.12.so
2b5d8edaa000-2b5d8edab000 rw-p 0018e000 fd:00 1179674 /lib64/libc-2.12.so
2b5d8edab000-2b5d8edb2000 rw-p 00000000 00:00 0
2b5d8edb2000-2b5d8edbe000 r-xp 00000000 fd:00 1181056 /lib64/libnss_files-2.12.so
2b5d8edbe000-2b5d8efbe000 ---p 0000c000 fd:00 1181056 /lib64/libnss_files-2.12.so
2b5d8efbe000-2b5d8efbf000 r--p 0000c000 fd:00 1181056 /lib64/libnss_files-2.12.so
2b5d8efbf000-2b5d8efc0000 rw-p 0000d000 fd:00 1181056 /lib64/libnss_files-2.12.so
2b5d8efc0000-2b5d8f083000 r-xp 00000000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
2b5d8f083000-2b5d8f283000 ---p 000c3000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
2b5d8f283000-2b5d8f285000 rw-p 000c3000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
2b5d8f285000-2b5d8f28a000 r-xp 00000000 fd:00 1179688 /lib64/libnss_dns-2.12.so
2b5d8f28a000-2b5d8f489000 ---p 00005000 fd:00 1179688 /lib64/libnss_dns-2.12.so
2b5d8f489000-2b5d8f48a000 r--p 00004000 fd:00 1179688 /lib64/libnss_dns-2.12.so
2b5d8f48a000-2b5d8f48b000 rw-p 00005000 fd:00 1179688 /lib64/libnss_dns-2.12.so
2b5d8f48b000-2b5d8f4a1000 r-xp 00000000 fd:00 1181062 /lib64/libresolv-2.12.so
2b5d8f4a1000-2b5d8f6a1000 ---p 00016000 fd:00 1181062 /lib64/libresolv-2.12.so
2b5d8f6a1000-2b5d8f6a2000 r--p 00016000 fd:00 1181062 /lib64/libresolv-2.12.so
2b5d8f6a2000-2b5d8f6a3000 rw-p 00017000 fd:00 1181062 /lib64/libresolv-2.12.so
2b5d8f6a3000-2b5d8f6a5000 rw-p 00000000 00:00 0
2b5d8f6a5000-2b5d8f6a8000 r-xp 00000000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
2b5d8f6a8000-2b5d8f8a7000 ---p 00003000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
2b5d8f8a7000-2b5d8f8a8000 rw-p 00002000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
2b5d8f8a8000-2b5d8f8b0000 r-xp 00000000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
2b5d8f8b0000-2b5d8fab0000 ---p 00008000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
2b5d8fab0000-2b5d8fab1000 rw-p 00008000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
2b5d8fab1000-2b5d8fab2000 rw-p 00000000 00:00 0
2b5d8fbd4000-2b5d8fcdc000 rw-p 00000000 00:00 0
2b5d8fcdc000-2b5d8fce7000 r-xp 00000000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
2b5d8fce7000-2b5d8fee7000 ---p 0000b000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
2b5d8fee7000-2b5d8fee8000 rw-p 0000b000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
2b5d8fee8000-2b5d8fef1000 r-xp 00000000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
2b5d8fef1000-2b5d900f0000 ---p 00009000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
2b5d900f0000-2b5d900f1000 rw-p 00008000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
2b5d900f1000-2b5d900f2000 rw-p 00000000 00:00 0
2b5d900f2000-2b5d90101000 r-xp 00000000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
2b5d90101000-2b5d90301000 ---p 0000f000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
2b5d90301000-2b5d90302000 rw-p 0000f000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
2b5d90319000-2b5d90401000 r-xp 00000000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
2b5d90401000-2b5d90601000 ---p 000e8000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
2b5d90601000-2b5d90608000 r--p 000e8000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
2b5d90608000-2b5d9060a000 rw-p 000ef000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
2b5d9060a000-2b5d9061f000 rw-p 00000000 00:00 0
2b5d9061f000-2b5d906a2000 r-xp 00000000 fd:00 1181043 /lib64/libm-2.12.so
2b5d906a2000-2b5d908a1000 ---p 00083000 fd:00 1181043 /lib64/libm-2.12.so
2b5d908a1000-2b5d908a2000 r--p 00082000 fd:00 1181043 /lib64/libm-2.12.so
4208962 node q_timing lka PD 0:00 1 Aborted (core dumped)

And, yes, we have long reasons with the new Slurm version. The
examples below come from a "scontrol show job|grep Reason=" command:

JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)

Too bad, I probably have to wait for a future version.
Let me see if I can keep the new slurmdbd version...

Best regards,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
   http://uppmax.uu.se
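
For anyone who wants to check whether their own pending jobs fit the pattern described above, here is a minimal sketch in Python. It reads "scontrol show job" output and lists the distinct Reason values longer than 32 characters. The 32-character limit is only the hypothesis from the quoted report, not a value confirmed from the Slurm source. The function name long_reasons and the assumption that a Reason value contains no spaces are mine.

```python
import re
import sys

# Threshold taken from the hypothesis in the quoted report (32 characters).
# This is an assumption, not a value confirmed from the Slurm source code.
REASON_LIMIT = 32

# Matches the Reason= field in "scontrol show job" lines such as
# "JobState=PENDING Reason=... Dependency=(null)".
# Assumes the value itself contains no whitespace.
_REASON_RE = re.compile(r"\bReason=(\S+)")


def long_reasons(scontrol_output, limit=REASON_LIMIT):
    """Return distinct Reason values longer than `limit`, in first-seen order."""
    seen = []
    for match in _REASON_RE.finditer(scontrol_output):
        reason = match.group(1)
        if len(reason) > limit and reason not in seen:
            seen.append(reason)
    return seen


if __name__ == "__main__":
    # Read the scontrol output from standard input and print each
    # over-long reason together with its length.
    for reason in long_reasons(sys.stdin.read()):
        print(len(reason), reason)
```

Saved under a name of your choice (long_reasons.py here is just a placeholder), it could be run as "scontrol show job | python3 long_reasons.py". An empty result would suggest that long reasons are not what triggers the crash on that system.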
