FYI, this bug is reported in the Slurm Bugzilla:
http://bugs.schedmd.com/show_bug.cgi?id=1326

Trevor

> On Jan 7, 2015, at 1:32 PM, Fred Smith <[email protected]> wrote:
>
> We have the same problem with 14.11.2. Version 14.11.1 does not have that
> problem at all.
>
> Fred
>
> ----- Original Message -----
> From: Lennart Karlsson <[email protected]>
> To: slurm-dev <[email protected]>
> Cc:
> Sent: Wednesday, January 7, 2015 1:15 AM
> Subject: [slurm-dev] Re: squeue SEGV error in 14.11.2
>
> On 01/06/2015 05:01 PM, Andy Riebs wrote:
>> We are seeing occasional SEGVs from squeue in 14.11.2 that we
>> hadn't seen previously. As near as we can tell, it might happen
>> when the reason for not scheduling jobs is longer than 32
>> characters, due to the particular request, such as
>> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:noden[0692,0777,1788,1836])
>>
>> Does this ring a bell?
>>
>> Andy
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>
> Hi,
>
> Today I tried to upgrade from version 14.03.7 to version 14.11.2 and seem
> to get the same problem.
> A simple "squeue" command without parameters gives:
>
> # squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> *** buffer overflow detected ***: squeue terminated
> ======= Backtrace: =========
> /lib64/libc.so.6(__fortify_fail+0x37)[0x2b5d8eb1e697]
> /lib64/libc.so.6(+0x100580)[0x2b5d8eb1c580]
> /lib64/libc.so.6(+0xffc7b)[0x2b5d8eb1bc7b]
> /lib64/libc.so.6(__snprintf_chk+0x7a)[0x2b5d8eb1bb4a]
> squeue(_print_job_reason_list+0x9d)[0x428c3d]
> squeue[0x427735]
> squeue(print_job_from_format+0x128)[0x428558]
> squeue(slurm_list_for_each+0x4e)[0x4436fe]
> squeue(print_jobs_array+0x566)[0x429856]
> squeue[0x425142]
> squeue(main+0x1f8)[0x4256a8]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b5d8ea3ad5d]
> squeue[0x424f79]
> ======= Memory map: ========
> 00400000-00549000 r-xp 00000000 fd:00 1321798 /usr/bin/squeue
> 00748000-0074f000 rw-p 00148000 fd:00 1321798 /usr/bin/squeue
> 0074f000-00753000 rw-p 00000000 00:00 0
> 017f2000-019de000 rw-p 00000000 00:00 0 [heap]
> 2b5d8e3d8000-2b5d8e3f8000 r-xp 00000000 fd:00 1179685 /lib64/ld-2.12.so
> 2b5d8e3f8000-2b5d8e3f9000 rw-p 00000000 00:00 0
> 2b5d8e5f7000-2b5d8e5f8000 r--p 0001f000 fd:00 1179685 /lib64/ld-2.12.so
> 2b5d8e5f8000-2b5d8e5f9000 rw-p 00020000 fd:00 1179685 /lib64/ld-2.12.so
> 2b5d8e5f9000-2b5d8e5fa000 rw-p 00000000 00:00 0
> 2b5d8e5fa000-2b5d8e5fc000 r-xp 00000000 fd:00 1179731 /lib64/libdl-2.12.so
> 2b5d8e5fc000-2b5d8e7fc000 ---p 00002000 fd:00 1179731 /lib64/libdl-2.12.so
> 2b5d8e7fc000-2b5d8e7fd000 r--p 00002000 fd:00 1179731 /lib64/libdl-2.12.so
> 2b5d8e7fd000-2b5d8e7fe000 rw-p 00003000 fd:00 1179731 /lib64/libdl-2.12.so
> 2b5d8e7fe000-2b5d8e815000 r-xp 00000000 fd:00 1179698 /lib64/libpthread-2.12.so
> 2b5d8e815000-2b5d8ea15000 ---p 00017000 fd:00 1179698 /lib64/libpthread-2.12.so
> 2b5d8ea15000-2b5d8ea16000 r--p 00017000 fd:00 1179698 /lib64/libpthread-2.12.so
> 2b5d8ea16000-2b5d8ea17000 rw-p 00018000 fd:00 1179698 /lib64/libpthread-2.12.so
> 2b5d8ea17000-2b5d8ea1c000 rw-p 00000000 00:00 0
> 2b5d8ea1c000-2b5d8eba6000 r-xp 00000000 fd:00 1179674 /lib64/libc-2.12.so
> 2b5d8eba6000-2b5d8eda6000 ---p 0018a000 fd:00 1179674 /lib64/libc-2.12.so
> 2b5d8eda6000-2b5d8edaa000 r--p 0018a000 fd:00 1179674 /lib64/libc-2.12.so
> 2b5d8edaa000-2b5d8edab000 rw-p 0018e000 fd:00 1179674 /lib64/libc-2.12.so
> 2b5d8edab000-2b5d8edb2000 rw-p 00000000 00:00 0
> 2b5d8edb2000-2b5d8edbe000 r-xp 00000000 fd:00 1181056 /lib64/libnss_files-2.12.so
> 2b5d8edbe000-2b5d8efbe000 ---p 0000c000 fd:00 1181056 /lib64/libnss_files-2.12.so
> 2b5d8efbe000-2b5d8efbf000 r--p 0000c000 fd:00 1181056 /lib64/libnss_files-2.12.so
> 2b5d8efbf000-2b5d8efc0000 rw-p 0000d000 fd:00 1181056 /lib64/libnss_files-2.12.so
> 2b5d8efc0000-2b5d8f083000 r-xp 00000000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
> 2b5d8f083000-2b5d8f283000 ---p 000c3000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
> 2b5d8f283000-2b5d8f285000 rw-p 000c3000 fd:00 1183698 /lib64/libnss_db-2.2.3.so
> 2b5d8f285000-2b5d8f28a000 r-xp 00000000 fd:00 1179688 /lib64/libnss_dns-2.12.so
> 2b5d8f28a000-2b5d8f489000 ---p 00005000 fd:00 1179688 /lib64/libnss_dns-2.12.so
> 2b5d8f489000-2b5d8f48a000 r--p 00004000 fd:00 1179688 /lib64/libnss_dns-2.12.so
> 2b5d8f48a000-2b5d8f48b000 rw-p 00005000 fd:00 1179688 /lib64/libnss_dns-2.12.so
> 2b5d8f48b000-2b5d8f4a1000 r-xp 00000000 fd:00 1181062 /lib64/libresolv-2.12.so
> 2b5d8f4a1000-2b5d8f6a1000 ---p 00016000 fd:00 1181062 /lib64/libresolv-2.12.so
> 2b5d8f6a1000-2b5d8f6a2000 r--p 00016000 fd:00 1181062 /lib64/libresolv-2.12.so
> 2b5d8f6a2000-2b5d8f6a3000 rw-p 00017000 fd:00 1181062 /lib64/libresolv-2.12.so
> 2b5d8f6a3000-2b5d8f6a5000 rw-p 00000000 00:00 0
> 2b5d8f6a5000-2b5d8f6a8000 r-xp 00000000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
> 2b5d8f6a8000-2b5d8f8a7000 ---p 00003000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
> 2b5d8f8a7000-2b5d8f8a8000 rw-p 00002000 fd:00 1329386 /usr/lib64/slurm/auth_munge.so
> 2b5d8f8a8000-2b5d8f8b0000 r-xp 00000000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
> 2b5d8f8b0000-2b5d8fab0000 ---p 00008000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
> 2b5d8fab0000-2b5d8fab1000 rw-p 00008000 fd:00 1321711 /usr/lib64/libmunge.so.2.0.0
> 2b5d8fab1000-2b5d8fab2000 rw-p 00000000 00:00 0
> 2b5d8fbd4000-2b5d8fcdc000 rw-p 00000000 00:00 0
> 2b5d8fcdc000-2b5d8fce7000 r-xp 00000000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
> 2b5d8fce7000-2b5d8fee7000 ---p 0000b000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
> 2b5d8fee7000-2b5d8fee8000 rw-p 0000b000 fd:00 1329261 /usr/lib64/slurm/select_cray.so
> 2b5d8fee8000-2b5d8fef1000 r-xp 00000000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
> 2b5d8fef1000-2b5d900f0000 ---p 00009000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
> 2b5d900f0000-2b5d900f1000 rw-p 00008000 fd:00 1321772 /usr/lib64/slurm/select_alps.so
> 2b5d900f1000-2b5d900f2000 rw-p 00000000 00:00 0
> 2b5d900f2000-2b5d90101000 r-xp 00000000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
> 2b5d90101000-2b5d90301000 ---p 0000f000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
> 2b5d90301000-2b5d90302000 rw-p 0000f000 fd:00 1321773 /usr/lib64/slurm/select_bluegene.so
> 2b5d90319000-2b5d90401000 r-xp 00000000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
> 2b5d90401000-2b5d90601000 ---p 000e8000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
> 2b5d90601000-2b5d90608000 r--p 000e8000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
> 2b5d90608000-2b5d9060a000 rw-p 000ef000 fd:00 1312091 /usr/lib64/libstdc++.so.6.0.13
> 2b5d9060a000-2b5d9061f000 rw-p 00000000 00:00 0
> 2b5d9061f000-2b5d906a2000 r-xp 00000000 fd:00 1181043 /lib64/libm-2.12.so
> 2b5d906a2000-2b5d908a1000 ---p 00083000 fd:00 1181043 /lib64/libm-2.12.so
> 2b5d908a1000-2b5d908a2000 r--p 00082000 fd:00 1181043 /lib64/libm-2.12.so
> 4208962 node q_timing lka PD 0:00 1 Aborted (core dumped)
>
> And, yes, we have long reasons with the new Slurm version, the examples below
> coming from a "scontrol show job|grep Reason=" command:
>
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
> JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
>
> Too bad, I probably have to wait for a future version. Let me see if I can
> keep the new slurmdbd version...
>
> Best regards,
> -- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
> http://uppmax.uu.se
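
For anyone following along: the backtrace above aborts inside __snprintf_chk, called from _print_job_reason_list, which is glibc's fortified snprintf refusing a write whose stated size is larger than the destination buffer it knows about. The snippet below is only a minimal sketch of the safe pattern, not the actual squeue code. The name format_reason and the REASON_FIELD_LEN value of 32 are assumptions made up for illustration (32 is taken from Andy's guess about the length limit).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical field width; 32 mirrors the length guessed in the thread. */
#define REASON_FIELD_LEN 32

/*
 * Hypothetical helper, not part of Slurm.
 * Copies `reason` into `out`, whose real capacity is `out_len` bytes.
 * Returns 0 if the text fit, 1 if it was truncated, -1 on bad arguments.
 */
int format_reason(char *out, size_t out_len, const char *reason)
{
    if (out == NULL || out_len == 0 || reason == NULL)
        return -1;

    /* Pass the true capacity of `out`, never a larger constant. */
    int n = snprintf(out, out_len, "%s", reason);
    if (n < 0)
        return -1;

    /* snprintf always NUL-terminates when out_len > 0. */
    assert(strlen(out) < out_len);

    /* n is the length the full string would have needed. */
    return ((size_t)n >= out_len) ? 1 : 0;
}
```

Called with a buffer declared as char buf[REASON_FIELD_LEN] and sizeof(buf) as the size, a long reason such as the ReqNodeNotAvail(...) strings above is truncated to 31 characters instead of overrunning the buffer.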
