On 01/06/2015 05:01 PM, Andy Riebs wrote:
    We are seeing occasional SEGV's from squeue in 14.11.2 that we
    hadn't seen previously. As near as we can tell, it might happen
    when the reason for not scheduling jobs is longer than 32
    characters, due to the particular request, such as
    JobState=PENDING
    Reason=ReqNodeNotAvail(Unavailable:noden[0692,0777,1788,1836])

    Does this ring a bell?

    Andy
    --
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP

Hi,

Today I tried to upgrade from version 14.03.7 to version 14.11.2 and seem
to have hit the same problem. A simple "squeue" command without parameters gives:

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
*** buffer overflow detected ***: squeue terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x2b5d8eb1e697]
/lib64/libc.so.6(+0x100580)[0x2b5d8eb1c580]
/lib64/libc.so.6(+0xffc7b)[0x2b5d8eb1bc7b]
/lib64/libc.so.6(__snprintf_chk+0x7a)[0x2b5d8eb1bb4a]
squeue(_print_job_reason_list+0x9d)[0x428c3d]
squeue[0x427735]
squeue(print_job_from_format+0x128)[0x428558]
squeue(slurm_list_for_each+0x4e)[0x4436fe]
squeue(print_jobs_array+0x566)[0x429856]
squeue[0x425142]
squeue(main+0x1f8)[0x4256a8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b5d8ea3ad5d]
squeue[0x424f79]
======= Memory map: ========
00400000-00549000 r-xp 00000000 fd:00 1321798                            /usr/bin/squeue
00748000-0074f000 rw-p 00148000 fd:00 1321798                            /usr/bin/squeue
0074f000-00753000 rw-p 00000000 00:00 0
017f2000-019de000 rw-p 00000000 00:00 0                                  [heap]
2b5d8e3d8000-2b5d8e3f8000 r-xp 00000000 fd:00 1179685                    /lib64/ld-2.12.so
2b5d8e3f8000-2b5d8e3f9000 rw-p 00000000 00:00 0
2b5d8e5f7000-2b5d8e5f8000 r--p 0001f000 fd:00 1179685                    /lib64/ld-2.12.so
2b5d8e5f8000-2b5d8e5f9000 rw-p 00020000 fd:00 1179685                    /lib64/ld-2.12.so
2b5d8e5f9000-2b5d8e5fa000 rw-p 00000000 00:00 0
2b5d8e5fa000-2b5d8e5fc000 r-xp 00000000 fd:00 1179731                    /lib64/libdl-2.12.so
2b5d8e5fc000-2b5d8e7fc000 ---p 00002000 fd:00 1179731                    /lib64/libdl-2.12.so
2b5d8e7fc000-2b5d8e7fd000 r--p 00002000 fd:00 1179731                    /lib64/libdl-2.12.so
2b5d8e7fd000-2b5d8e7fe000 rw-p 00003000 fd:00 1179731                    /lib64/libdl-2.12.so
2b5d8e7fe000-2b5d8e815000 r-xp 00000000 fd:00 1179698                    /lib64/libpthread-2.12.so
2b5d8e815000-2b5d8ea15000 ---p 00017000 fd:00 1179698                    /lib64/libpthread-2.12.so
2b5d8ea15000-2b5d8ea16000 r--p 00017000 fd:00 1179698                    /lib64/libpthread-2.12.so
2b5d8ea16000-2b5d8ea17000 rw-p 00018000 fd:00 1179698                    /lib64/libpthread-2.12.so
2b5d8ea17000-2b5d8ea1c000 rw-p 00000000 00:00 0
2b5d8ea1c000-2b5d8eba6000 r-xp 00000000 fd:00 1179674                    /lib64/libc-2.12.so
2b5d8eba6000-2b5d8eda6000 ---p 0018a000 fd:00 1179674                    /lib64/libc-2.12.so
2b5d8eda6000-2b5d8edaa000 r--p 0018a000 fd:00 1179674                    /lib64/libc-2.12.so
2b5d8edaa000-2b5d8edab000 rw-p 0018e000 fd:00 1179674                    /lib64/libc-2.12.so
2b5d8edab000-2b5d8edb2000 rw-p 00000000 00:00 0
2b5d8edb2000-2b5d8edbe000 r-xp 00000000 fd:00 1181056                    /lib64/libnss_files-2.12.so
2b5d8edbe000-2b5d8efbe000 ---p 0000c000 fd:00 1181056                    /lib64/libnss_files-2.12.so
2b5d8efbe000-2b5d8efbf000 r--p 0000c000 fd:00 1181056                    /lib64/libnss_files-2.12.so
2b5d8efbf000-2b5d8efc0000 rw-p 0000d000 fd:00 1181056                    /lib64/libnss_files-2.12.so
2b5d8efc0000-2b5d8f083000 r-xp 00000000 fd:00 1183698                    /lib64/libnss_db-2.2.3.so
2b5d8f083000-2b5d8f283000 ---p 000c3000 fd:00 1183698                    /lib64/libnss_db-2.2.3.so
2b5d8f283000-2b5d8f285000 rw-p 000c3000 fd:00 1183698                    /lib64/libnss_db-2.2.3.so
2b5d8f285000-2b5d8f28a000 r-xp 00000000 fd:00 1179688                    /lib64/libnss_dns-2.12.so
2b5d8f28a000-2b5d8f489000 ---p 00005000 fd:00 1179688                    /lib64/libnss_dns-2.12.so
2b5d8f489000-2b5d8f48a000 r--p 00004000 fd:00 1179688                    /lib64/libnss_dns-2.12.so
2b5d8f48a000-2b5d8f48b000 rw-p 00005000 fd:00 1179688                    /lib64/libnss_dns-2.12.so
2b5d8f48b000-2b5d8f4a1000 r-xp 00000000 fd:00 1181062                    /lib64/libresolv-2.12.so
2b5d8f4a1000-2b5d8f6a1000 ---p 00016000 fd:00 1181062                    /lib64/libresolv-2.12.so
2b5d8f6a1000-2b5d8f6a2000 r--p 00016000 fd:00 1181062                    /lib64/libresolv-2.12.so
2b5d8f6a2000-2b5d8f6a3000 rw-p 00017000 fd:00 1181062                    /lib64/libresolv-2.12.so
2b5d8f6a3000-2b5d8f6a5000 rw-p 00000000 00:00 0
2b5d8f6a5000-2b5d8f6a8000 r-xp 00000000 fd:00 1329386                    /usr/lib64/slurm/auth_munge.so
2b5d8f6a8000-2b5d8f8a7000 ---p 00003000 fd:00 1329386                    /usr/lib64/slurm/auth_munge.so
2b5d8f8a7000-2b5d8f8a8000 rw-p 00002000 fd:00 1329386                    /usr/lib64/slurm/auth_munge.so
2b5d8f8a8000-2b5d8f8b0000 r-xp 00000000 fd:00 1321711                    /usr/lib64/libmunge.so.2.0.0
2b5d8f8b0000-2b5d8fab0000 ---p 00008000 fd:00 1321711                    /usr/lib64/libmunge.so.2.0.0
2b5d8fab0000-2b5d8fab1000 rw-p 00008000 fd:00 1321711                    /usr/lib64/libmunge.so.2.0.0
2b5d8fab1000-2b5d8fab2000 rw-p 00000000 00:00 0
2b5d8fbd4000-2b5d8fcdc000 rw-p 00000000 00:00 0
2b5d8fcdc000-2b5d8fce7000 r-xp 00000000 fd:00 1329261                    /usr/lib64/slurm/select_cray.so
2b5d8fce7000-2b5d8fee7000 ---p 0000b000 fd:00 1329261                    /usr/lib64/slurm/select_cray.so
2b5d8fee7000-2b5d8fee8000 rw-p 0000b000 fd:00 1329261                    /usr/lib64/slurm/select_cray.so
2b5d8fee8000-2b5d8fef1000 r-xp 00000000 fd:00 1321772                    /usr/lib64/slurm/select_alps.so
2b5d8fef1000-2b5d900f0000 ---p 00009000 fd:00 1321772                    /usr/lib64/slurm/select_alps.so
2b5d900f0000-2b5d900f1000 rw-p 00008000 fd:00 1321772                    /usr/lib64/slurm/select_alps.so
2b5d900f1000-2b5d900f2000 rw-p 00000000 00:00 0
2b5d900f2000-2b5d90101000 r-xp 00000000 fd:00 1321773                    /usr/lib64/slurm/select_bluegene.so
2b5d90101000-2b5d90301000 ---p 0000f000 fd:00 1321773                    /usr/lib64/slurm/select_bluegene.so
2b5d90301000-2b5d90302000 rw-p 0000f000 fd:00 1321773                    /usr/lib64/slurm/select_bluegene.so
2b5d90319000-2b5d90401000 r-xp 00000000 fd:00 1312091                    /usr/lib64/libstdc++.so.6.0.13
2b5d90401000-2b5d90601000 ---p 000e8000 fd:00 1312091                    /usr/lib64/libstdc++.so.6.0.13
2b5d90601000-2b5d90608000 r--p 000e8000 fd:00 1312091                    /usr/lib64/libstdc++.so.6.0.13
2b5d90608000-2b5d9060a000 rw-p 000ef000 fd:00 1312091                    /usr/lib64/libstdc++.so.6.0.13
2b5d9060a000-2b5d9061f000 rw-p 00000000 00:00 0
2b5d9061f000-2b5d906a2000 r-xp 00000000 fd:00 1181043                    /lib64/libm-2.12.so
2b5d906a2000-2b5d908a1000 ---p 00083000 fd:00 1181043                    /lib64/libm-2.12.so
2b5d908a1000-2b5d908a2000 r--p 00082000 fd:00 1181043                    /lib64/libm-2.12.so
           4208962      node q_timing      lka PD       0:00      1 Aborted (core dumped)
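
The backtrace ends in __snprintf_chk, glibc's fortified snprintf, called from
_print_job_reason_list. That check aborts with "buffer overflow detected" when
the size passed to snprintf is larger than the destination buffer the compiler
can see. Below is a minimal sketch of a bounded alternative that truncates
instead of aborting. It is not Slurm's code: format_reason and REASON_BUF_LEN
are hypothetical names, and the 32-byte size is only taken from Andy's guess
about reasons longer than 32 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical size, taken from the 32-character guess in this thread. */
#define REASON_BUF_LEN 32

/*
 * Write "(reason)" into out, truncating if it does not fit.
 * Returns 1 if the text was truncated, 0 if it fit, -1 on a format error.
 * The size given to snprintf is the real capacity of out; giving it a
 * larger number is the case the fortified __snprintf_chk aborts on.
 */
static int format_reason(char *out, size_t out_len, const char *reason)
{
    int needed;

    assert(out != NULL && out_len > 0 && reason != NULL);

    /* snprintf writes at most out_len - 1 characters plus the NUL. */
    needed = snprintf(out, out_len, "(%s)", reason);
    if (needed < 0)
        return -1;

    /* The return value is the length the full text would have needed. */
    return (size_t)needed >= out_len ? 1 : 0;
}
```

A caller would declare `char buf[REASON_BUF_LEN];` and call
`format_reason(buf, sizeof(buf), reason)`, so that a long reason is cut short
rather than terminating the program.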



And, yes, we have long reasons with the new Slurm version. The examples below
come from a "scontrol show job|grep Reason=" command:
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
   JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:m[13,19,55,79,125,128,134,138,140,167,198,200]) Dependency=(null)
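
To check whether Reason= values like the ones above exceed a given width, the
field can be measured directly. The following is a rough sketch, not part of
Slurm: reason_length is a hypothetical helper, and it assumes the value
contains no whitespace and ends at the next blank or at the end of the line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the value following "Reason=" in line,
 * or 0 if the line has no Reason= field.
 * Assumption: the value ends at the first whitespace character.
 */
static size_t reason_length(const char *line)
{
    static const char key[] = "Reason=";
    const char *start;

    assert(line != NULL);

    start = strstr(line, key);
    if (start == NULL)
        return 0;

    /* Skip the key itself; sizeof counts the trailing NUL, so subtract 1. */
    start += sizeof(key) - 1;

    /* Count characters up to the first whitespace or the end of string. */
    return strcspn(start, " \t\r\n");
}
```

Any line for which `reason_length(line) > 32` would be a candidate for the
long-reason case Andy describes.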

Too bad, I will probably have to wait for a future version. Let me see if I can
keep the new slurmdbd version...

Best regards,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
   http://uppmax.uu.se
