I think it's a different problem: the bug I found was fixed in
v2.5.3, commit 395129114e7ca697bf8f459c399b7ba450f61f02.

.a.

On Wed, Mar 6, 2013 at 9:12 PM, Danny Auble <[email protected]> wrote:
>
> Do you happen to have a backtrace or anything from the logs on the matter?
>
>
> On 03/06/2013 08:57 AM, Lennart Karlsson wrote:
>> Hi,
>>
>> Today I upgraded SLURM from v 2.4.3 to v 2.5.3.
>>
>> That seems to have been a mistake, because slurmctld crashes. Any ideas
>> about what to do, other than downgrading back to 2.4.3?
>>
>> I think that I run a normal slurmctld -> (munge) -> slurmdbd -> MySQL
>> setup and it worked at the start: slurmdbd built some new tables
>> and jobs survived, but slurmctld went down after a while and now
>> refuses to keep living.
>>
>> An "strace -f" ends with:
>> [pid 18169] close(11)                   = 0
>> [pid 18169] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 
>> 0x35a060f500}, {SIG_DFL, [ALRM], SA_RESTORER, 0x35a060f500}, 8) = 0
>> [pid 18169] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 
>> 0x35a060f500}, {SIG_DFL, [], 0}, 8) = 0
>> [pid 18169] fcntl(8, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18169] fcntl(8, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18169] fcntl(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 18169] poll([{fd=8, events=POLLOUT}], 1, 60000) = 1 ([{fd=8, 
>> revents=POLLOUT}])
>> [pid 18169] recvfrom(8, 0x2b0bdc3016a0, 1, 0, 0, 0) = -1 EAGAIN (Resource 
>> temporarily unavailable)
>> [pid 18169] sendto(8, "\0\0\0\257", 4, 0, NULL, 0) = 4
>> [pid 18169] fcntl(8, F_SETFL, O_RDWR)   = 0
>> [pid 18169] fcntl(8, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18169] fcntl(8, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18169] fcntl(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 18169] poll([{fd=8, events=POLLOUT}], 1, 60000) = 1 ([{fd=8, 
>> revents=POLLOUT}])
>> [pid 18169] recvfrom(8, 0x2b0bdc3016a0, 1, 0, 0, 0) = -1 EAGAIN (Resource 
>> temporarily unavailable)
>> [pid 18169] sendto(8, 
>> "\31\0\0\0\37A\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\vauth/mun"..., 175, 0, 
>> NULL, 0) = 175
>> [pid 18169] fcntl(8, F_SETFL, O_RDWR)   = 0
>> [pid 18169] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER, 
>> 0x35a060f500}, {SIG_IGN, [PIPE], SA_RESTORER, 0x35a060f500}, 8) = 0
>> [pid 18169] close(8)                    = 0
>> [pid 18169] madvise(0x2b0bdc202000, 1028096, MADV_DONTNEED) = 0
>> [pid 18169] _exit(0)                    = ?
>> Process 18169 detached
>> [pid 18164] <... select resumed> )      = 1 (in [4])
>> [pid 18164] accept(4, {sa_family=AF_INET, sin_port=htons(51303), 
>> sin_addr=inet_addr("130.238.136.157")}, [16]) = 8
>> [pid 18164] clone(Process 18172 attached (waiting for parent)
>> Process 18172 resumed (parent 18164 ready)
>> child_stack=0x2b0bdc301ff0, 
>> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
>>  parent_tidptr=0x2b0bdc3029d0, tls=0x2b0bdc302700, 
>> child_tidptr=0x2b0bdc3029d0) = 18172
>> [pid 18172] set_robust_list(0x2b0bdc3029e0, 0x18) = 0
>> [pid 18172] fcntl(8, F_GETFL <unfinished ...>
>> [pid 18164] select(5, [4], NULL, NULL, NULL <unfinished ...>
>> [pid 18172] <... fcntl resumed> )       = 0x2 (flags O_RDWR)
>> [pid 18172] fcntl(8, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18172] fcntl(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 18172] poll([{fd=8, events=POLLIN}], 1, 60000 <unfinished ...>
>> [pid 18168] <... poll resumed> )        = 1 ([{fd=7, revents=POLLIN}])
>> [pid 18168] recvfrom(7, "\0\0\1\6", 4, 0, NULL, NULL) = 4
>> [pid 18168] fcntl(7, F_SETFL, O_RDWR)   = 0
>> [pid 18168] fcntl(7, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18168] fcntl(7, F_GETFL)           = 0x2 (flags O_RDWR)
>> [pid 18168] fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 18168] poll([{fd=7, events=POLLIN}], 1, 60000) = 1 ([{fd=7, 
>> revents=POLLIN}])
>> [pid 18168] recvfrom(7, 
>> "\30\0\0\0\3\352\0\0\0[\0\0\0\0\0\0\0\0\0\0\0\0\0\vauth/mun"..., 262, 0, 
>> NULL, NULL) = 262
>> [pid 18168] fcntl(7, F_SETFL, O_RDWR)   = 0
>> [pid 18168] stat("/var/run/munge/munge.socket.2", {st_mode=S_IFSOCK|0777, 
>> st_size=0, ...}) = 0
>> [pid 18168] socket(PF_FILE, SOCK_STREAM, 0) = 11
>> [pid 18168] fcntl(11, F_GETFL)          = 0x2 (flags O_RDWR)
>> [pid 18168] fcntl(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>> [pid 18168] connect(11, {sa_family=AF_FILE, 
>> path="/var/run/munge/munge.socket.2"}, 110) = 0
>> [pid 18168] writev(11, [{"\0`mK\4\4\0\0\0\0\204", 11}, 
>> {"\0\0\0\200MUNGE:AwQDAAA89DxchL99mxvooO"..., 132}], 2) = 143
>> [pid 18168] read(11, 0x2b0bdc200bf0, 11) = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> [pid 18168] poll([{fd=11, events=POLLIN}], 1, 3000) = 1 ([{fd=11, 
>> revents=POLLIN|POLLHUP}])
>> [pid 18170] <... poll resumed> )        = 1 ([{fd=9, revents=POLLIN}])
>> [pid 18168] read(11, "\0`mK\4\5\0\0\0\0+", 11) = 11
>> [pid 18168] read(11,  <unfinished ...>
>> [pid 18170] recvfrom(9,  <unfinished ...>
>> [pid 18168] <... read resumed> 
>> "\0\0\4\3\0\0\0\0\1,\4\202\356\210xQ7PEQ7PE\0\0\0\0\0\0\0\0\377"..., 43) = 43
>> [pid 18168] close(11)                   = 0
>> [pid 18168] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> Process 18168 detached
>> [pid 18172] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18171] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18170] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18166] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18165] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18163] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18162] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18161] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18160] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18159] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18156] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18155] +++ killed by SIGSEGV (core dumped) +++
>> [pid 18164] +++ killed by SIGSEGV (core dumped) +++
>> +++ killed by SIGSEGV (core dumped) +++
>>
>>
>> Log file /var/log/messages says:
>> Mar  6 15:18:35 kalkyl2 kernel: slurmctld[18150]: segfault at 0 ip 
>> 000000000043c2eb sp 00002ab4d8200820 error 4 in slurmctld[400
>>
>>
>> Best regards,
>> -- Lennart Karlsson, UPPMAX, Uppsala University, Sweden



-- 
[email protected]
GC3: Grid Computing Competence Center
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
