bug#37237: mcron randomly stops running jobs

2020-03-18 Thread Ludovic Courtès
Hi,

Robert Vollmert  skribis:

> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

When mcron is stuck, it’s apparently stuck in ‘pthread_join’, waiting
for the finalization thread to terminate:

--8<---cut here---start->8---
(gdb) bt
#0  0x7f7f730d8478 in __GI___pthread_timedjoin_ex 
(threadid=140185229293312, thread_return=thread_return@entry=0x0, 
abstime=abstime@entry=0x0, 
block=block@entry=true) at pthread_join_common.c:84
#1  0x7f7f730d82dc in __pthread_join (threadid=, 
thread_return=thread_return@entry=0x0) at pthread_join.c:24
#2  0x7f7f731d7508 in stop_finalization_thread () at finalizers.c:265
#3  0x7f7f731d7729 in scm_i_finalizer_pre_fork () at finalizers.c:290
#4  0x7f7f73250426 in scm_fork () at posix.c:1220
#5  0x7f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, 
vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#6  0x7f7f7324bfd7 in scm_call_n (proc=#, 
argv=argv@entry=0x7ffd9b10bed8, nargs=nargs@entry=1) at vm.c:1260
#7  0x7f7f731cdf58 in scm_call_1 (proc=, arg1=) at eval.c:485
#8  0x0040130c in inner_main ()
#9  0x7f7f731e5cfd in invoke_main_func (body_data=0x7ffd9b10c3b0) at 
init.c:341
#10 0x7f7f731c835a in c_body (d=0x7ffd9b10c2f0) at continuations.c:422
#11 0x7f7f7324979f in vm_regular_engine (thread=0x7f7f6acb79d0, 
vp=0x7f7f6b299f30, registers=0xafaf, resume=1930265720) at vm-engine.c:786
#12 0x7f7f7324bfd7 in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1260
#13 0x7f7f731cdf39 in scm_call_0 (proc=proc@entry=#) at eval.c:479
#14 0x7f7f7323ac12 in catch (tag=tag@entry=#t, thunk=#, 
handler=0x7f7f6b2a1420, 
pre_unwind_handler=0x7f7f6b2a1400) at throw.c:137
#15 0x7f7f7323aef5 in scm_catch_with_pre_unwind_handler (key=key@entry=#t, 
thunk=, handler=, pre_unwind_handler=)
at throw.c:254
#16 0x7f7f7323b0bf in scm_c_catch (tag=tag@entry=#t, 
body=body@entry=0x7f7f731c8350 , 
body_data=body_data@entry=0x7ffd9b10c2f0, 
handler=handler@entry=0x7f7f731c85e0 , 
handler_data=handler_data@entry=0x7ffd9b10c2f0, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 
, pre_unwind_handler_data=0x7f7f6b2939a0) at throw.c:377
#17 0x7f7f731c8940 in scm_i_with_continuation_barrier 
(body=body@entry=0x7f7f731c8350 , 
body_data=body_data@entry=0x7ffd9b10c2f0, 
handler=handler@entry=0x7f7f731c85e0 , 
handler_data=handler_data@entry=0x7ffd9b10c2f0, 
pre_unwind_handler=pre_unwind_handler@entry=0x7f7f731c8440 
, pre_unwind_handler_data=0x7f7f6b2939a0) at 
continuations.c:360
#18 0x7f7f731c89d5 in scm_c_with_continuation_barrier (func=, data=) at continuations.c:456
#19 0x7f7f7323984c in with_guile (base=base@entry=0x7ffd9b10c358, 
data=data@entry=0x7ffd9b10c380) at threads.c:661
#20 0x7f7f73127a68 in GC_call_with_stack_base (fn=fn@entry=0x7f7f73239800 
, arg=arg@entry=0x7ffd9b10c380) at misc.c:1941
#21 0x7f7f73239b68 in scm_i_with_guile (dynamic_state=, 
data=data@entry=0x7ffd9b10c380, func=func@entry=0x7f7f731e5ce0 
)
at threads.c:704
#22 scm_with_guile (func=func@entry=0x7f7f731e5ce0 , 
data=data@entry=0x7ffd9b10c3b0) at threads.c:710
#23 0x7f7f731e5e92 in scm_boot_guile (argc=12, argv=0x7ffd9b10c4f8, 
main_func=0x4012a0 , closure=0x0) at init.c:324
#24 0x0040118b in main ()
--8<---cut here---end--->8---

The finalization thread lives its life as if it hadn’t received the
“stop” message in its pipe:

--8<---cut here---start->8---
(gdb) thread 17
[Switching to thread 17 (Thread 0x7f7f6acb7700 (LWP 44975))]
#0  0x7f7f730e0344 in __libc_read (fd=5, buf=buf@entry=0x7f7f6acb6a40, 
nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
26  ../sysdeps/unix/sysv/linux/read.c: No such file or directory.
(gdb) bt
#0  0x7f7f730e0344 in __libc_read (fd=5, buf=buf@entry=0x7f7f6acb6a40, 
nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x7f7f731d7497 in read_finalization_pipe_data (data=0x7f7f6acb6a40) at 
finalizers.c:199
#2  0x7f7f7312d503 in GC_do_blocking_inner (data=0x7f7f6acb6a00 
"\200t\035s\177\177", context=) at pthread_support.c:1362
#3  0x7f7f73121d62 in GC_with_callee_saves_pushed (fn=0x7f7f7312d4c0 
, arg=arg@entry=0x7f7f6acb6a00 "\200t\035s\177\177") at 
mach_dep.c:328
#4  0x7f7f73127a9c in GC_do_blocking (fn=fn@entry=0x7f7f731d7480 
, client_data=client_data@entry=0x7f7f6acb6a40) at 
misc.c:2053
#5  0x7f7f73239bba in scm_without_guile (func=0x7f7f731d7480 
, d

bug#37237: mcron randomly stops running jobs

2020-02-23 Thread Ludovic Courtès
Hi,

Robert Vollmert  skribis:

> I have numerous mcron jobs configured, many of which run every 15 minutes,
> and one which runs once a minute. Just now, at 19:22, I noticed the minutely
> cron job didn’t seem to run, and saw that one every-15-minute job which always
> logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
> running again. There is no timezone confusion either: That every-15-minute job
> logged its first run at 19:30.

I’ve been bitten by this problem several times already.  I suspect the
patch at

helps with zombies that mcron tends to leave behind, but I’m not sure it
helps with this “stops running jobs” problem (I’d need to grab a C stack
trace of mcron when that happens.)

Anyway, let’s get this fixed!

Ludo’.





bug#37237: mcron randomly stops running jobs

2019-08-30 Thread Robert Vollmert
This is the third time I’ve seen this, and this time I’m
sure that nothing else happened.

I have numerous mcron jobs configured, many of which run every 15 minutes,
and one which runs once a minute. Just now, at 19:22, I noticed the minutely
cron job didn’t seem to run, and saw that one every-15-minute job which always
logs to mcron.log had last run at 18:00. I restarted mcron, and jobs started
running again. There is no timezone confusion either: That every-15-minute job
logged its first run at 19:30.