Re: panic: Stray timeout
On 03.09.2015 20:24, Andriy Gapon wrote: > On 30/08/2015 22:05, Andriy Gapon wrote: >> On 30/08/2015 21:37, Oliver Pinter wrote: >>> You have a running Xorg with radeonkms driver, and this issue >>> occurrence under high load (for example parallel buildworld)? >> >> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it >> was >> involved. But there was no steady system load before the crash, perhaps >> there >> was a spike though. >> > > BTW: https://svnweb.freebsd.org/changeset/base/287354 Hi! I saw that and built a new kernel earlier today. I will keep you posted if something happens or not. Thank you :) -- Jean-Sébastien Pédron signature.asc Description: OpenPGP digital signature
Re: panic: Stray timeout
On 30/08/2015 22:05, Andriy Gapon wrote: > On 30/08/2015 21:37, Oliver Pinter wrote: >> You have a running Xorg with radeonkms driver, and this issue >> occurrence under high load (for example parallel buildworld)? > > I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it > was > involved. But there was no steady system load before the crash, perhaps there > was a spike though. > BTW: https://svnweb.freebsd.org/changeset/base/287354 -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic: Stray timeout
On 31/08/2015 22:10, Hans Petter Selasky wrote: > Hi Andriy, > > On 08/31/15 18:46, Andriy Gapon wrote: >> On 30/08/2015 22:09, Andriy Gapon wrote: >>> On 30/08/2015 19:16, Konstantin Belousov wrote: This is strange, I do not think that could be a right explanation of this issue. The taskqueue callout is initialized with the mutex, which means that the callout_stop() caller - must own the mutex; - is synchronous with the callout. In other words, callout cannot be running when taskqueue_cancel_timeout() calls callout_stop(), it only can be dequeued but the callout function is not yet called. If callout_stop() is performed meantime, between dropping callout_cpu lock, and locking the mutex, it must be not run. >>> >>> Thank you for the explanation. I am not familiar with the code and I >>> misinterpreted the manual page and thought that callout_stop() might be >>> unable >>> to stop the callout even if it was initialized with the mutex. I see my >>> mistake >>> now. >> >> I had to look at the code, of course. >> Could the following logic actually be buggy? >> >> int >> callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision, >> void (*ftn)(void *), void *arg, int cpu, int flags) >> { >> ... >> if (cc_exec_curr(cc, direct) == c) { >> /* >> * We're being asked to reschedule a callout which is >> * currently in progress. If there is a lock then we >> * can cancel the callout if it has not really started. >> */ >> if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) >> cancelled = cc_exec_cancel(cc, direct) = true; >> >> I have a suspicion that the condition should check for !cc_exec_cancel: >> - it seems that cc_exec_cancel is set to true when the callback execution >> starts >> - logically, it wouldn't make sense to check if X is true and then set it to >> true >> >> The code is quite hard to understand in a short time, so I could be >> misinterpreting what's going on, but I'd like to check my suspicion. > > While touching this topic, is the following code more readable or not (see the > callout_restart_async function): > > https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=287248=markup That code is a bit easier to read indeed. But it's quite similar - unsurprisingly, I guess. >> If this is a bug indeed, then it might make the following scenario possible: >> >> - softclock() is being executed in thread T1 and the control flow is in >> softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock) >> - thread T2 is executing callout_reset_sbt_on on the same callout as T1 >> - so, I think, T2 will see cc_exec_cancel == false and thus, because of the >> suspected bug, cc_exec_cancel will remain false >> - T2 will proceed to callout_cc_add() and will happily (re-)schedule the >> callout >> - after T2 releases the callout lock, T1 will continue in softclock_call_cc() >> and because cc_exec_cancel is false the callout's callback function will get >> called > > Correct. > >> >> In effect, there will be an extra call of the callout. >> If that's so that could explain the triggered assertion in >> taskqueue_timeout_func(). >> >> P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is >> not >> set and I will see if I can get anything interesting from it. >> >> P.P.S. On a further look, it seems that there is a bug indeed and it's >> probably >> just a typo that got introduced in r278469: >> - if (c->c_lock != NULL && >> !cc->cc_exec_entity[direct].cc_cancel) >> - cancelled = cc->cc_exec_entity[direct].cc_cancel = >> true; >> - if (cc->cc_exec_entity[direct].cc_waiting) { >> + if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) >> + cancelled = cc_exec_cancel(cc, direct) = true; >> + if (cc_exec_waiting(cc, direct)) { >> > > Looks like a bug to me trying to make the code more readable removed a not in > there! Yes, indeed. Thank you for reviewing! -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic: Stray timeout
On 30/08/2015 22:09, Andriy Gapon wrote: > On 30/08/2015 19:16, Konstantin Belousov wrote: >> This is strange, I do not think that could be a right explanation of this >> issue. The taskqueue callout is initialized with the mutex, which means >> that the callout_stop() caller >> - must own the mutex; >> - is synchronous with the callout. >> In other words, callout cannot be running when taskqueue_cancel_timeout() >> calls callout_stop(), it only can be dequeued but the callout function >> is not yet called. If callout_stop() is performed meantime, between >> dropping callout_cpu lock, and locking the mutex, it must be not run. > > Thank you for the explanation. I am not familiar with the code and I > misinterpreted the manual page and thought that callout_stop() might be unable > to stop the callout even if it was initialized with the mutex. I see my > mistake > now. I had to look at the code, of course. Could the following logic actually be buggy? int callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision, void (*ftn)(void *), void *arg, int cpu, int flags) { ... if (cc_exec_curr(cc, direct) == c) { /* * We're being asked to reschedule a callout which is * currently in progress. If there is a lock then we * can cancel the callout if it has not really started. */ if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) cancelled = cc_exec_cancel(cc, direct) = true; I have a suspicion that the condition should check for !cc_exec_cancel: - it seems that cc_exec_cancel is set to true when the callback execution starts - logically, it wouldn't make sense to check if X is true and then set it to true The code is quite hard to understand in a short time, so I could be misinterpreting what's going on, but I'd like to check my suspicion. If this is a bug indeed, then it might make the following scenario possible: - softclock() is being executed in thread T1 and the control flow is in softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock) - thread T2 is executing callout_reset_sbt_on on the same callout as T1 - so, I think, T2 will see cc_exec_cancel == false and thus, because of the suspected bug, cc_exec_cancel will remain false - T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout - after T2 releases the callout lock, T1 will continue in softclock_call_cc() and because cc_exec_cancel is false the callout's callback function will get called In effect, there will be an extra call of the callout. If that's so that could explain the triggered assertion in taskqueue_timeout_func(). P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not set and I will see if I can get anything interesting from it. P.P.S. On a further look, it seems that there is a bug indeed and it's probably just a typo that got introduced in r278469: - if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel) - cancelled = cc->cc_exec_entity[direct].cc_cancel = true; - if (cc->cc_exec_entity[direct].cc_waiting) { + if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) + cancelled = cc_exec_cancel(cc, direct) = true; + if (cc_exec_waiting(cc, direct)) { -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic: Stray timeout
Hi Andriy, On 08/31/15 18:46, Andriy Gapon wrote: On 30/08/2015 22:09, Andriy Gapon wrote: On 30/08/2015 19:16, Konstantin Belousov wrote: This is strange, I do not think that could be a right explanation of this issue. The taskqueue callout is initialized with the mutex, which means that the callout_stop() caller - must own the mutex; - is synchronous with the callout. In other words, callout cannot be running when taskqueue_cancel_timeout() calls callout_stop(), it only can be dequeued but the callout function is not yet called. If callout_stop() is performed meantime, between dropping callout_cpu lock, and locking the mutex, it must be not run. Thank you for the explanation. I am not familiar with the code and I misinterpreted the manual page and thought that callout_stop() might be unable to stop the callout even if it was initialized with the mutex. I see my mistake now. I had to look at the code, of course. Could the following logic actually be buggy? int callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision, void (*ftn)(void *), void *arg, int cpu, int flags) { ... if (cc_exec_curr(cc, direct) == c) { /* * We're being asked to reschedule a callout which is * currently in progress. If there is a lock then we * can cancel the callout if it has not really started. */ if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) cancelled = cc_exec_cancel(cc, direct) = true; I have a suspicion that the condition should check for !cc_exec_cancel: - it seems that cc_exec_cancel is set to true when the callback execution starts - logically, it wouldn't make sense to check if X is true and then set it to true The code is quite hard to understand in a short time, so I could be misinterpreting what's going on, but I'd like to check my suspicion. While touching this topic, is the following code more readable or not (see the callout_restart_async function): https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=287248=markup If this is a bug indeed, then it might make the following scenario possible: - softclock() is being executed in thread T1 and the control flow is in softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock) - thread T2 is executing callout_reset_sbt_on on the same callout as T1 - so, I think, T2 will see cc_exec_cancel == false and thus, because of the suspected bug, cc_exec_cancel will remain false - T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout - after T2 releases the callout lock, T1 will continue in softclock_call_cc() and because cc_exec_cancel is false the callout's callback function will get called Correct. In effect, there will be an extra call of the callout. If that's so that could explain the triggered assertion in taskqueue_timeout_func(). P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not set and I will see if I can get anything interesting from it. P.P.S. On a further look, it seems that there is a bug indeed and it's probably just a typo that got introduced in r278469: - if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel) - cancelled = cc->cc_exec_entity[direct].cc_cancel = true; - if (cc->cc_exec_entity[direct].cc_waiting) { + if (c->c_lock != NULL && cc_exec_cancel(cc, direct)) + cancelled = cc_exec_cancel(cc, direct) = true; + if (cc_exec_waiting(cc, direct)) { Looks like a bug to me trying to make the code more readable removed a not in there! --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic: Stray timeout
On 30.08.2015 21:05, Andriy Gapon wrote: > On 30/08/2015 21:37, Oliver Pinter wrote: >> You have a running Xorg with radeonkms driver, and this issue >> occurrence under high load (for example parallel buildworld)? > > I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it > was > involved. But there was no steady system load before the crash, perhaps there > was a spike though. I get this panic several times a week since around February this year. It appeared without any commit to TTM or the Radeon driver. It happens no matter the load for me. So far, I never found the cause, even though I spent a lot of time reading TTM, subr_taskqueue.c and kern_timeout.c. So, this is a simple "me too" :-/ -- Jean-Sébastien Pédron signature.asc Description: OpenPGP digital signature
Re: panic: Stray timeout
On 8/31/15, Jean-Sébastien Pédronwrote: > On 30.08.2015 21:05, Andriy Gapon wrote: >> On 30/08/2015 21:37, Oliver Pinter wrote: >>> You have a running Xorg with radeonkms driver, and this issue >>> occurrence under high load (for example parallel buildworld)? >> >> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue >> it was >> involved. But there was no steady system load before the crash, perhaps >> there >> was a spike though. > > I get this panic several times a week since around February this year. > It appeared without any commit to TTM or the Radeon driver. It happens > no matter the load for me. I got this panic, and I have somewhere a kernel and a coredump on my other machine. I got this panic only when I put my machine under heavy load and use X with randonkms driver. When I didn't use X and start the build under vt, I newer got this panic. > > So far, I never found the cause, even though I spent a lot of time > reading TTM, subr_taskqueue.c and kern_timeout.c. > > So, this is a simple "me too" :-/ > > -- > Jean-Sébastien Pédron > > ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic: Stray timeout
On 31.08.2015 12:58, Oliver Pinter wrote: > I got this panic, and I have somewhere a kernel and a coredump on my > other machine. I got this panic only when I put my machine under heavy > load and use X with randonkms driver. > When I didn't use X and start the build under vt, I newer got this panic. The function called by this taskqueue cleans up graphical buffer objects. vt(4) relies on a single buffer object which doesn't change for its entire life, that's why this TTM task is unused with vt(4). However, buffer objects are heavily used by the X server and Mesa. The task is executed several times a second on my laptop. -- Jean-Sébastien Pédron signature.asc Description: OpenPGP digital signature
Re: panic: Stray timeout
On 30/08/2015 21:37, Oliver Pinter wrote: You have a running Xorg with radeonkms driver, and this issue occurrence under high load (for example parallel buildworld)? I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it was involved. But there was no steady system load before the crash, perhaps there was a spike though. -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: panic: Stray timeout
On 30/08/2015 19:16, Konstantin Belousov wrote: This is strange, I do not think that could be a right explanation of this issue. The taskqueue callout is initialized with the mutex, which means that the callout_stop() caller - must own the mutex; - is synchronous with the callout. In other words, callout cannot be running when taskqueue_cancel_timeout() calls callout_stop(), it only can be dequeued but the callout function is not yet called. If callout_stop() is performed meantime, between dropping callout_cpu lock, and locking the mutex, it must be not run. Thank you for the explanation. I am not familiar with the code and I misinterpreted the manual page and thought that callout_stop() might be unable to stop the callout even if it was initialized with the mutex. I see my mistake now. -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: panic: Stray timeout
On 8/30/15, Andriy Gapon a...@freebsd.org wrote: I've got the following kernel panic seemingly at random. I have been using the kernel for about a week without any issues and I wasn't doing anything special when the panic occurred. Does this panic ring any bells? Could the problem be already fixed by more recent changes? r286985 panic: Stray timeout (kgdb) bt #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:260 #1 0x8063236f in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:328 #2 0x806329d4 in vpanic (fmt=value optimized out, ap=value optimized out) at /usr/src/sys/kern/kern_shutdown.c:508 #3 0x806326d3 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:441 #4 0x80677dea in taskqueue_timeout_func (arg=value optimized out) at /usr/src/sys/kern/subr_taskqueue.c:269 #5 0x8064858d in softclock_call_cc (c=0xfe000241cb50, cc=0x81066000, direct=0) at /usr/src/sys/kern/kern_timeout.c:722 #6 0x80648917 in softclock (arg=value optimized out) at /usr/src/sys/kern/kern_timeout.c:851 #7 0x805fe47f in intr_event_execute_handlers (p=0xf800059b0548, ie=0xf8000599d900) at /usr/src/sys/kern/kern_intr.c:1262 #8 0x805fefac in ithread_execute_handlers (p=0xf800059b0548, ie=0xf8000599d900) at /usr/src/sys/kern/kern_intr.c:1275 #9 0x805fee1b in ithread_loop (arg=value optimized out) at /usr/src/sys/kern/kern_intr.c:1356 #10 0x805fba9b in fork_exit (callout=0x805fedc0 ithread_loop, arg=0xf800059adb60, frame=0xfe01dc55bc00) at /usr/src/sys/kern/kern_fork.c:1006 #11 0x808073de in fork_trampoline () at /usr/src/sys/libkern/explicit_bzero.c:28 #12 0x in ?? () (kgdb) fr 5 #5 0x8064858d in softclock_call_cc (c=0xfe000241cb50, cc=0x81066000, direct=0) at /usr/src/sys/kern/kern_timeout.c:722 722 c_func(c_arg); (kgdb) p *c $1 = {c_links = {le = {le_next = 0x0, le_prev = 0x81066108}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x81066108}}, c_time = 171699241227799, c_precision = 2684354, c_arg = 0xfe000241cb28, c_func = 0x80677db0 taskqueue_timeout_func, c_lock = 0xf80013d66d30, c_flags = 2, c_iflags = 144, c_cpu = 0} c_flags = CALLOUT_ACTIVE c_iflags = CALLOUT_RETURNUNLOCKED | CALLOUT_PROCESSED (kgdb) p *cc $2 = {cc_lock = {lock_object = {lo_name = 0x809a0177 callout, lo_flags = 720896, lo_data = 0, lo_witness = 0xfe0001fd1400}, mtx_lock = 4}, cc_exec_entity = 0x81066080, cc_next = 0x0, cc_callout = 0xfe0002116000, cc_callwheel = 0xfe0002238000, cc_expireq = {tqh_first = 0x0, tqh_last = 0x81066108}, cc_callfree = {slh_first = 0xfe00022372c0}, cc_firstevent = 171699349138844, cc_lastscan = 171699243988142, cc_cookie = 0xf800059a9b00, cc_bucket = 10456, cc_inited = 1, cc_ktr_event_name = 0x81066140 callwheel cpu 0} (kgdb) p c_arg $3 = (void *) 0xfe000241cb28 (kgdb) down #4 0x80677dea in taskqueue_timeout_func (arg=value optimized out) at /usr/src/sys/kern/subr_taskqueue.c:269 269 KASSERT((timeout_task-f DT_CALLOUT_ARMED) != 0, (Stray timeout)); (kgdb) p *(struct timeout_task *)0xfe000241cb28 $4 = {q = 0xf80013d66d00, t = {ta_link = {stqe_next = 0x0}, ta_pending = 0, ta_priority = 0, ta_func = 0x82197220 ttm_bo_delayed_workqueue, ta_context = 0xfe000241c5c0}, c = {c_links = {le = {le_next = 0x0, le_prev = 0x81066108}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x81066108}}, c_time = 171699241227799, c_precision = 2684354, c_arg = 0xfe000241cb28, c_func = 0x80677db0 taskqueue_timeout_func, c_lock = 0xf80013d66d30, c_flags = 2, c_iflags = 144, c_cpu = 0}, f = 0} (kgdb) p *$4.q $5 = {tq_queue = {stqh_first = 0x0, stqh_last = 0xf80013d66d00}, tq_enqueue = 0x80678c30 taskqueue_thread_enqueue, tq_context = 0x811130d8, tq_active = {tqh_first = 0x0, tqh_last = 0xf80013d66d20}, tq_mutex = {lock_object = {lo_name = 0x809a52a1 taskqueue, lo_flags = 16973824, lo_data = 0, lo_witness = 0xfe0001fd7200}, mtx_lock = 18446735277710583024}, tq_threads = 0xf80013dd1bd0, tq_tcount = 1, tq_spin = 0, tq_flags = 5, tq_callouts = 1, tq_callbacks = 0xf80013d66d68, tq_cb_contexts = 0xf80013d66d78} BTW, I see the following potential problem. taskqueue_cancel_timeout() unconditionally resets DT_CALLOUT_ARMED, so that happens even if callout_stop() signals that the callout is going to be called. In that case, when taskqueue_timeout_func() gets called it's going to run into the assertion. You have a running Xorg with radeonkms driver, and this issue occurrence under high load (for example parallel buildworld)? -- Andriy Gapon ___