Re: panic: Stray timeout

2015-09-03 Thread Jean-Sébastien Pédron
On 03.09.2015 20:24, Andriy Gapon wrote:
> On 30/08/2015 22:05, Andriy Gapon wrote:
>> On 30/08/2015 21:37, Oliver Pinter wrote:
>>> You have a running Xorg with radeonkms driver, and this issue
>>> occurrence under high load (for example parallel buildworld)?
>>
>> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it 
>> was
>> involved.  But there was no steady system load before the crash, perhaps 
>> there
>> was a spike though.
>>
> 
> BTW: https://svnweb.freebsd.org/changeset/base/287354

Hi!

I saw that and built a new kernel earlier today. I will keep you posted
if something happens or not.

Thank you :)

-- 
Jean-Sébastien Pédron



signature.asc
Description: OpenPGP digital signature


Re: panic: Stray timeout

2015-09-03 Thread Andriy Gapon
On 30/08/2015 22:05, Andriy Gapon wrote:
> On 30/08/2015 21:37, Oliver Pinter wrote:
>> You have a running Xorg with radeonkms driver, and this issue
>> occurrence under high load (for example parallel buildworld)?
> 
> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it 
> was
> involved.  But there was no steady system load before the crash, perhaps there
> was a spike though.
> 

BTW: https://svnweb.freebsd.org/changeset/base/287354

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: Stray timeout

2015-09-03 Thread Andriy Gapon
On 31/08/2015 22:10, Hans Petter Selasky wrote:
> Hi Andriy,
> 
> On 08/31/15 18:46, Andriy Gapon wrote:
>> On 30/08/2015 22:09, Andriy Gapon wrote:
>>> On 30/08/2015 19:16, Konstantin Belousov wrote:
 This is strange, I do not think that could be a right explanation of this
 issue.  The taskqueue callout is initialized with the mutex, which means
 that the callout_stop() caller
 - must own the mutex;
 - is synchronous with the callout.
 In other words, callout cannot be running when taskqueue_cancel_timeout()
 calls callout_stop(), it only can be dequeued but the callout function
 is not yet called.  If callout_stop() is performed meantime, between
 dropping callout_cpu lock, and locking the mutex, it must be not run.
>>>
>>> Thank you for the explanation.  I am not familiar with the code and I
>>> misinterpreted the manual page and thought that callout_stop() might be 
>>> unable
>>> to stop the callout even if it was initialized with the mutex.  I see my 
>>> mistake
>>> now.
>>
>> I had to look at the code, of course.
>> Could the following logic actually be buggy?
>>
>> int
>> callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
>>  void (*ftn)(void *), void *arg, int cpu, int flags)
>> {
>> ...
>>  if (cc_exec_curr(cc, direct) == c) {
>>  /*
>>   * We're being asked to reschedule a callout which is
>>   * currently in progress.  If there is a lock then we
>>   * can cancel the callout if it has not really started.
>>   */
>>  if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
>>  cancelled = cc_exec_cancel(cc, direct) = true;
>>
>> I have a suspicion that the condition should check for !cc_exec_cancel:
>> - it seems that cc_exec_cancel is set to true when the callback execution 
>> starts
>> - logically, it wouldn't make sense to check if X is true and then set it to 
>> true
>>
>> The code is quite hard to understand in a short time, so I could be
>> misinterpreting what's going on, but I'd like to check my suspicion.
> 
> While touching this topic, is the following code more readable or not (see the
> callout_restart_async function):
> 
> https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=287248=markup

That code is a bit easier to read indeed.  But it's quite similar -
unsurprisingly, I guess.

>> If this is a bug indeed, then it might make the following scenario possible:
>>
>> - softclock() is being executed in thread T1 and the control flow is in
>> softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock)
>> - thread T2 is executing callout_reset_sbt_on on the same callout as T1
>> - so, I think, T2 will see cc_exec_cancel == false and thus, because of the
>> suspected bug, cc_exec_cancel will remain false
>> - T2 will proceed to callout_cc_add() and will happily (re-)schedule the 
>> callout
>> - after T2 releases the callout lock, T1 will continue in softclock_call_cc()
>> and because cc_exec_cancel is false the callout's callback function will get
>> called
> 
> Correct.
> 
>>
>> In effect, there will be an extra call of the callout.
>> If that's so that could explain the triggered assertion in
>> taskqueue_timeout_func().
>>
>> P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is 
>> not
>> set and I will see if I can get anything interesting from it.
>>
>> P.P.S. On a further look, it seems that there is a bug indeed and it's 
>> probably
>> just a typo that got introduced in r278469:
>> -   if (c->c_lock != NULL && 
>> !cc->cc_exec_entity[direct].cc_cancel)
>> -   cancelled = cc->cc_exec_entity[direct].cc_cancel = 
>> true;
>> -   if (cc->cc_exec_entity[direct].cc_waiting) {
>> +   if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
>> +   cancelled = cc_exec_cancel(cc, direct) = true;
>> +   if (cc_exec_waiting(cc, direct)) {
>>
> 
> Looks like a bug to me trying to make the code more readable removed a not in
> there!

Yes, indeed.  Thank you for reviewing!

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: Stray timeout

2015-08-31 Thread Andriy Gapon
On 30/08/2015 22:09, Andriy Gapon wrote:
> On 30/08/2015 19:16, Konstantin Belousov wrote:
>> This is strange, I do not think that could be a right explanation of this
>> issue.  The taskqueue callout is initialized with the mutex, which means
>> that the callout_stop() caller
>> - must own the mutex;
>> - is synchronous with the callout.
>> In other words, callout cannot be running when taskqueue_cancel_timeout()
>> calls callout_stop(), it only can be dequeued but the callout function
>> is not yet called.  If callout_stop() is performed meantime, between
>> dropping callout_cpu lock, and locking the mutex, it must be not run.
> 
> Thank you for the explanation.  I am not familiar with the code and I
> misinterpreted the manual page and thought that callout_stop() might be unable
> to stop the callout even if it was initialized with the mutex.  I see my 
> mistake
> now.

I had to look at the code, of course.
Could the following logic actually be buggy?

int
callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
void (*ftn)(void *), void *arg, int cpu, int flags)
{
...
if (cc_exec_curr(cc, direct) == c) {
/*
 * We're being asked to reschedule a callout which is
 * currently in progress.  If there is a lock then we
 * can cancel the callout if it has not really started.
 */
if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
cancelled = cc_exec_cancel(cc, direct) = true;

I have a suspicion that the condition should check for !cc_exec_cancel:
- it seems that cc_exec_cancel is set to true when the callback execution starts
- logically, it wouldn't make sense to check if X is true and then set it to 
true

The code is quite hard to understand in a short time, so I could be
misinterpreting what's going on, but I'd like to check my suspicion.

If this is a bug indeed, then it might make the following scenario possible:

- softclock() is being executed in thread T1 and the control flow is in
softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock)
- thread T2 is executing callout_reset_sbt_on on the same callout as T1
- so, I think, T2 will see cc_exec_cancel == false and thus, because of the
suspected bug, cc_exec_cancel will remain false
- T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout
- after T2 releases the callout lock, T1 will continue in softclock_call_cc()
and because cc_exec_cancel is false the callout's callback function will get 
called

In effect, there will be an extra call of the callout.
If that's so that could explain the triggered assertion in 
taskqueue_timeout_func().

P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not
set and I will see if I can get anything interesting from it.

P.P.S. On a further look, it seems that there is a bug indeed and it's probably
just a typo that got introduced in r278469:
-   if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel)
-   cancelled = cc->cc_exec_entity[direct].cc_cancel = true;
-   if (cc->cc_exec_entity[direct].cc_waiting) {
+   if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
+   cancelled = cc_exec_cancel(cc, direct) = true;
+   if (cc_exec_waiting(cc, direct)) {

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: Stray timeout

2015-08-31 Thread Hans Petter Selasky

Hi Andriy,

On 08/31/15 18:46, Andriy Gapon wrote:

On 30/08/2015 22:09, Andriy Gapon wrote:

On 30/08/2015 19:16, Konstantin Belousov wrote:

This is strange, I do not think that could be a right explanation of this
issue.  The taskqueue callout is initialized with the mutex, which means
that the callout_stop() caller
- must own the mutex;
- is synchronous with the callout.
In other words, callout cannot be running when taskqueue_cancel_timeout()
calls callout_stop(), it only can be dequeued but the callout function
is not yet called.  If callout_stop() is performed meantime, between
dropping callout_cpu lock, and locking the mutex, it must be not run.


Thank you for the explanation.  I am not familiar with the code and I
misinterpreted the manual page and thought that callout_stop() might be unable
to stop the callout even if it was initialized with the mutex.  I see my mistake
now.


I had to look at the code, of course.
Could the following logic actually be buggy?

int
callout_reset_sbt_on(struct callout *c, sbintime_t sbt, sbintime_t precision,
 void (*ftn)(void *), void *arg, int cpu, int flags)
{
...
 if (cc_exec_curr(cc, direct) == c) {
 /*
  * We're being asked to reschedule a callout which is
  * currently in progress.  If there is a lock then we
  * can cancel the callout if it has not really started.
  */
 if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
 cancelled = cc_exec_cancel(cc, direct) = true;

I have a suspicion that the condition should check for !cc_exec_cancel:
- it seems that cc_exec_cancel is set to true when the callback execution starts
- logically, it wouldn't make sense to check if X is true and then set it to 
true

The code is quite hard to understand in a short time, so I could be
misinterpreting what's going on, but I'd like to check my suspicion.


While touching this topic, is the following code more readable or not 
(see the callout_restart_async function):


https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?revision=287248=markup



If this is a bug indeed, then it might make the following scenario possible:

- softclock() is being executed in thread T1 and the control flow is in
softclock_call_cc() after CC_UNLOCK() and before lc_lock(c_lock)
- thread T2 is executing callout_reset_sbt_on on the same callout as T1
- so, I think, T2 will see cc_exec_cancel == false and thus, because of the
suspected bug, cc_exec_cancel will remain false
- T2 will proceed to callout_cc_add() and will happily (re-)schedule the callout
- after T2 releases the callout lock, T1 will continue in softclock_call_cc()
and because cc_exec_cancel is false the callout's callback function will get 
called


Correct.



In effect, there will be an extra call of the callout.
If that's so that could explain the triggered assertion in 
taskqueue_timeout_func().

P.S. I am going to add an assertion in callout_cc_add that CALLOUT_ACTIVE is not
set and I will see if I can get anything interesting from it.

P.P.S. On a further look, it seems that there is a bug indeed and it's probably
just a typo that got introduced in r278469:
-   if (c->c_lock != NULL && !cc->cc_exec_entity[direct].cc_cancel)
-   cancelled = cc->cc_exec_entity[direct].cc_cancel = true;
-   if (cc->cc_exec_entity[direct].cc_waiting) {
+   if (c->c_lock != NULL && cc_exec_cancel(cc, direct))
+   cancelled = cc_exec_cancel(cc, direct) = true;
+   if (cc_exec_waiting(cc, direct)) {



Looks like a bug to me trying to make the code more readable removed a 
not in there!


--HPS

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: panic: Stray timeout

2015-08-31 Thread Jean-Sébastien Pédron
On 30.08.2015 21:05, Andriy Gapon wrote:
> On 30/08/2015 21:37, Oliver Pinter wrote:
>> You have a running Xorg with radeonkms driver, and this issue
>> occurrence under high load (for example parallel buildworld)?
> 
> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it 
> was
> involved.  But there was no steady system load before the crash, perhaps there
> was a spike though.

I get this panic several times a week since around February this year.
It appeared without any commit to TTM or the Radeon driver. It happens
no matter the load for me.

So far, I never found the cause, even though I spent a lot of time
reading TTM, subr_taskqueue.c and kern_timeout.c.

So, this is a simple "me too" :-/

-- 
Jean-Sébastien Pédron



signature.asc
Description: OpenPGP digital signature


Re: panic: Stray timeout

2015-08-31 Thread Oliver Pinter
On 8/31/15, Jean-Sébastien Pédron  wrote:
> On 30.08.2015 21:05, Andriy Gapon wrote:
>> On 30/08/2015 21:37, Oliver Pinter wrote:
>>> You have a running Xorg with radeonkms driver, and this issue
>>> occurrence under high load (for example parallel buildworld)?
>>
>> I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue
>> it was
>> involved.  But there was no steady system load before the crash, perhaps
>> there
>> was a spike though.
>
> I get this panic several times a week since around February this year.
> It appeared without any commit to TTM or the Radeon driver. It happens
> no matter the load for me.

I got this panic, and I have somewhere a kernel and a coredump on my
other machine. I got this panic only when I put my machine under heavy
load and use X with randonkms driver.
When I didn't use X and start the build under vt, I newer got this panic.

>
> So far, I never found the cause, even though I spent a lot of time
> reading TTM, subr_taskqueue.c and kern_timeout.c.
>
> So, this is a simple "me too" :-/
>
> --
> Jean-Sébastien Pédron
>
>
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: panic: Stray timeout

2015-08-31 Thread Jean-Sébastien Pédron
On 31.08.2015 12:58, Oliver Pinter wrote:
> I got this panic, and I have somewhere a kernel and a coredump on my
> other machine. I got this panic only when I put my machine under heavy
> load and use X with randonkms driver.
> When I didn't use X and start the build under vt, I newer got this panic.

The function called by this taskqueue cleans up graphical buffer
objects. vt(4) relies on a single buffer object which doesn't change for
its entire life, that's why this TTM task is unused with vt(4).

However, buffer objects are heavily used by the X server and Mesa. The
task is executed several times a second on my laptop.

-- 
Jean-Sébastien Pédron



signature.asc
Description: OpenPGP digital signature


Re: panic: Stray timeout

2015-08-30 Thread Andriy Gapon
On 30/08/2015 21:37, Oliver Pinter wrote:
 You have a running Xorg with radeonkms driver, and this issue
 occurrence under high load (for example parallel buildworld)?

I use radonkms indeed and judging from ta_func = ttm_bo_delayed_workqueue it was
involved.  But there was no steady system load before the crash, perhaps there
was a spike though.

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: panic: Stray timeout

2015-08-30 Thread Andriy Gapon
On 30/08/2015 19:16, Konstantin Belousov wrote:
 This is strange, I do not think that could be a right explanation of this
 issue.  The taskqueue callout is initialized with the mutex, which means
 that the callout_stop() caller
 - must own the mutex;
 - is synchronous with the callout.
 In other words, callout cannot be running when taskqueue_cancel_timeout()
 calls callout_stop(), it only can be dequeued but the callout function
 is not yet called.  If callout_stop() is performed meantime, between
 dropping callout_cpu lock, and locking the mutex, it must be not run.

Thank you for the explanation.  I am not familiar with the code and I
misinterpreted the manual page and thought that callout_stop() might be unable
to stop the callout even if it was initialized with the mutex.  I see my mistake
now.

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: panic: Stray timeout

2015-08-30 Thread Oliver Pinter
On 8/30/15, Andriy Gapon a...@freebsd.org wrote:

 I've got the following kernel panic seemingly at random.
 I have been using the kernel for about a week without any issues and I
 wasn't
 doing anything special when the panic occurred.
 Does this panic ring any bells?  Could the problem be already fixed by more
 recent changes?

 r286985

 panic: Stray timeout

 (kgdb) bt
 #0  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:260
 #1  0x8063236f in kern_reboot (howto=260) at
 /usr/src/sys/kern/kern_shutdown.c:328
 #2  0x806329d4 in vpanic (fmt=value optimized out, ap=value
 optimized
 out) at /usr/src/sys/kern/kern_shutdown.c:508
 #3  0x806326d3 in panic (fmt=0x0) at
 /usr/src/sys/kern/kern_shutdown.c:441
 #4  0x80677dea in taskqueue_timeout_func (arg=value optimized out)
 at
 /usr/src/sys/kern/subr_taskqueue.c:269
 #5  0x8064858d in softclock_call_cc (c=0xfe000241cb50,
 cc=0x81066000, direct=0) at /usr/src/sys/kern/kern_timeout.c:722
 #6  0x80648917 in softclock (arg=value optimized out) at
 /usr/src/sys/kern/kern_timeout.c:851
 #7  0x805fe47f in intr_event_execute_handlers
 (p=0xf800059b0548,
 ie=0xf8000599d900) at /usr/src/sys/kern/kern_intr.c:1262
 #8  0x805fefac in ithread_execute_handlers (p=0xf800059b0548,
 ie=0xf8000599d900) at /usr/src/sys/kern/kern_intr.c:1275
 #9  0x805fee1b in ithread_loop (arg=value optimized out) at
 /usr/src/sys/kern/kern_intr.c:1356
 #10 0x805fba9b in fork_exit (callout=0x805fedc0
 ithread_loop,
 arg=0xf800059adb60, frame=0xfe01dc55bc00) at
 /usr/src/sys/kern/kern_fork.c:1006
 #11 0x808073de in fork_trampoline () at
 /usr/src/sys/libkern/explicit_bzero.c:28
 #12 0x in ?? ()

 (kgdb) fr 5
 #5  0x8064858d in softclock_call_cc (c=0xfe000241cb50,
 cc=0x81066000, direct=0) at /usr/src/sys/kern/kern_timeout.c:722
 722 c_func(c_arg);
 (kgdb) p *c
 $1 = {c_links = {le = {le_next = 0x0, le_prev = 0x81066108}, sle =
 {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x81066108}},
 c_time
 = 171699241227799, c_precision = 2684354, c_arg = 0xfe000241cb28,
   c_func = 0x80677db0 taskqueue_timeout_func, c_lock =
 0xf80013d66d30, c_flags = 2, c_iflags = 144, c_cpu = 0}

 c_flags = CALLOUT_ACTIVE
 c_iflags = CALLOUT_RETURNUNLOCKED | CALLOUT_PROCESSED

 (kgdb) p *cc
 $2 = {cc_lock = {lock_object = {lo_name = 0x809a0177 callout,
 lo_flags
 = 720896, lo_data = 0, lo_witness = 0xfe0001fd1400}, mtx_lock = 4},
 cc_exec_entity = 0x81066080, cc_next = 0x0,
   cc_callout = 0xfe0002116000, cc_callwheel = 0xfe0002238000,
 cc_expireq
 = {tqh_first = 0x0, tqh_last = 0x81066108}, cc_callfree = {slh_first
 =
 0xfe00022372c0}, cc_firstevent = 171699349138844,
   cc_lastscan = 171699243988142, cc_cookie = 0xf800059a9b00, cc_bucket
 =
 10456, cc_inited = 1, cc_ktr_event_name = 0x81066140 callwheel cpu
 0}
 (kgdb) p c_arg
 $3 = (void *) 0xfe000241cb28

 (kgdb) down
 #4  0x80677dea in taskqueue_timeout_func (arg=value optimized out)
 at
 /usr/src/sys/kern/subr_taskqueue.c:269
 269 KASSERT((timeout_task-f  DT_CALLOUT_ARMED) != 0, (Stray
 timeout));
 (kgdb) p *(struct timeout_task *)0xfe000241cb28
 $4 = {q = 0xf80013d66d00, t = {ta_link = {stqe_next = 0x0}, ta_pending =
 0,
 ta_priority = 0, ta_func = 0x82197220 ttm_bo_delayed_workqueue,
 ta_context = 0xfe000241c5c0}, c = {c_links = {le = {le_next = 0x0,
 le_prev = 0x81066108}, sle = {sle_next = 0x0}, tqe =
 {tqe_next =
 0x0, tqe_prev = 0x81066108}}, c_time = 171699241227799, c_precision
 =
 2684354, c_arg = 0xfe000241cb28,
 c_func = 0x80677db0 taskqueue_timeout_func, c_lock =
 0xf80013d66d30, c_flags = 2, c_iflags = 144, c_cpu = 0}, f = 0}
 (kgdb) p *$4.q
 $5 = {tq_queue = {stqh_first = 0x0, stqh_last = 0xf80013d66d00},
 tq_enqueue
 = 0x80678c30 taskqueue_thread_enqueue, tq_context =
 0x811130d8, tq_active = {tqh_first = 0x0, tqh_last =
 0xf80013d66d20},
   tq_mutex = {lock_object = {lo_name = 0x809a52a1 taskqueue,
 lo_flags
 = 16973824, lo_data = 0, lo_witness = 0xfe0001fd7200}, mtx_lock =
 18446735277710583024}, tq_threads = 0xf80013dd1bd0, tq_tcount = 1,
   tq_spin = 0, tq_flags = 5, tq_callouts = 1, tq_callbacks =
 0xf80013d66d68,
 tq_cb_contexts = 0xf80013d66d78}

 BTW, I see the following potential problem.  taskqueue_cancel_timeout()
 unconditionally resets DT_CALLOUT_ARMED, so that happens even if
 callout_stop()
 signals that the callout is going to be called.  In that case, when
 taskqueue_timeout_func() gets called it's going to run into the assertion.

You have a running Xorg with radeonkms driver, and this issue
occurrence under high load (for example parallel buildworld)?


 --
 Andriy Gapon
 ___