Re: hrtimer (rdmavt RNR timer) was lost

2018-04-24 Thread Frederic Weisbecker
On Tue, Apr 24, 2018 at 04:54:58PM +0200, Thomas Gleixner wrote:
> On Mon, 23 Apr 2018, Thomas Gleixner wrote:
> > On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > > > Can you apply the following debug patch and enable the hrtimer_start 
> > > > trace
> > > > point and send me the full trace or upload it somewhere?
> > > 
> > > The original trace was about 29GB and I filtered it with
> > > "66dda1ea" (the offending base) to generate a 1.4GB file that I
> > > could open and investigate.  I am not sure how I can send them to you. Do
> > > you have somewhere I can upload to?
> > >
> > > I can try your debug patch and again I am anticipating a big trace file.
> > 
> > Well, you can find the spot where the fail happens and then extract the
> > full thing from 2s before that point to 1s after. That should be reasonably
> > small and good enough. Let me know when you have it and how big it is
> > (compressed) and we'll figure something out how to transport it.
> 
> Thanks for the more complete data which actually let me decode the wreckage.
> 
> So you have NO_HZ enabled and looking at the trace then this is related:
> 
>   -0 [003] dN.. 3598605307236: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71217100 
> softexpires=71217100mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605307601: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=04346740
> 
>   -0 [002] d.h. 3598605313255: hrtimer_start: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
> next=04346740
>   -0 [002] d.h. 3599538740786: hrtimer_post_add: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=662d2dd8
> 
>   -0 [003] dn.. 3599538742242: hrtimer_pre_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
>   -0 [003] dn.. 3599538743084: hrtimer_post_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
>   
>   -0 [003] dn.. 3599538743830: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71676700 
> softexpires=71676700mode=ABS|PINNED base=66dda1ea 
> next=662d2dd8
>   -0 [003] dn.. 3599538744560: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
> 
> And staring at the NOHZ code I'm pretty sure, I know what happens.
> 
> CPU 3 CPU 2
> 
> idle
> start tick_sched_timer 71217100
>   start rdmavt timer 712172915662
>   lock(base)
> tick_nohz_stop_tick()
> tick = 71676700   timerqueue_add(tmr)
> 
> hrtimer_set_expires(>sched_timer, tick); < FAIL
> 
>   if (tmr->exp < 
> queue->next->exp)
> hrtimer_start(sched_timer)queue->next = tmr;
> lock(base)
>   unlock(base)
> timerqueue_remove()
> timerqueue_add()
> 
> 
> So ts->sched_timer is still queued and queue->next is pointing to it, but
> then expires is modified. So the other side sees the new expiry time and
> makes the rdmavt timer the new queue->next. All f*cked from there.
> 
> The problem was introduced with:
> 
>   d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get 
> out of sync")
> 
> The changelog is not really helpful, but I can't see why the expiry time of
> ts->sched_timer should be updated before the timer is [re]started in case
> of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be
> sufficient. Of course the tick needs to be stored in ts->sched_timer for
> the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against
> Linus tree below. It's easy to backport in case you want to run it against
> the kernel which made the observation simpler.
> 
> I need to come up with integration of hrtimer_set_expires() into debug
> objects to catch this kind of horrors. Groan

Sorry for all that, that must have been hairy enough to debug :-(

I thought that I could fiddle with the sched_timer because nothing else
is supposed to touch it concurrently, but that forgot about the whole
queue locked under cpu_base. My fault.

The fix looks all good to me, thanks!

ACK.


Re: hrtimer (rdmavt RNR timer) was lost

2018-04-24 Thread Frederic Weisbecker
On Tue, Apr 24, 2018 at 04:54:58PM +0200, Thomas Gleixner wrote:
> On Mon, 23 Apr 2018, Thomas Gleixner wrote:
> > On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > > > Can you apply the following debug patch and enable the hrtimer_start 
> > > > trace
> > > > point and send me the full trace or upload it somewhere?
> > > 
> > > The original trace was about 29GB and I filtered it with
> > > "66dda1ea" (the offending base) to generate a 1.4GB file that I
> > > could open and investigate.  I am not sure how I can send them to you. Do
> > > you have somewhere I can upload to?
> > >
> > > I can try your debug patch and again I am anticipating a big trace file.
> > 
> > Well, you can find the spot where the fail happens and then extract the
> > full thing from 2s before that point to 1s after. That should be reasonably
> > small and good enough. Let me know when you have it and how big it is
> > (compressed) and we'll figure something out how to transport it.
> 
> Thanks for the more complete data which actually let me decode the wreckage.
> 
> So you have NO_HZ enabled and looking at the trace then this is related:
> 
>   -0 [003] dN.. 3598605307236: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71217100 
> softexpires=71217100mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605307601: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=04346740
> 
>   -0 [002] d.h. 3598605313255: hrtimer_start: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
> next=04346740
>   -0 [002] d.h. 3599538740786: hrtimer_post_add: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=662d2dd8
> 
>   -0 [003] dn.. 3599538742242: hrtimer_pre_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
>   -0 [003] dn.. 3599538743084: hrtimer_post_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
>   
>   -0 [003] dn.. 3599538743830: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71676700 
> softexpires=71676700mode=ABS|PINNED base=66dda1ea 
> next=662d2dd8
>   -0 [003] dn.. 3599538744560: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=662d2dd8
> 
> And staring at the NOHZ code I'm pretty sure, I know what happens.
> 
> CPU 3 CPU 2
> 
> idle
> start tick_sched_timer 71217100
>   start rdmavt timer 712172915662
>   lock(base)
> tick_nohz_stop_tick()
> tick = 71676700   timerqueue_add(tmr)
> 
> hrtimer_set_expires(>sched_timer, tick); < FAIL
> 
>   if (tmr->exp < 
> queue->next->exp)
> hrtimer_start(sched_timer)queue->next = tmr;
> lock(base)
>   unlock(base)
> timerqueue_remove()
> timerqueue_add()
> 
> 
> So ts->sched_timer is still queued and queue->next is pointing to it, but
> then expires is modified. So the other side sees the new expiry time and
> makes the rdmavt timer the new queue->next. All f*cked from there.
> 
> The problem was introduced with:
> 
>   d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get 
> out of sync")
> 
> The changelog is not really helpful, but I can't see why the expiry time of
> ts->sched_timer should be updated before the timer is [re]started in case
> of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be
> sufficient. Of course the tick needs to be stored in ts->sched_timer for
> the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against
> Linus tree below. It's easy to backport in case you want to run it against
> the kernel which made the observation simpler.
> 
> I need to come up with integration of hrtimer_set_expires() into debug
> objects to catch this kind of horrors. Groan

Sorry for all that, that must have been hairy enough to debug :-(

I thought that I could fiddle with the sched_timer because nothing else
is supposed to touch it concurrently, but that forgot about the whole
queue locked under cpu_base. My fault.

The fix looks all good to me, thanks!

ACK.


RE: hrtimer (rdmavt RNR timer) was lost

2018-04-24 Thread Thomas Gleixner
On Mon, 23 Apr 2018, Thomas Gleixner wrote:
> On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > > Can you apply the following debug patch and enable the hrtimer_start trace
> > > point and send me the full trace or upload it somewhere?
> > 
> > The original trace was about 29GB and I filtered it with
> > "66dda1ea" (the offending base) to generate a 1.4GB file that I
> > could open and investigate.  I am not sure how I can send them to you. Do
> > you have somewhere I can upload to?
> >
> > I can try your debug patch and again I am anticipating a big trace file.
> 
> Well, you can find the spot where the fail happens and then extract the
> full thing from 2s before that point to 1s after. That should be reasonably
> small and good enough. Let me know when you have it and how big it is
> (compressed) and we'll figure something out how to transport it.

Thanks for the more complete data which actually let me decode the wreckage.

So you have NO_HZ enabled and looking at the trace then this is related:

  -0 [003] dN.. 3598605307236: hrtimer_start: 
hrtimer=04346740 function=tick_sched_timer expires=71217100 
softexpires=71217100mode=ABS|PINNED base=66dda1ea 
next=708914d7
  -0 [003] dN.. 3598605307601: hrtimer_post_add: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=04346740
  
  -0 [002] d.h. 3598605313255: hrtimer_start: 
hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
next=04346740
  -0 [002] d.h. 3599538740786: hrtimer_post_add: 
hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
base=66dda1ea next=662d2dd8

  -0 [003] dn.. 3599538742242: hrtimer_pre_remove: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8
  -0 [003] dn.. 3599538743084: hrtimer_post_remove: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8
  
  -0 [003] dn.. 3599538743830: hrtimer_start: 
hrtimer=04346740 function=tick_sched_timer expires=71676700 
softexpires=71676700mode=ABS|PINNED base=66dda1ea 
next=662d2dd8
  -0 [003] dn.. 3599538744560: hrtimer_post_add: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8

And staring at the NOHZ code I'm pretty sure, I know what happens.

CPU 3   CPU 2

idle
start tick_sched_timer 71217100
start rdmavt timer 712172915662
lock(base)
tick_nohz_stop_tick()
tick = 71676700 timerqueue_add(tmr)

hrtimer_set_expires(>sched_timer, tick); < FAIL

if (tmr->exp < 
queue->next->exp)
hrtimer_start(sched_timer)  queue->next = tmr;
lock(base)
unlock(base)
timerqueue_remove()
timerqueue_add()


So ts->sched_timer is still queued and queue->next is pointing to it, but
then expires is modified. So the other side sees the new expiry time and
makes the rdmavt timer the new queue->next. All f*cked from there.

The problem was introduced with:

  d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out 
of sync")

The changelog is not really helpful, but I can't see why the expiry time of
ts->sched_timer should be updated before the timer is [re]started in case
of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be
sufficient. Of course the tick needs to be stored in ts->sched_timer for
the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against
Linus tree below. It's easy to backport in case you want to run it against
the kernel which made the observation simpler.

I need to come up with integration of hrtimer_set_expires() into debug
objects to catch this kind of horrors. Groan

Thanks,

tglx

8<---
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -804,12 +804,12 @@ static void tick_nohz_stop_tick(struct t
return;
}
 
-   hrtimer_set_expires(>sched_timer, tick);
-
-   if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
-   hrtimer_start_expires(>sched_timer, 
HRTIMER_MODE_ABS_PINNED);
-   else
+   if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
+   hrtimer_start(>sched_timer, tick, HRTIMER_MODE_ABS_PINNED);
+   } else {
+   hrtimer_set_expires(>sched_timer, tick);
tick_program_event(tick, 1);
+   }
 }
 
 static void tick_nohz_retain_tick(struct tick_sched *ts)


RE: hrtimer (rdmavt RNR timer) was lost

2018-04-24 Thread Thomas Gleixner
On Mon, 23 Apr 2018, Thomas Gleixner wrote:
> On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > > Can you apply the following debug patch and enable the hrtimer_start trace
> > > point and send me the full trace or upload it somewhere?
> > 
> > The original trace was about 29GB and I filtered it with
> > "66dda1ea" (the offending base) to generate a 1.4GB file that I
> > could open and investigate.  I am not sure how I can send them to you. Do
> > you have somewhere I can upload to?
> >
> > I can try your debug patch and again I am anticipating a big trace file.
> 
> Well, you can find the spot where the fail happens and then extract the
> full thing from 2s before that point to 1s after. That should be reasonably
> small and good enough. Let me know when you have it and how big it is
> (compressed) and we'll figure something out how to transport it.

Thanks for the more complete data which actually let me decode the wreckage.

So you have NO_HZ enabled and looking at the trace then this is related:

  -0 [003] dN.. 3598605307236: hrtimer_start: 
hrtimer=04346740 function=tick_sched_timer expires=71217100 
softexpires=71217100mode=ABS|PINNED base=66dda1ea 
next=708914d7
  -0 [003] dN.. 3598605307601: hrtimer_post_add: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=04346740
  
  -0 [002] d.h. 3598605313255: hrtimer_start: 
hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
next=04346740
  -0 [002] d.h. 3599538740786: hrtimer_post_add: 
hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
base=66dda1ea next=662d2dd8

  -0 [003] dn.. 3599538742242: hrtimer_pre_remove: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8
  -0 [003] dn.. 3599538743084: hrtimer_post_remove: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8
  
  -0 [003] dn.. 3599538743830: hrtimer_start: 
hrtimer=04346740 function=tick_sched_timer expires=71676700 
softexpires=71676700mode=ABS|PINNED base=66dda1ea 
next=662d2dd8
  -0 [003] dn.. 3599538744560: hrtimer_post_add: 
hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
next=662d2dd8

And staring at the NOHZ code I'm pretty sure, I know what happens.

CPU 3   CPU 2

idle
start tick_sched_timer 71217100
start rdmavt timer 712172915662
lock(base)
tick_nohz_stop_tick()
tick = 71676700 timerqueue_add(tmr)

hrtimer_set_expires(>sched_timer, tick); < FAIL

if (tmr->exp < 
queue->next->exp)
hrtimer_start(sched_timer)  queue->next = tmr;
lock(base)
unlock(base)
timerqueue_remove()
timerqueue_add()


So ts->sched_timer is still queued and queue->next is pointing to it, but
then expires is modified. So the other side sees the new expiry time and
makes the rdmavt timer the new queue->next. All f*cked from there.

The problem was introduced with:

  d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out 
of sync")

The changelog is not really helpful, but I can't see why the expiry time of
ts->sched_timer should be updated before the timer is [re]started in case
of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be
sufficient. Of course the tick needs to be stored in ts->sched_timer for
the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against
Linus tree below. It's easy to backport in case you want to run it against
the kernel which made the observation simpler.

I need to come up with integration of hrtimer_set_expires() into debug
objects to catch this kind of horrors. Groan

Thanks,

tglx

8<---
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -804,12 +804,12 @@ static void tick_nohz_stop_tick(struct t
return;
}
 
-   hrtimer_set_expires(>sched_timer, tick);
-
-   if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
-   hrtimer_start_expires(>sched_timer, 
HRTIMER_MODE_ABS_PINNED);
-   else
+   if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
+   hrtimer_start(>sched_timer, tick, HRTIMER_MODE_ABS_PINNED);
+   } else {
+   hrtimer_set_expires(>sched_timer, tick);
tick_program_event(tick, 1);
+   }
 }
 
 static void tick_nohz_retain_tick(struct tick_sched *ts)


RE: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Thomas Gleixner
On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > Can you apply the following debug patch and enable the hrtimer_start trace
> > point and send me the full trace or upload it somewhere?
> 
> The original trace was about 29GB and I filtered it with
> "66dda1ea" (the offending base) to generate a 1.4GB file that I
> could open and investigate.  I am not sure how I can send them to you. Do
> you have somewhere I can upload to?
>
> I can try your debug patch and again I am anticipating a big trace file.

Well, you can find the spot where the fail happens and then extract the
full thing from 2s before that point to 1s after. That should be reasonably
small and good enough. Let me know when you have it and how big it is
(compressed) and we'll figure something out how to transport it.

Thanks,

tglx



RE: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Thomas Gleixner
On Mon, 23 Apr 2018, Wan, Kaike wrote:
> > Can you apply the following debug patch and enable the hrtimer_start trace
> > point and send me the full trace or upload it somewhere?
> 
> The original trace was about 29GB and I filtered it with
> "66dda1ea" (the offending base) to generate a 1.4GB file that I
> could open and investigate.  I am not sure how I can send them to you. Do
> you have somewhere I can upload to?
>
> I can try your debug patch and again I am anticipating a big trace file.

Well, you can find the spot where the fail happens and then extract the
full thing from 2s before that point to 1s after. That should be reasonably
small and good enough. Let me know when you have it and how big it is
(compressed) and we'll figure something out how to transport it.

Thanks,

tglx



RE: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Wan, Kaike


> -Original Message-
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Monday, April 23, 2018 8:53 AM
> To: Wan, Kaike <kaike@intel.com>
> Cc: Marciniszyn, Mike <mike.marcinis...@intel.com>; Dalessandro, Dennis
> <dennis.dalessan...@intel.com>; Weiny, Ira <ira.we...@intel.com>; Fleck,
> John <john.fl...@intel.com>; linux-kernel@vger.kernel.org; linux-
> r...@vger.kernel.org
> Subject: Re: hrtimer (rdmavt RNR timer) was lost
> 
> On Fri, 20 Apr 2018, Wan, Kaike wrote:
> 
> Please do not send html mail. It's rejected by the list server.
> 
> >   -0 [001] d.h. 3598473310671: hrtimer_start:
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172295802 softexpires=712172295802mode=REL
> base=66dda1ea next=39cae92b
> >   -0 [001] d.h. 3598473311561: hrtimer_post_add:
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt]
> base=66dda1ea next=39cae92b
> >
> 
> > --> The target was first added. The expires was 712172295802. The call
> > was made from cpu 1 while the timer base was on cpu 3.
> 
> >   -0 [003] dn.. 3598605305357: hrtimer_start:
> hrtimer=04346740 function=tick_sched_timer
> expires=71676700 softexpires=71676700mode=ABS|PINNED
> base=66dda1ea next=708914d7
> >   -0 [003] dn.. 3598605305781: hrtimer_post_add:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >
> > --> the target timer was the next one to trigger. The call was from cpu 3.
> >
> >   -0 [003] dN.. 3598605306604: hrtimer_pre_remove:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >   -0 [003] dN.. 3598605306802: hrtimer_post_remove:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >
> > --> The tick_sched_timer was removed
> >
> >   -0 [003] dN.. 3598605307236: hrtimer_start:
> hrtimer=04346740 function=tick_sched_timer
> expires=71217100 softexpires=71217100mode=ABS|PINNED
> base=66dda1ea next=708914d7
> >   -0 [003] dN.. 3598605307601: hrtimer_post_add:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=04346740
> >
> > --> The tick_sched_timer was added again with a new expires
> > --> (71217100
> > < 71676700), which replaced the target (712172295802) as the
> > head->next.
> >
> >   -0 [002] d.h. 3598605313255: hrtimer_start:
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172915662 softexpires=712172915662mode=REL
> base=66dda1ea next=04346740
> >   -0 [002] d.h. 3599538740786: hrtimer_post_add:
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> base=66dda1ea next=662d2dd8
> >
> >
> > --> A new hrtimer (662d2dd8) was added (from cpu 2). Its
> > --> expires
> > (712172915662) was larger than the target's (712172295802). How could
> > it become the next hrtimer to trigger??? The next pointer was not set
> > properly here.
> 
> I have no idea how that can happen assumed that you did not strip anything
> from the trace. If you stripped stuff, please send me the full trace or upload
> it somewhere.
> 
> Can you apply the following debug patch and enable the hrtimer_start trace
> point and send me the full trace or upload it somewhere?

The original trace was about 29GB and I filtered it with  "66dda1ea" 
(the offending base) to generate a 1.4GB  file that I could open and 
investigate.  I am not sure how I can send them to you. Do you have somewhere I 
can upload to?

I can try your debug patch and again I am anticipating a big trace file.

Kaike

> 
> Thanks,
> 
>   tglx
> 
> 8<-
> --- a/lib/timerqueue.c
> +++ b/lib/timerqueue.c
> @@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he
>   rb_link_node(>node, parent, p);
>   rb_insert_color(>node, >head);
> 
> + trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +  head, head->next, head->next ? head->next->expires :
> 0ULL,
> +  node, node->expires);
> +
>   if (!head->next || node->expires < head->next->expires) {
>   head->next = node;
>   return true;
> @@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he
> 
>   head->next = rb_entry_safe(rbn, struct timerqueue_node,
> node);
>   }
> +
> + trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +  head, head->next, head->next ? head->next->expires :
> 0ULL,
> +  node, node->expires);
> +
>   rb_erase(>node, >head);
>   RB_CLEAR_NODE(>node);
>   return head->next != NULL;


RE: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Wan, Kaike


> -Original Message-
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Monday, April 23, 2018 8:53 AM
> To: Wan, Kaike 
> Cc: Marciniszyn, Mike ; Dalessandro, Dennis
> ; Weiny, Ira ; Fleck,
> John ; linux-kernel@vger.kernel.org; linux-
> r...@vger.kernel.org
> Subject: Re: hrtimer (rdmavt RNR timer) was lost
> 
> On Fri, 20 Apr 2018, Wan, Kaike wrote:
> 
> Please do not send html mail. It's rejected by the list server.
> 
> >   -0 [001] d.h. 3598473310671: hrtimer_start:
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172295802 softexpires=712172295802mode=REL
> base=66dda1ea next=39cae92b
> >   -0 [001] d.h. 3598473311561: hrtimer_post_add:
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt]
> base=66dda1ea next=39cae92b
> >
> 
> > --> The target was first added. The expires was 712172295802. The call
> > was made from cpu 1 while the timer base was on cpu 3.
> 
> >   -0 [003] dn.. 3598605305357: hrtimer_start:
> hrtimer=04346740 function=tick_sched_timer
> expires=71676700 softexpires=71676700mode=ABS|PINNED
> base=66dda1ea next=708914d7
> >   -0 [003] dn.. 3598605305781: hrtimer_post_add:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >
> > --> the target timer was the next one to trigger. The call was from cpu 3.
> >
> >   -0 [003] dN.. 3598605306604: hrtimer_pre_remove:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >   -0 [003] dN.. 3598605306802: hrtimer_post_remove:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=708914d7
> >
> > --> The tick_sched_timer was removed
> >
> >   -0 [003] dN.. 3598605307236: hrtimer_start:
> hrtimer=04346740 function=tick_sched_timer
> expires=71217100 softexpires=71217100mode=ABS|PINNED
> base=66dda1ea next=708914d7
> >   -0 [003] dN.. 3598605307601: hrtimer_post_add:
> hrtimer=04346740 function=tick_sched_timer
> base=66dda1ea next=04346740
> >
> > --> The tick_sched_timer was added again with a new expires
> > --> (71217100
> > < 71676700), which replaced the target (712172295802) as the
> > head->next.
> >
> >   -0 [002] d.h. 3598605313255: hrtimer_start:
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> expires=712172915662 softexpires=712172915662mode=REL
> base=66dda1ea next=04346740
> >   -0 [002] d.h. 3599538740786: hrtimer_post_add:
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt]
> base=66dda1ea next=662d2dd8
> >
> >
> > --> A new hrtimer (662d2dd8) was added (from cpu 2). Its
> > --> expires
> > (712172915662) was larger than the target's (712172295802). How could
> > it become the next hrtimer to trigger??? The next pointer was not set
> > properly here.
> 
> I have no idea how that can happen assumed that you did not strip anything
> from the trace. If you stripped stuff, please send me the full trace or upload
> it somewhere.
> 
> Can you apply the following debug patch and enable the hrtimer_start trace
> point and send me the full trace or upload it somewhere?

The original trace was about 29GB and I filtered it with  "66dda1ea" 
(the offending base) to generate a 1.4GB  file that I could open and 
investigate.  I am not sure how I can send them to you. Do you have somewhere I 
can upload to?

I can try your debug patch and again I am anticipating a big trace file.

Kaike

> 
> Thanks,
> 
>   tglx
> 
> 8<-
> --- a/lib/timerqueue.c
> +++ b/lib/timerqueue.c
> @@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he
>   rb_link_node(>node, parent, p);
>   rb_insert_color(>node, >head);
> 
> + trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +  head, head->next, head->next ? head->next->expires :
> 0ULL,
> +  node, node->expires);
> +
>   if (!head->next || node->expires < head->next->expires) {
>   head->next = node;
>   return true;
> @@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he
> 
>   head->next = rb_entry_safe(rbn, struct timerqueue_node,
> node);
>   }
> +
> + trace_printk("head %p next %p next->exp %llu node %p node-
> >exp %llu\n",
> +  head, head->next, head->next ? head->next->expires :
> 0ULL,
> +  node, node->expires);
> +
>   rb_erase(>node, >head);
>   RB_CLEAR_NODE(>node);
>   return head->next != NULL;


Re: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Thomas Gleixner
On Fri, 20 Apr 2018, Wan, Kaike wrote:

Please do not send html mail. It's rejected by the list server.

>   -0 [001] d.h. 3598473310671: hrtimer_start: 
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172295802 softexpires=712172295802mode=REL base=66dda1ea 
> next=39cae92b
>   -0 [001] d.h. 3598473311561: hrtimer_post_add: 
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=39cae92b
> 

> --> The target was first added. The expires was 712172295802. The call
> was made from cpu 1 while the timer base was on cpu 3.

>   -0 [003] dn.. 3598605305357: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71676700 
> softexpires=71676700mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dn.. 3598605305781: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
> 
> --> the target timer was the next one to trigger. The call was from cpu 3.
> 
>   -0 [003] dN.. 3598605306604: hrtimer_pre_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605306802: hrtimer_post_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
> 
> --> The tick_sched_timer was removed
> 
>   -0 [003] dN.. 3598605307236: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71217100 
> softexpires=71217100mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605307601: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=04346740
> 
> --> The tick_sched_timer was added again with a new expires (71217100
> < 71676700), which replaced the target (712172295802) as the
> head->next.
> 
>   -0 [002] d.h. 3598605313255: hrtimer_start: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
> next=04346740
>   -0 [002] d.h. 3599538740786: hrtimer_post_add: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=662d2dd8
>
> 
> --> A new hrtimer (662d2dd8) was added (from cpu 2). Its expires
> (712172915662) was larger than the target's (712172295802). How could
> it become the next hrtimer to trigger??? The next pointer was not set
> properly here.

I have no idea how that can happen assumed that you did not strip anything
from the trace. If you stripped stuff, please send me the full trace or
upload it somewhere.

Can you apply the following debug patch and enable the hrtimer_start trace
point and send me the full trace or upload it somewhere?

Thanks,

tglx

8<-
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he
rb_link_node(>node, parent, p);
rb_insert_color(>node, >head);
 
+   trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n",
+head, head->next, head->next ? head->next->expires : 0ULL,
+node, node->expires);
+
if (!head->next || node->expires < head->next->expires) {
head->next = node;
return true;
@@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he
 
head->next = rb_entry_safe(rbn, struct timerqueue_node, node);
}
+
+   trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n",
+head, head->next, head->next ? head->next->expires : 0ULL,
+node, node->expires);
+
rb_erase(>node, >head);
RB_CLEAR_NODE(>node);
return head->next != NULL;


Re: hrtimer (rdmavt RNR timer) was lost

2018-04-23 Thread Thomas Gleixner
On Fri, 20 Apr 2018, Wan, Kaike wrote:

Please do not send html mail. It's rejected by the list server.

>   -0 [001] d.h. 3598473310671: hrtimer_start: 
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172295802 softexpires=712172295802mode=REL base=66dda1ea 
> next=39cae92b
>   -0 [001] d.h. 3598473311561: hrtimer_post_add: 
> hrtimer=708914d7 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=39cae92b
> 

> --> The target was first added. The expires was 712172295802. The call
> was made from cpu 1 while the timer base was on cpu 3.

>   -0 [003] dn.. 3598605305357: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71676700 
> softexpires=71676700mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dn.. 3598605305781: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
> 
> --> the target timer was the next one to trigger. The call was from cpu 3.
> 
>   -0 [003] dN.. 3598605306604: hrtimer_pre_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605306802: hrtimer_post_remove: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=708914d7
> 
> --> The tick_sched_timer was removed
> 
>   -0 [003] dN.. 3598605307236: hrtimer_start: 
> hrtimer=04346740 function=tick_sched_timer expires=71217100 
> softexpires=71217100mode=ABS|PINNED base=66dda1ea 
> next=708914d7
>   -0 [003] dN.. 3598605307601: hrtimer_post_add: 
> hrtimer=04346740 function=tick_sched_timer base=66dda1ea 
> next=04346740
> 
> --> The tick_sched_timer was added again with a new expires (71217100
> < 71676700), which replaced the target (712172295802) as the
> head->next.
> 
>   -0 [002] d.h. 3598605313255: hrtimer_start: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> expires=712172915662 softexpires=712172915662mode=REL base=66dda1ea 
> next=04346740
>   -0 [002] d.h. 3599538740786: hrtimer_post_add: 
> hrtimer=662d2dd8 function=rvt_rc_rnr_retry [rdmavt] 
> base=66dda1ea next=662d2dd8
>
> 
> --> A new hrtimer (662d2dd8) was added (from cpu 2). Its expires
> (712172915662) was larger than the target's (712172295802). How could
> it become the next hrtimer to trigger??? The next pointer was not set
> properly here.

I have no idea how that can happen assumed that you did not strip anything
from the trace. If you stripped stuff, please send me the full trace or
upload it somewhere.

Can you apply the following debug patch and enable the hrtimer_start trace
point and send me the full trace or upload it somewhere?

Thanks,

tglx

8<-
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -57,6 +57,10 @@ bool timerqueue_add(struct timerqueue_he
rb_link_node(>node, parent, p);
rb_insert_color(>node, >head);
 
+   trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n",
+head, head->next, head->next ? head->next->expires : 0ULL,
+node, node->expires);
+
if (!head->next || node->expires < head->next->expires) {
head->next = node;
return true;
@@ -84,6 +88,11 @@ bool timerqueue_del(struct timerqueue_he
 
head->next = rb_entry_safe(rbn, struct timerqueue_node, node);
}
+
+   trace_printk("head %p next %p next->exp %llu node %p node->exp %llu\n",
+head, head->next, head->next ? head->next->expires : 0ULL,
+node, node->expires);
+
rb_erase(>node, >head);
RB_CLEAR_NODE(>node);
return head->next != NULL;