Thank you, sir, for your guidance and quick response.

Let me introduce my colleagues Paul and Mikhail (copied in CC). They will be
acting on the guidance in your email and may reach out to you with
further queries.

We appreciate your support and help.

Thanks,
Atul

-----Original Message-----
From: Paul E. McKenney <paul...@kernel.org> 
Sent: 01 May 2020 00:47
To: Atul Kulkarni <atul.kulka...@katerra.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Need help on "Self Detected Stall on CPU"

On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
> 
> Hope you are doing well.  I have watched your various conference videos and 
> have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is 
> no one better who can guide us on how we can deal with it.
> 
> I have attached logs for your reference. Towards the end I ran a couple of
> SysRq commands and took a crash dump using SysRq, which may help provide
> additional information.
> Could you please guide us on how we could fix this issue or identify what is
> going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110] 0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163]   (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for
2101 jiffies!"  If you don't let RCU's kernel threads run, then RCU CPU stall 
warnings are expected behavior.

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep for 
three jiffies.  As you can see from earlier in that same line, that was 2101 
jiffies ago.  The "->state=0x402" means that the scheduler believes that this
kthread is blocked, that is, not yet runnable.
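
As a rough illustration of how those fields can be pulled apart, here is a
minimal Python sketch. The regular expression is modeled only on the one
console line quoted above, so other kernel versions may need a different
pattern. The state-bit table is deliberately partial (check
include/linux/sched.h for your kernel), and the function names are made up
for this example.

```python
import re

# Pattern modeled on the single "kthread starved" line quoted above.
# Assumption: other kernel versions print a different set of fields.
STARVED_RE = re.compile(
    r"(?P<flavor>\S+) kthread starved for (?P<jiffies>\d+) jiffies!"
    r".*?(?P<gp_state>RCU_GP_\w+)\((?P<gp_num>\d+)\)"
    r"\s+->state=(?P<state>0x[0-9a-fA-F]+)\s+->cpu=(?P<cpu>\d+)"
)

# Partial table of task-state bits. Only bits relevant to this report are
# listed; anything else is reported as raw hex rather than guessed at.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}


def decode_task_state(state):
    """Return the names of the known bits set in a task ->state value."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = []
    remaining = state
    for bit, name in sorted(TASK_STATE_BITS.items()):
        if state & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        names.append(hex(remaining))  # bits this sketch does not name
    return names


def jiffies_to_seconds(jiffies, hz):
    """Convert jiffies to seconds; hz is the kernel tick rate (CONFIG_HZ)."""
    return jiffies / hz


def parse_starved_line(line):
    """Extract the fields of a 'kthread starved' line, or None if no match."""
    # Mail clients wrap long console lines, so collapse whitespace first.
    m = STARVED_RE.search(" ".join(line.split()))
    if m is None:
        return None
    state = int(m.group("state"), 16)
    return {
        "flavor": m.group("flavor"),
        "starved_jiffies": int(m.group("jiffies")),
        "gp_state": m.group("gp_state"),
        # The number printed in parentheses after the state name.
        "gp_state_num": int(m.group("gp_num")),
        "task_state": state,
        "task_state_names": decode_task_state(state),
        "cpu": int(m.group("cpu")),
    }
```

Fed the last console line above, this reports a task state of
TASK_UNINTERRUPTIBLE plus TASK_NOLOAD, a combination the kernel calls
TASK_IDLE, which is consistent with a kthread that is sleeping and waiting
for a wakeup.  If the tick rate were HZ=100 (an assumption, since it is not
stated in this thread), 2101 jiffies would be about 21 seconds.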

The usual cause of this sort of thing is a timer problem, be it a hardware
configuration problem, a timer-driver bug, an interrupt-handling problem, and
so on.  This sort of problem is especially common when bringing up new
hardware, when modifying timer code, or when modifying code on the
interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching this 
kthread?", with special attention to changed code and new hardware.
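
One crude first check is whether timer interrupts are arriving at all on
the affected CPU.  The Python sketch below compares two snapshots of
/proc/interrupts-style text and lists the timer-related rows whose
per-CPU count did not advance.  The function names and the "timer" keyword
match are assumptions made for this example, and row descriptions differ
between platforms.  A flat count is only a hint, because a tickless idle
CPU can legitimately take very few timer interrupts.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text.

    Returns (cpu_names, rows), where rows maps each row label to a tuple
    (per-CPU counts, description string).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [], {}
    cpu_names = lines[0].split()  # header row: CPU0 CPU1 ...
    rows = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # The first len(cpu_names) fields are counters; some rows
        # (for example ERR) carry fewer, so stop at the first non-number.
        for field in fields[:len(cpu_names)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        rows[label.strip()] = (counts, desc)
    return cpu_names, rows


def flat_timer_counts(before, after, keyword="timer"):
    """List (row label, CPU name) pairs whose timer count did not advance."""
    cpu_names, old_rows = parse_interrupts(before)
    _, new_rows = parse_interrupts(after)
    flat = []
    for label, (old_counts, desc) in old_rows.items():
        if keyword.lower() not in desc.lower() or label not in new_rows:
            continue
        new_counts = new_rows[label][0]
        for cpu, old, new in zip(cpu_names, old_counts, new_counts):
            if new <= old:
                flat.append((label, cpu))
    return flat
```

To use it, read /proc/interrupts twice a second or so apart while the
system is busy, then pass both strings to flat_timer_counts().  Any CPU it
reports is a candidate for a closer look at the timer and interrupt setup.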

                                                        Thanx, Paul
