Hey,

I'm trying to debug an issue on an embedded Linux system (kernel v4.4.107) where repeatedly and quickly suspending (suspend-to-RAM, "mem") and resuming the system causes one of its I2C controllers to malfunction. I've run out of ideas and would like to know if anyone recognizes the issue or can provide clues on how to move the debugging forward.

Background: My system can be suspended and resumed using two buttons. The buttons are attached to a GPIO expander, which in turn is connected to the SoC via an I2C bus. The wake-up button acts as a wake-up source for the kernel. When a button is pressed, the GPIO expander triggers an interrupt and the SoC accesses the I2C bus to read out which button was pressed. If I mash the buttons like a 2-year-old would, the system eventually (within a minute or so) fails to suspend, with the kernel error "PM: noirq suspend of devices failed". Just before this happens, I also see "controller timed out" errors from the I2C controller driver in the kernel log.

The device that fails to suspend is the GPIO expander and, if I understand the kernel code correctly, that is because an IRQ arrived just as the suspend was in progress. The kernel tries to process the IRQ before going to sleep, but the I2C controller is no longer working, so it cannot serve the IRQ; it aborts the suspend and the system resumes. In a way this is correct behaviour: the kernel is going to sleep, receives an IRQ from the wake-up source and aborts the suspend. BUT it does not explain why the controller times out, or why it only happens sometimes. If I suspend and resume more gently (i.e. without spamming the buttons), it works great.

What is odd is that once the system has resumed, the I2C controller starts working again. But if I keep repeating the same procedure, the system eventually can no longer suspend at all: the suspend failure happens every time and the system cannot go to sleep. That is a disaster, because this is a battery-powered device. What's even worse is that sometimes the GPIO expander stops working altogether, likely because it uses an IRQF_ONESHOT IRQ and, when we are unable to process the IRQs (due to the broken I2C controller), the IRQ is never re-enabled. I've been able to verify this by successfully sending I2C messages from the CLI to the ADP5589 to poll its status, while IRQs from it are not arriving at the IRQ handler.

For reference, the I2C controller I'm using is the DesignWare I2C. The driver is drivers/i2c/busses/i2c-designware-*. The GPIO expander is an ADP5589 and the driver I'm using is drivers/input/keyboard/adp5589-keys.c. When the issue occurs, the controller timeout (https://elixir.bootlin.com/linux/v4.4.107/source/drivers/i2c/busses/i2c-designware-core.c#L659) happens because an ongoing I2C transfer (requested by the ADP5589 IRQ handler) does not finish within 1 second.
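To make the controller state at the moment of the timeout easier to read in a log, the raw status word can be decoded into flag names. The following is only a sketch: the bit positions are my understanding of the DesignWare IC_STATUS register layout and should be checked against the SoC's databook, and the helper names dw_status_is_idle() and dw_status_describe() are hypothetical, not part of the kernel driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed IC_STATUS bit layout; verify against your DW_apb_i2c databook. */
#define IC_STATUS_ACTIVITY     (1u << 0) /* any I2C activity */
#define IC_STATUS_TFNF         (1u << 1) /* TX FIFO not full */
#define IC_STATUS_TFE          (1u << 2) /* TX FIFO empty */
#define IC_STATUS_RFNE         (1u << 3) /* RX FIFO not empty */
#define IC_STATUS_RFF          (1u << 4) /* RX FIFO full */
#define IC_STATUS_MST_ACTIVITY (1u << 5) /* master state machine busy */
#define IC_STATUS_SLV_ACTIVITY (1u << 6) /* slave state machine busy */

static const struct {
    uint32_t mask;
    const char *name;
} dw_status_bits[] = {
    { IC_STATUS_ACTIVITY,     "ACTIVITY" },
    { IC_STATUS_TFNF,         "TFNF" },
    { IC_STATUS_TFE,          "TFE" },
    { IC_STATUS_RFNE,         "RFNE" },
    { IC_STATUS_RFF,          "RFF" },
    { IC_STATUS_MST_ACTIVITY, "MST_ACTIVITY" },
    { IC_STATUS_SLV_ACTIVITY, "SLV_ACTIVITY" },
};

/* Hypothetical helper: "idle" here means no activity flag is set and
 * the TX FIFO has drained. */
static int dw_status_is_idle(uint32_t status)
{
    uint32_t busy = IC_STATUS_ACTIVITY | IC_STATUS_MST_ACTIVITY |
                    IC_STATUS_SLV_ACTIVITY;

    return !(status & busy) && (status & IC_STATUS_TFE);
}

/* Hypothetical helper: write the names of all set flags into buf,
 * separated by spaces. Stops at the first name that does not fit;
 * buf is always NUL-terminated when len > 0. */
static void dw_status_describe(uint32_t status, char *buf, size_t len)
{
    size_t used = 0;
    size_t i;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (i = 0; i < sizeof(dw_status_bits) / sizeof(dw_status_bits[0]); i++) {
        int n;

        if (!(status & dw_status_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", dw_status_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of space */
        used += (size_t)n;
    }
}
```

In the driver itself the same decoding could be done with dev_err() right where the timeout is reported, so each "controller timed out" line carries the flags alongside it.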

I have connected a logic analyzer to the I2C pins, and when the controller timeout happens I see that both SDA and SCL are pulled low. They are kept low until the system is resumed and the controller recovers. At first I thought this was an I2C bus fault, so I tried implementing I2C bus recovery by remuxing the SDA and SCL pins to the GPIO controller and then pulsing SCL. However, as soon as I remux the pins, SCL and SDA are no longer pulled low. To me this indicates that it is not one of the slaves that is hogging the bus, it is the master. I can also tell from the controller status registers that when the timeout occurs, the controller is not idle, but it is also not getting the STOP bit interrupt or anything else that would "complete" the transfer. It's stuck. I have looked upstream in kernels more recent than 4.4 for fixes that would resolve this (and there are quite a few commits that mention "controller timed out" for the designware driver), but so far nothing has worked.
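The recovery attempt described above follows the usual pattern: with the pads remuxed to GPIO, clock SCL up to nine times and stop as soon as SDA is released. Roughly, as a user-space model rather than the actual driver change (struct i2c_pins, i2c_recover_bus() and the fake_* helpers are hypothetical names made up for this sketch; in the kernel the real hook for this is struct i2c_bus_recovery_info):

```c
#include <stdbool.h>

#define I2C_RECOVERY_MAX_PULSES 9

/* Hypothetical pin-access hooks. On real hardware these would drive and
 * read the pads after they have been remuxed to the GPIO controller. */
struct i2c_pins {
    void (*set_scl)(void *ctx, bool level);
    bool (*get_sda)(void *ctx);
    void (*delay_half_period)(void *ctx);
    void *ctx;
};

/* Pulse SCL until SDA reads high. Returns the number of pulses that were
 * needed (0 if SDA was already high), or -1 if SDA is still low after
 * I2C_RECOVERY_MAX_PULSES pulses. */
static int i2c_recover_bus(const struct i2c_pins *p)
{
    int pulses;

    for (pulses = 0; pulses < I2C_RECOVERY_MAX_PULSES; pulses++) {
        if (p->get_sda(p->ctx))
            return pulses;
        p->set_scl(p->ctx, false);
        p->delay_half_period(p->ctx);
        p->set_scl(p->ctx, true);
        p->delay_half_period(p->ctx);
    }
    return p->get_sda(p->ctx) ? I2C_RECOVERY_MAX_PULSES : -1;
}

/* Simulated slave that holds SDA low until it has seen a given number
 * of rising SCL edges. Only here so the sketch can run without hardware. */
struct fake_slave {
    int edges_needed;
    int edges_seen;
    bool scl;
};

static void fake_set_scl(void *ctx, bool level)
{
    struct fake_slave *s = ctx;

    if (level && !s->scl)
        s->edges_seen++; /* count rising edges only */
    s->scl = level;
}

static bool fake_get_sda(void *ctx)
{
    const struct fake_slave *s = ctx;

    return s->edges_seen >= s->edges_needed;
}

static void fake_delay(void *ctx)
{
    (void)ctx; /* no real timing in the simulation */
}
```

This only helps when a slave is holding SDA. In the case described here the lines are released the moment the pads are remuxed, so the loop would see SDA high immediately and return 0 without pulsing, which is consistent with the master being the one holding the bus.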

Even if I reset the whole controller (from the SoC syscontrol), it does not work again until the system has fully resumed. Queuing new transactions before the system is suspended only makes the controller time out again. This makes me wonder: what other part of the system gets suspended that makes the I2C controller malfunction? And why does it not always happen? Is the suspend sequence not executed the same way every time (e.g. the order in which devices are suspended)?

Questions:

- If I call enable_irq_wake() on an IRQ, the IRQ should remain enabled even while the system is suspended. Will the kernel ensure that all parent devices are awake before it invokes the device's interrupt handler to serve a wake-up IRQ? When I put printk's in the kernel suspend code, it looks to me like the ISR is called when more or less everything else is already suspended or turned off.

- I've tried to modify the I2C controller driver so that it never goes to sleep, just as an experiment. I just set the PM ops to NULL and changed the request_irq flags to IRQF_NO_SUSPEND; is this sufficient to prevent the device from going to sleep?

If anyone has ideas on how to debug this issue, I'd greatly appreciate it.


Best regards, Magnus.


