Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi, On Wed, Jun 12, 2019 at 06:55:28AM +0100, Maciej W. Rozycki wrote: > On Tue, 11 Jun 2019, Aaro Koskinen wrote: > > > However, with your patch the "nobody cared" is never reached so all is > > good. I tried 10 boots with the patch and all were successful. Without > > the patch 8 out of 10 failed with the "nobody cared" warning. > > I wouldn't call it "good", just less obvious or painful. This is still > causing wasted CPU cycles that are used for taking the phantom interrupts. > > There is clearly a completion barrier missing somewhere that causes the > interrupt request to linger beyond the point interrupts are reenabled at > the CPU. > > One way to attempt to narrow it down might be taking a backtrace from > where IRQ 14 is found to be spurious. This would indicate the offending > interrupt unmask action. E.g. I see no explicit completion barrier The first spurious IRQ is right after the driver registers: [4.732000] [] show_stack+0x90/0x140 [4.732000] [] ata_bmdma_interrupt+0x2b4/0x39c [4.732000] [] __handle_irq_event_percpu+0xb0/0x178 [4.732000] [] handle_irq_event_percpu+0x34/0x9c [4.732000] [] handle_irq_event+0x3c/0x74 [4.732000] [] handle_level_irq+0x118/0x154 [4.732000] [] generic_handle_irq+0x34/0x50 [4.732000] [] do_IRQ+0x18/0x24 [4.732000] [] handle_int+0x17c/0x188 [4.732000] [] arch_local_irq_restore+0x18/0x30 [4.732000] [] __setup_irq+0x660/0x7a0 [4.732000] [] request_threaded_irq+0x114/0x19c [4.732000] [] devm_request_threaded_irq+0xa0/0x10c [4.732000] [] ata_pci_sff_activate_host+0x1c0/0x274 [4.732000] [] ata_pci_init_one+0x170/0x1c4 [4.732000] [] cs5536_init_one+0x94/0xb8 and the following ones do not seem to provide much info as I can only see the IRQ stack: [4.736000] [] show_stack+0x90/0x140 [4.736000] [] ata_bmdma_interrupt+0x2b4/0x39c [4.736000] [] __handle_irq_event_percpu+0xb0/0x178 [4.736000] [] handle_irq_event_percpu+0x34/0x9c [4.736000] [] handle_irq_event+0x3c/0x74 [4.736000] [] handle_level_irq+0x118/0x154 [4.736000] [] generic_handle_irq+0x34/0x50 [4.736000] [] do_IRQ+0x18/0x24 [4.736000] [] handle_int+0x17c/0x188 [4.736000] [] irq_exit+0x68/0xcc > between the final `outb' in `mask_and_ack_8259A' and the following call to > `raw_spin_unlock_irqrestore', which are obviously otherwise unordered WRT > each other (because `outb' is I/O or MMIO and `raw_spin_unlock_irqrestore' > is contained within the CPU on UP). I can see provisions however for > issuing an architecture-specific barrier in `do_raw_spin_unlock', which is > the workhorse for `raw_spin_unlock_irqrestore', so maybe this is the place > to look into? > > Also how's IRQ 14 registered as indicated by /proc/interrupts? Not sure what you mean but here's the output: $ cat /proc/interrupts CPU0 2: 0XT-PIC 2 cascade 3: 20XT-PIC 3 ttyS0 5: 543358XT-PIC 5 timer 11: 0XT-PIC 11 ehci_hcd:usb1, ohci_hcd:usb2 14: 10XT-PIC 14 pata_cs5536 18: 0 MIPS 2 cascade 22: 0 MIPS 6 cascade 36: 3052 bonito_irq eth0 ERR: 0 A.
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Tue, 11 Jun 2019, Aaro Koskinen wrote: > However, with your patch the "nobody cared" is never reached so all is > good. I tried 10 boots with the patch and all were successful. Without > the patch 8 out of 10 failed with the "nobody cared" warning. I wouldn't call it "good", just less obvious or painful. This is still causing wasted CPU cycles that are used for taking the phantom interrupts. There is clearly a completion barrier missing somewhere that causes the interrupt request to linger beyond the point interrupts are reenabled at the CPU. One way to attempt to narrow it down might be taking a backtrace from where IRQ 14 is found to be spurious. This would indicate the offending interrupt unmask action. E.g. I see no explicit completion barrier between the final `outb' in `mask_and_ack_8259A' and the following call to `raw_spin_unlock_irqrestore', which are obviously otherwise unordered WRT each other (because `outb' is I/O or MMIO and `raw_spin_unlock_irqrestore' is contained within the CPU on UP). I can see provisions however for issuing an architecture-specific barrier in `do_raw_spin_unlock', which is the workhorse for `raw_spin_unlock_irqrestore', so maybe this is the place to look into? Also how's IRQ 14 registered as indicated by /proc/interrupts? Maciej
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi, On Sun, May 26, 2019 at 06:19:00AM -0300, Alexandre Oliva wrote: > On Mar 8, 2019, "Maciej W. Rozycki" wrote: > > > Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system? > > No, it expands to `barrier' on the yeeloong: > > CONFIG_CPU_LOONGSON2F=y > CONFIG_CPU_LOONGSON2F_WORKAROUNDS=y > CONFIG_CPU_LOONGSON2=y > CONFIG_SYS_HAS_CPU_LOONGSON2F=y > > > I've finally managed to do the bisection on object files I mentioned I'd > do to try to pinpoint where __BUILT_IOPORT_PFX with barrier rather than > !barrier regressed. > > I found that forcing barrier off for drivers/irqchip/irq-i8259 was > enough to avoid the problem. > > (I further narrowed it down to byte I/O, which is no surprise > considering irq-i8259 doesn't seem to use any non-byte I/O.) > > Then I narrowed it down to output only. > > A Loongson2F kernel built with the patch below works at normal speed. > I've also keyed the -1 barrier selector to compiling the irq-i8259 > driver only, and that got me a functional kernel, but I'm not confident > that the same issues that affect the interrupt controller, preventing it > from initializing correctly, is not also affecting other drivers, just > in less visible ways, so the patch conservatively reverts to the older > barriers for all I/O (i.e., non-mem) out primitives. > > I've tested this on a yeeloong on top of v5.1.5. > > I'm tempted to start using this patch in my Freeloong builds of GNU > Linux-libre for gnewsense/yeeloong of 5.0 and 5.1 stable releases, to > try to make them usable. Can anyone suggest any reason why it might be > risky to do so, moving on as much as I could to the new barriers, > sticking to the 4.19-one only for non-mem out? As in, could mixing the > barriers be riskier than reverting to the 4.19 barriers everywhere? > > Thanks in advance for any insights and recommendations, Thanks for this work! I'm not yet sure if this completely solves the issue, but it's surely an improvement. Testing multiple reboots on Loongson Mini-PC with the libata driver, /proc/irq/14/spurious still shows tens of thousands spurious interrupts, e.g. count 79453 unhandled 76998 last_unhandled 4294673096 ms where as it with the legacy IDE driver it stays reliably at 0 and the total count is well under 1. However, with your patch the "nobody cared" is never reached so all is good. I tried 10 boots with the patch and all were successful. Without the patch 8 out of 10 failed with the "nobody cared" warning. A. > diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h > index 845fbbc7a2e3..04be4758d4ff 100644 > --- a/arch/mips/include/asm/io.h > +++ b/arch/mips/include/asm/io.h > @@ -416,7 +416,7 @@ static inline void pfx##out##bwlq##p(type val, unsigned > long port)\ > volatile type *__addr; \ > type __val; \ > \ > - if (barrier)\ > + if (barrier > 0)\ > iobarrier_rw(); \ > else\ > war_io_reorder_wmb(); \ > @@ -467,13 +467,22 @@ BUILDIO_MEM(w, u16) > BUILDIO_MEM(l, u32) > BUILDIO_MEM(q, u64) > > -#define __BUILD_IOPORT_PFX(bus, bwlq, type) \ > - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \ > - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p) > +#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \ > + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \ > + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p) > + > +/* Choose the kind of barrier used for out in __BUILD_IOPORT_SINGLE in > + non-__mem_ variants. On Loongson2F, irq-i8259 fails to initialize > + when this is defined to 1. */ > +#if defined(CONFIG_CPU_LOONGSON2) > +#define USE_IO_BARRIER_FOR_NON_MEM_OUT -1 > +#else > +#define USE_IO_BARRIER_FOR_NON_MEM_OUT 1 > +#endif > > #define BUILDIO_IOPORT(bwlq, type) \ > - __BUILD_IOPORT_PFX(, bwlq, type)\ > - __BUILD_IOPORT_PFX(__mem_, bwlq, type) > + __BUILD_IOPORT_PFX(, bwlq, type, USE_IO_BARRIER_FOR_NON_MEM_OUT) \ > + __BUILD_IOPORT_PFX(__mem_, bwlq, type, 2) > > BUILDIO_IOPORT(b, u8) > BUILDIO_IOPORT(w, u16) > > > -- > Alexandre Oliva, freedom fighter he/him https://FSFLA.org/blogs/lxo > Be the change, be Free! FSF Latin America board member > GNU Toolchain EngineerFree Software Evangelist > Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Mar 8, 2019, "Maciej W. Rozycki" wrote: > Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system? No, it expands to `barrier' on the yeeloong: CONFIG_CPU_LOONGSON2F=y CONFIG_CPU_LOONGSON2F_WORKAROUNDS=y CONFIG_CPU_LOONGSON2=y CONFIG_SYS_HAS_CPU_LOONGSON2F=y I've finally managed to do the bisection on object files I mentioned I'd do to try to pinpoint where __BUILT_IOPORT_PFX with barrier rather than !barrier regressed. I found that forcing barrier off for drivers/irqchip/irq-i8259 was enough to avoid the problem. (I further narrowed it down to byte I/O, which is no surprise considering irq-i8259 doesn't seem to use any non-byte I/O.) Then I narrowed it down to output only. A Loongson2F kernel built with the patch below works at normal speed. I've also keyed the -1 barrier selector to compiling the irq-i8259 driver only, and that got me a functional kernel, but I'm not confident that the same issues that affect the interrupt controller, preventing it from initializing correctly, is not also affecting other drivers, just in less visible ways, so the patch conservatively reverts to the older barriers for all I/O (i.e., non-mem) out primitives. I've tested this on a yeeloong on top of v5.1.5. I'm tempted to start using this patch in my Freeloong builds of GNU Linux-libre for gnewsense/yeeloong of 5.0 and 5.1 stable releases, to try to make them usable. Can anyone suggest any reason why it might be risky to do so, moving on as much as I could to the new barriers, sticking to the 4.19-one only for non-mem out? As in, could mixing the barriers be riskier than reverting to the 4.19 barriers everywhere? Thanks in advance for any insights and recommendations, diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h index 845fbbc7a2e3..04be4758d4ff 100644 --- a/arch/mips/include/asm/io.h +++ b/arch/mips/include/asm/io.h @@ -416,7 +416,7 @@ static inline void pfx##out##bwlq##p(type val, unsigned long port) \ volatile type *__addr; \ type __val; \ \ - if (barrier)\ + if (barrier > 0)\ iobarrier_rw(); \ else\ war_io_reorder_wmb(); \ @@ -467,13 +467,22 @@ BUILDIO_MEM(w, u16) BUILDIO_MEM(l, u32) BUILDIO_MEM(q, u64) -#define __BUILD_IOPORT_PFX(bus, bwlq, type)\ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p) +#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \ + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \ + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p) + +/* Choose the kind of barrier used for out in __BUILD_IOPORT_SINGLE in + non-__mem_ variants. On Loongson2F, irq-i8259 fails to initialize + when this is defined to 1. */ +#if defined(CONFIG_CPU_LOONGSON2) +#define USE_IO_BARRIER_FOR_NON_MEM_OUT -1 +#else +#define USE_IO_BARRIER_FOR_NON_MEM_OUT 1 +#endif #define BUILDIO_IOPORT(bwlq, type) \ - __BUILD_IOPORT_PFX(, bwlq, type)\ - __BUILD_IOPORT_PFX(__mem_, bwlq, type) + __BUILD_IOPORT_PFX(, bwlq, type, USE_IO_BARRIER_FOR_NON_MEM_OUT) \ + __BUILD_IOPORT_PFX(__mem_, bwlq, type, 2) BUILDIO_IOPORT(b, u8) BUILDIO_IOPORT(w, u16) -- Alexandre Oliva, freedom fighter he/him https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Thu, 7 Mar 2019, Alexandre Oliva wrote: > > Does your platform use `war_io_reorder_wmb'? > > Err... I'm not sure I understand your question. > > It uses it in __BUILD_IOPORT_SINGLE within the expanded out function, > given !barrier, but you already knew that. > > Did you mean to ask what war_io_reorder_wmb expand to, or whether there > are other uses of war_io_reorder_wmb, or what? Umm, my question was ill-formed, sorry. There's a convoluted history recorded for that macro in the repo and it could be that it can be removed now that we have proper barriers in place, except possibly to avoid the Octeon Errata Core-301 (whatever that is). Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system? Maciej
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Mar 7, 2019, "Maciej W. Rozycki" wrote: > So this seems backwards to me, port I/O is supposed to be strongly > ordered, so if removing the ordering guarantee "fixes" your problem, then > there must be a second bottom here. Well, it partially restores the earlier state, so the actual bug goes back to latent. We just have to find out what it is. My plan, in the next time slot I can devote to this investigation, is to try to bisect source files that use these interfaces to identify a minimal set that can switch to the new barriers without breaking anything, then report back after failing to make sense of it myself ;-) So, yeah, several more bottoms to dig through. > Does your platform use `war_io_reorder_wmb'? Err... I'm not sure I understand your question. It uses it in __BUILD_IOPORT_SINGLE within the expanded out function, given !barrier, but you already knew that. Did you mean to ask what war_io_reorder_wmb expand to, or whether there are other uses of war_io_reorder_wmb, or what? -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Mar 7, 2019, Aaro Koskinen wrote: > Hi, > On Thu, Mar 07, 2019 at 03:41:01AM -0300, Alexandre Oliva wrote: >> On Feb 17, 2019, "Maciej W. Rozycki" wrote: >> >> > Is there an MMIO completion barrier missing there somewhere by any chance >> > causing an IRQ that has been handled already to be redelivered because an >> > MMIO write meant to clear the IRQ at its origin at handler's completion >> > has not reached its destination before interrupts have been reenabled in >> > the issuing CPU? Just a thought. >> >> I've finally got a chance to bisect the IRQ14 (nobody cared) regression >> on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO >> accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f). > This is interesting, thanks for the research, but I'm afraid it doesn't seem > to be the root cause. Oh, I didn't mean to offer anything definitive, just to report the findings of my investigation so far, since I had no clue when I'd find another significant time slot to get back onto it. > Could you check your /proc/interrupts counters after the boot with > your change? 16k to 18k interrupts after booting up into multi-user mode, including some idle time (and possibly anacron jobs) in the case that got more interrupts. That's not unlike what I get with 4.19.26-gnu. -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi, On Thu, Mar 07, 2019 at 03:41:01AM -0300, Alexandre Oliva wrote: > On Feb 17, 2019, "Maciej W. Rozycki" wrote: > > > Is there an MMIO completion barrier missing there somewhere by any chance > > causing an IRQ that has been handled already to be redelivered because an > > MMIO write meant to clear the IRQ at its origin at handler's completion > > has not reached its destination before interrupts have been reenabled in > > the issuing CPU? Just a thought. > > I've finally got a chance to bisect the IRQ14 (nobody cared) regression > on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO > accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f). This is interesting, thanks for the research, but I'm afraid it doesn't seem to be the root cause. While your patch seems to help also on Mini-PC (only briefly tested with a dozen of reboots), there's still excessive amount of interrupts during the boot - I'm getting something like 5-7, while with the old IDE driver it's around 2500. ~ # cat /proc/irq/14/spurious count 57805 unhandled 57799 last_unhandled 4294673092 ms ~ # cat /proc/interrupts CPU0 2: 0XT-PIC 2 cascade 3:174XT-PIC 3 ttyS0 5: 14653XT-PIC 5 timer 11: 0XT-PIC 11 ehci_hcd:usb1, ohci_hcd:usb2 14: 57805XT-PIC 14 pata_amd 15: 0XT-PIC 15 pata_amd 18: 0 MIPS 2 cascade 22: 0 MIPS 6 cascade ERR: 0 Could you check your /proc/interrupts counters after the boot with your change? A.
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi Alexandre, I'm away on holiday and also connectivity is so-so here, so just a quick reply. > > Is there an MMIO completion barrier missing there somewhere by any chance > > causing an IRQ that has been handled already to be redelivered because an > > MMIO write meant to clear the IRQ at its origin at handler's completion > > has not reached its destination before interrupts have been reenabled in > > the issuing CPU? Just a thought. > > I've finally got a chance to bisect the IRQ14 (nobody cared) regression > on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO > accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f). Thanks for looking into it. > I've only just started trying to figure out what exactly in the change > leads to problems. So far, I've determined that changing both uses of > __BUILD_IOPORT_SINGLE so that barrier is passed as 0 rather than 1 > removes the undesirable effects, both on top of that patch, and on top > of v5.0: > > #define __BUILD_IOPORT_PFX(bus, bwlq, type) \ > - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \ > - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p) > + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0,) \ > + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0, _p) So this seems backwards to me, port I/O is supposed to be strongly ordered, so if removing the ordering guarantee "fixes" your problem, then there must be a second bottom here. Offhand either there is a race condition somewhere which the lack of ordering here covers somehow, or there is a silicon erratum of some sort somewhere that the SYNC instruction triggers. A further investigation is required I'm afraid. Does your platform use `war_io_reorder_wmb'? Maciej
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 17, 2019, "Maciej W. Rozycki" wrote: > Is there an MMIO completion barrier missing there somewhere by any chance > causing an IRQ that has been handled already to be redelivered because an > MMIO write meant to clear the IRQ at its origin at handler's completion > has not reached its destination before interrupts have been reenabled in > the issuing CPU? Just a thought. I've finally got a chance to bisect the IRQ14 (nobody cared) regression on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f). I've only just started trying to figure out what exactly in the change leads to problems. So far, I've determined that changing both uses of __BUILD_IOPORT_SINGLE so that barrier is passed as 0 rather than 1 removes the undesirable effects, both on top of that patch, and on top of v5.0: #define __BUILD_IOPORT_PFX(bus, bwlq, type)\ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p) + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0,) \ + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0, _p) Since the barriers didn't seem to be a problem for __BUILD_MEMORY_PFX, I figured I'd try to enable barriers in the __mem_ variants, but leave them alone for io, and that worked (without hitting the IRQ14 issue) at least on the yeeloong: diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h index 845fbbc7a2e34..0a3a327d4e764 100644 --- a/arch/mips/include/asm/io.h +++ b/arch/mips/include/asm/io.h @@ -467,13 +467,13 @@ BUILDIO_MEM(w, u16) BUILDIO_MEM(l, u32) BUILDIO_MEM(q, u64) -#define __BUILD_IOPORT_PFX(bus, bwlq, type)\ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \ - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p) +#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \ + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \ + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p) #define BUILDIO_IOPORT(bwlq, type) \ - __BUILD_IOPORT_PFX(, bwlq, type)\ - __BUILD_IOPORT_PFX(__mem_, bwlq, type) + __BUILD_IOPORT_PFX(, bwlq, type, 0) \ + __BUILD_IOPORT_PFX(__mem_, bwlq, type, 1) BUILDIO_IOPORT(b, u8) BUILDIO_IOPORT(w, u16) -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Sun, 17 Feb 2019, Alexandre Oliva wrote: > That's a reasonable guess, but I don't think so. I do have PATA_AMD > enabled as a module indeed, but it's not even loaded, much as I can > tell, whereas PATA_CS5536 is built into the kernel image, and dmesg > says: > > [4.46] scsi host0: pata_cs5536 > [4.464000] scsi host1: pata_cs5536 > [4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14 > [4.464000] ata2: DUMMY > [4.464000] pcnet32: [...] > [4.644000] random: [...] > [5.908000] irq 14: nobody cared (try booting with the "irqpoll" option) Is there an MMIO completion barrier missing there somewhere by any chance causing an IRQ that has been handled already to be redelivered because an MMIO write meant to clear the IRQ at its origin at handler's completion has not reached its destination before interrupts have been reenabled in the issuing CPU? Just a thought. Maciej
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 17, 2019, Aaro Koskinen wrote: > I tested few older kernels, and it seems that the spurious IRQ issue has > been always there after switching to libata (commit 7ff7a5b1bfff). It has > been unnoticed as the 10 irq limit wasn't reached during boot. I see, thanks. That would probably make it hard to bisect indeed. >> The kernel still disables irq14 early on, and then runs slow. > This hack works only for CONFIG_PATA_CS5536. You are probably using PATA_AMD. That's a reasonable guess, but I don't think so. I do have PATA_AMD enabled as a module indeed, but it's not even loaded, much as I can tell, whereas PATA_CS5536 is built into the kernel image, and dmesg says: [4.46] scsi host0: pata_cs5536 [4.464000] scsi host1: pata_cs5536 [4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14 [4.464000] ata2: DUMMY [4.464000] pcnet32: [...] [4.644000] random: [...] [5.908000] irq 14: nobody cared (try booting with the "irqpoll" option) Just to be sure, I added some printks to cs5536_noop_freeze, and here's what I got in dmesg instead: [4.452000] pata_cs5536: freeze: checking status... [4.452000] pata_cs5536: freeze: checked, clearing... [4.452000] pata_cs5536: freeze: cleared [4.46] scsi host0: pata_cs5536 [4.464000] scsi host1: pata_cs5536 [4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14 [4.464000] ata2: DUMMY [4.464000] pcnet32: [...] [4.468000] pata_cs5536: freeze: checking status... [4.468000] pata_cs5536: freeze: checked, clearing... [4.468000] pata_cs5536: freeze: cleared [4.652000] random: [...] [5.92] irq 14: nobody cared (try booting with the "irqpoll" option) now, maybe I just don't understand what effects the patch was supposed to have. The system still feels very slow, but I haven't timed anything; could it be that it had the effect of keeping the IRQ functional after all, but 5.0-rc6 is slower than earlier kernels for other reasons? Like, /proc/irq/14/pata_cs5536/ is there, but I haven't checked whether it was there before the patch. Do you suggest any way to tell whether it had the intended effect? -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Sun, Feb 17, 2019 at 01:59:26AM -0300, Alexandre Oliva wrote: > On Feb 11, 2019, Aaro Koskinen wrote: > > > ATA (libata) CS5536 driver is having issues with spurious IRQs and often > > disables IRQs completely during the boot. You should see a warning > > in dmesg. This was the reason for slowness on my FuLoong mini-PC. A > > workaround is to switch to old IDE driver. > > Thanks. I see a NIEN quirk in ide-iops.c that's enabled for the hard > drive model I've got on my yeeloong, but that's not even compiled in my > freeloong builds. I don't see any changes in libata between 4.19 and > 4.20 that could explain the regression either. I tested few older kernels, and it seems that the spurious IRQ issue has been always there after switching to libata (commit 7ff7a5b1bfff). It has been unnoticed as the 10 irq limit wasn't reached during boot. But since libata probe is asynchronous some other kernel thread may delay it, and apparently some change in recent kernels adds enough delay to make the IRQ count to hit the limit. > I'm afraid there's no observable change in behavior after installing the > proposed patch at > https://lore.kernel.org/linux-mips/20190106124607.gk27...@darkstar.musicnaut.iki.fi/ > > The kernel still disables irq14 early on, and then runs slow. This hack works only for CONFIG_PATA_CS5536. You are probably using PATA_AMD. A.
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 11, 2019, Aaro Koskinen wrote: > ATA (libata) CS5536 driver is having issues with spurious IRQs and often > disables IRQs completely during the boot. You should see a warning > in dmesg. This was the reason for slowness on my FuLoong mini-PC. A > workaround is to switch to old IDE driver. Thanks. I see a NIEN quirk in ide-iops.c that's enabled for the hard drive model I've got on my yeeloong, but that's not even compiled in my freeloong builds. I don't see any changes in libata between 4.19 and 4.20 that could explain the regression either. I'm afraid there's no observable change in behavior after installing the proposed patch at https://lore.kernel.org/linux-mips/20190106124607.gk27...@darkstar.musicnaut.iki.fi/ The kernel still disables irq14 early on, and then runs slow. Tom, why do you say bisecting this is impossible? I realize you wrote you did so for 24 hours non-stop, but... I'm curious as to what obstacles you ran into. It's such a reproducible problem for me that I can't see how bisecting it might be difficult. Or were by any chance you talking about the reboot/shutdown problem then? -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 11, 2019, Tom Li wrote: > We've just identified and confirmed the source of the shutdown problem a > few days ago on this mailing list. > You can pick up the patch from: > https://lore.kernel.org/lkml/20190207205812.ga11...@darkstar.musicnaut.iki.fi/ > A patch has been authored and submitted by Aaro Koskinen, but currently it > seems stuck in the mailing list because the maintainer worries about > regression > and asks for testing on more MIPS systems, although we believe it's a trivial > patch. Thanks. I've just brought it into my local copies of loongson-community branches for 4.4, 4.9, 4.14, 4.19, and 4.20, and master, after verifying that it enables at least 4.19, 4.20 and master to reboot without a power cycle, and I'll include it in upcoming freeloong builds, except for 3.16 and 3.18, where it doesn't apply. > In addition, I've discovered and fixed another bug preventing the machine from > shutting down, this patch has already merged into the Linus tree. > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a96669d77897ff3613157bf43f875739205d66d Hmm, I don't think I've ever hit this one. Do you happen to know how far back it might be needed? > I'll continue working on upstreaming these out-of-tree drivers as my personal > project. I hope you'll be able to use a fully-functional machine with the > mainline > kernel soon, my current target is Linux 5.3. Thanks! -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 11, 2019, Aaro Koskinen wrote: > ATA (libata) CS5536 driver is having issues with spurious IRQs and often > disables IRQs completely during the boot. You should see a warning > in dmesg. Yup, thanks, it shows up first thing during boot. I hadn't seen that one in a while. Thanks. -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
> > > > That doesn't seem to explain even a quiet boot up taking several times > > longer than 4.19, and package installation over an ethernet connection > > (thus not using the console) also taking several times longer. > > Maybe it's a slow disk? > > ATA (libata) CS5536 driver is having issues with spurious IRQs and often > disables IRQs completely during the boot. You should see a warning > in dmesg. This was the reason for slowness on my FuLoong mini-PC. A > workaround is to switch to old IDE driver. > > Discussion: > https://lore.kernel.org/linux-mips/20190106124607.gk27...@darkstar.musicnaut.iki.fi/ > > A. I find that this problem is odd. I've been using libata personally since 2014 and never had any IRQ problem at all. How can the problem even occur at all? Perhaps related to a rarely used kernel option that trigger it, or a bug in the PMON boot firmware? I don't know... But I think some versions of the PMON boot firmware have odd bugs. I'm currently using the PMON firmware built by an OpenBSD developer, but I cannot find the link anymore. Tom Li signature.asc Description: Digital signature
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi, On Mon, Feb 11, 2019 at 08:38:00PM -0200, Alexandre Oliva wrote: > On Feb 11, 2019, Tom Li wrote: > >> and, more recently, a very slow system overall, that's been present since > >> 4.20. > > > The mainline framebuffer driver doesn't have any hardware drawing, printing > > even a single line on the console is required a full screen redraw via > > memory- > > That doesn't seem to explain even a quiet boot up taking several times > longer than 4.19, and package installation over an ethernet connection > (thus not using the console) also taking several times longer. Maybe it's a slow disk? ATA (libata) CS5536 driver is having issues with spurious IRQs and often disables IRQs completely during the boot. You should see a warning in dmesg. This was the reason for slowness on my FuLoong mini-PC. A workaround is to switch to old IDE driver. Discussion: https://lore.kernel.org/linux-mips/20190106124607.gk27...@darkstar.musicnaut.iki.fi/ A.
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 11, 2019, Tom Li wrote: > What is the current link to your repository? git://dev.lemote.com/linux-loongson-community.git I use a different git URL to push stuff, so I wouldn't have noticed if the above was gone, but I've just tried to git clone form the above and it seems to remain functional. > We've just identified and confirmed the source of the shutdown problem a > few days ago on this mailing list. It's related to a PREEMPT race condition > in the cpufreq framework that triggers a timing-dependent, probabilistic > lockup in the i8259 PIC driver. And it was first noticed by Ville Syrjälä > on a x86 system, my debugging rediscovered his findings. Oooh, thanks for the links. >> and, more recently, a very slow system overall, that's been present since >> 4.20. > The mainline framebuffer driver doesn't have any hardware drawing, printing > even a single line on the console is required a full screen redraw via memory- That doesn't seem to explain even a quiet boot up taking several times longer than 4.19, and package installation over an ethernet connection (thus not using the console) also taking several times longer. -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Mon, Feb 11, 2019 at 10:13:00AM -0200, Alexandre Oliva wrote: > On Feb 8, 2019, Tom Li wrote: > > > found Alexandre Oliva has stopped maintaining his tree > > ?!? > > I still merge and tag every one of Torvalds' and Greg KH's releases into > the loongson-community tree, resolving trivial conflicts and trying to > verify that it at least builds and passes a smoke test on actual > hardware (after Linux-libre deblobbing, but still). What is the current link to your repository? The original one from Lemote is now a 404, and the GitHub one has been "archived". > Two issues that bother me a bit are frequent failure to reboot/poweroff, > that's been around since around 4.10, and, more recently, a very slow > system overall, that's been present since 4.20. I haven't checked > whether they're caused by changes in the loongson-community tree that I > still carry, or if they're already present in upstream releases. > > I've already tried to bisect the former, but since the issue is > intermittent, that proved to be too tricky and time-consuming for me. > > I haven't tried to bisect the latter yet. You cannot bisect it. Because I've tried it for 24 hours without sleep, but failed. ;-) We've just identified and confirmed the source of the shutdown problem a few days ago on this mailing list. It's related to a PREEMPT race condition in the cpufreq framework that triggers a timing-dependent, probabilistic lockup in the i8259 PIC driver. And it was first noticed by Ville Syrjälä on a x86 system, my debugging rediscovered his findings. You can read the full investigation at here: https://lkml.org/lkml/2018/11/13/857 You can pick up the patch from: https://lore.kernel.org/lkml/20190207205812.ga11...@darkstar.musicnaut.iki.fi/ A patch has been authored and submitted by Aaro Koskinen, but currently it seems stuck in the mailing list because the maintainer worries about regression and asks for testing on more MIPS systems, although we believe it's a trivial patch. In addition, I've discovered and fixed another bug preventing the machine from shutting down, this patch has already merged into the Linus tree. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a96669d77897ff3613157bf43f875739205d66d > and, more recently, a very slow system overall, that's been present since > 4.20. The mainline framebuffer driver doesn't have any hardware drawing, printing even a single line on the console is required a full screen redraw via memory- mapped I/O over the slow PCI interface. I have rebased the 2D acceleration code against the latest upstream driver and sent the patch to the fbdev mailing list on February 2nd. But the maintainers seem to be working on other tasks and the patches (including mine and others') on the list didn't receive any response. It's avaliable from https://patchwork.kernel.org/project/linux-fbdev/list/?series=75029 > I've considered leaving the loongson-community tree behind and using > upstream releases, but every time I was about to do that, something else > came up that led me to keep at it. I think the last such occurrence was > the removal of the video driver (later reintroduced as a new driver). > So I've kept at it ;-) I'll continue working on upstreaming these out-of-tree drivers as my personal project. I hope you'll be able to use a fully-functional machine with the mainline kernel soon, my current target is Linux 5.3. Cheers, Tom Li signature.asc Description: PGP signature
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
On Feb 8, 2019, Tom Li wrote: > found Alexandre Oliva has stopped maintaining his tree ?!? I still merge and tag every one of Torvalds' and Greg KH's releases into the loongson-community tree, resolving trivial conflicts and trying to verify that it at least builds and passes a smoke test on actual hardware (after Linux-libre deblobbing, but still). Occasionally I notice problems and I try to investigate through bisecting, but, not being a real Linux developer, I seldom get much further than that. Two issues that bother me a bit are frequent failure to reboot/poweroff, that's been around since around 4.10, and, more recently, a very slow system overall, that's been present since 4.20. I haven't checked whether they're caused by changes in the loongson-community tree that I still carry, or if they're already present in upstream releases. I've already tried to bisect the former, but since the issue is intermittent, that proved to be too tricky and time-consuming for me. I haven't tried to bisect the latter yet. I've considered leaving the loongson-community tree behind and using upstream releases, but every time I was about to do that, something else came up that led me to keep at it. I think the last such occurrence was the removal of the video driver (later reintroduced as a new driver). So I've kept at it ;-) -- Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo Be the change, be Free! FSF Latin America board member GNU Toolchain EngineerFree Software Evangelist Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi Tom, On Sat, Feb 09, 2019 at 06:11:33PM +0800, Tom Li wrote: > > To address the particular quote you give from Dmitry Torokhov on the > > yeeloong_hotkey driver - just because the driver as-is includes a bunch > > of non-input related things doesn't mean that it should or has to. From > > a look at the 2009 submission it seems to mix a bunch of policy into the > > kernel which really ought to be elsewhere. Generally the input driver > > reports that a key was pressed & something in userland decides what to > > do with it, whereas this driver seems to attempt to bypass that & prod > > at unrelated hardware all by itself. > > Sure, the hotkey driver has some problems in its current shape. I think > the existing code makes some hotkeys on the keyboard behave like a hardware > switch to order to implement rfkill hardblock, and also controls the video > output switch. I think I need to investigate it further. >% > I find reorganization of the current Yeeloong platform driver is relatively > easy, since it only involves one machine, I'm already working on it. > > If future developers find it's difficult to support new machines, we can > simply > have more discussion, reorganize the existing hierarchy further, and make > incremental changes. Thanks for working on it, I look forward to seeing your patches :) Paul
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
> To address the particular quote you give from Dmitry Torokhov on the > yeeloong_hotkey driver - just because the driver as-is includes a bunch > of non-input related things doesn't mean that it should or has to. From > a look at the 2009 submission it seems to mix a bunch of policy into the > kernel which really ought to be elsewhere. Generally the input driver > reports that a key was pressed & something in userland decides what to > do with it, whereas this driver seems to attempt to bypass that & prod > at unrelated hardware all by itself. Sure, the hotkey driver has some problems in its current shape. I think the existing code makes some hotkeys on the keyboard behave like a hardware switch to order to implement rfkill hardblock, and also controls the video output switch. I think I need to investigate it further. > Personally I see no reason for that hwmon driver to live under > drivers/platform/mips rather than drivers/hwmon. All that does is bypass > the attention of the drivers/hwmon/ maintainers who would be best placed > to offer input & ensure the driver is actually any good. Yes. I think if our conclusion is "drivers/platform/mips is not a good idea", it should be moved to the appropriate category in the future. > I think that question should prompt another - if we have maintainers for > various driver subsystems, why not place the drivers under their care in > the already established directories? > > Thanks, > Paul Thanks for your reply, I'm fully agree with your comments about platform drivers. I find reorganization of the current Yeeloong platform driver is relatively easy, since it only involves one machine, I'm already working on it. If future developers find it's difficult to support new machines, we can simply have more discussion, reorganize the existing hierarchy further, and make incremental changes. Cheers, Tom Li Beijing GNU/Linux User Group. signature.asc Description: PGP signature
Re: [RFC] On the Current Troubles of Mainlining Loongson Platform Drivers
Hi Tom, On Fri, Feb 08, 2019 at 04:30:39PM +0800, Tom Li wrote: > Hello, Greg, Ralf, Paul, James, Alexandre and Huacai > > Many years ago when I was still in the middle scheool, I got a Loongson > Yeeloong laptop to explore the world of non-x86 world, as Geert Uytterhoeven > once said, there's lots of Linux beyond IA-32. At that time I've noticed there > was no platform driver for the laptop. I just assumed that nobody bothered > mainlining it and everyone were using Alexandre Oliva's tree. I've diffed > his tree, got the patches, discovered and fixed two minor bugs and submitted > to Alexandre, and called it a day. > > Years later, I stared at my machine, and found Alexandre Oliva has stopped > maintaining his tree, yet there is still no platform driver in the mainline. > So I decided that submitting the Yeeloong driver to the mainline kernel is > the my current personal project of this year. I only work for myself for fun, > not for profit. I see Loongson as an interesting non-x86 platform, but I do > not work for Loongson, do not explictly support them, nor supported by them > in any way. > > Reviewing the previous E-mails and patches at Linux/MIPS mailing list, I've > noticed that, from 2009 to 2019, in a 10-year span, there have been at least > 3 attempts by 3 people to submit the platform driver for Yeeloong laptop to > the Linux/MIPS tree, all implicitly rejected. This is not meant to be a > criticism for maintainers' hard work and high quality standard, but rather, > I think it shows a serious challenge on development of platform drivers that > worth discussing. since there's even more unsubmitted platform drivers for > more devices, especially Loongson3. We need to find a solution both for > short-term and long-term further development of these platform drivers. > > Here, I give a quick summary about the situation, and see what can I/we > do about it. Digging into the mailing list, in the hindsight, it's clear > to see the fate of this platform driver. In December 2009, Wu Zhangin > submitted a refined version of the initial platform driver to the mailing > list. > > His intention was to create a MIPS-equivalent version of the x86 platform > drivers, as he said, > > > I like the folks did under drivers/platform/x86/ ;) which > > will be better to maintain. for all of these drivers are really only > > YeeLoong platform specific ;) > > However, Ralf Baechle commented, that > > > You split old, big driver into several individual drivers - good. > [...] > > Experience has shown that drivers for a particular subsystem are best > > combined in a single menu, in a single directory. Otherwise any changes > > to subsystem's internal APIs will become a major pain. > > Dmitry Torokhov gave another response, > > > Umm, it still mixes up bunch of stuff not directly related to input. > > I'd vote for drivers/platform/mips (since we have a few of kitchen-sink > > style drivers for x86-based laptops in drivers/platform/x86). > > This discussion is frozen without any progress in the following years. > > For example, when the driver was resubmitted to James Hogan in late 2018, > he noted, possibly without knowledge of the discussion in 2009, > > > I can't help thinking it would be better to separate this into separate > > drivers for each part (backlight, power supply etc), and move them into > > the appropriate driver directories (drivers/power/supply, > > drivers/video/backlight etc). That way each part would get proper review > > from the appropriate maintainers (or at least they should be Cc'd). > > > > Is there a particular reason for it to be a single driver? > > So I think it's the time to restart this discussion. Should the driver be > submitted as a platform driver, or be splitted and merged into various > subsystems? I've considered reasons from both sides. Right off the bat, let me express my whole-hearted agreement with James & the question he asked in that quote. If there is a good reason for some things to be one large driver then we can discuss that, but so far I don't believe anybody has given one - nobody replied to James' mail that you quote [1]. It looks to me like in 2009 someone did the work to split the driver up, which shows that it's possible, but the drivers were placed under arch/mips which is not where they belong. They should individually be placed at the right points under drivers/, and submitted to the relevant subsystem maintainers. To address the particular quote you give from Dmitry Torokhov on the yeeloong_hotkey driver - just because the driver as-is includes a bunch of non-input related things doesn't mean that it should or has to. From a look at the 2009 submission it seems to mix a bunch of policy into the kernel which really ought to be elsewhere. Generally the input driver reports that a key was pressed & something in userland decides what to do with it, whereas this driver seems to attempt to bypass that & prod at unrelated hardware all by itself. >