Re: [PATCH v2 00/91] drm/vc4: Support BCM2711 Display Pipeline
On Wed, May 27, 2020 at 5:13 PM Maxime Ripard wrote:
> I'm about to send a v3 today or tomorrow, I can Cc you (and Jian-Hong) if you
> want.

That would be great, although given the potentially inconsistent results
we've been seeing so far, it would also help if you could push a git
branch somewhere. That way we can have higher confidence that we are
applying exactly the same patches to the same base etc.

Thanks
Daniel
Re: [PATCH v2 00/91] drm/vc4: Support BCM2711 Display Pipeline
Hi Maxime,

On Tue, May 26, 2020 at 6:20 PM Maxime Ripard wrote:
> I gave it a try with U-Boot with my latest work and couldn't reproduce it,
> so it seems that I fixed it along the way

Is your latest work available in a git branch anywhere that we could
test directly?

Thanks
Daniel
Re: [PATCH v4 0/3] Replace private domain with per-group default domain
On Wed, May 6, 2020 at 10:03 AM Lu Baolu wrote:
> https://lkml.org/lkml/2020/4/14/616
> [This has been applied in iommu/next.]
>
> Hence, there is no need to keep the private domain implementation
> in the Intel IOMMU driver. This patch series aims to remove it.

I applied these patches on top of Joerg's branch and confirmed that they
fix the issue discussed in the thread:

[PATCH v2] iommu/vt-d: consider real PCI device when checking if mapping is needed

(the patch there is no longer needed)

Tested-by: Daniel Drake

Thanks!
Re: [PATCH] PCI: PM: Fix pci_power_up()
On Mon, Oct 14, 2019 at 7:25 PM Rafael J. Wysocki wrote:
> Since there is no reason for that difference to exist, modify
> pci_power_up() to follow pci_set_power_state() more closely and
> invoke __pci_start_power_transition() from there to call the
> platform firmware to power up the device (in case that's necessary).
>
> Fixes: db288c9c5f9d ("PCI / PM: restore the original behavior of
> pci_set_power_state()")
> Reported-by: Daniel Drake
> Link: https://lore.kernel.org/linux-pm/cad8lp44tyxrmgplkhcqf9hv6smeurmxvmmvmtyfhz6q4se+...@mail.gmail.com/T/#m21be74af263c6a34f36e0fc5c77c5449d9406925
> Signed-off-by: Rafael J. Wysocki
> ---
>
> Daniel, please test this one.

This one is working too, thanks

Daniel
Re: [RFD] x86/tsc: Loosen the requirements for watchdog - (was x86/hpet: Disable HPET on Intel Coffe Lake)
Hi Thomas, On Fri, Aug 30, 2019 at 5:38 AM Thomas Gleixner wrote: > So if we have to disable the HPET on Kaby Lake alltogether unless Intel > comes up with the clever fix, i.e. poking at the right registers, then I > think we should also lift the TSC watchdog restrictions on these machines > if they are single socket, which they are as the affected CPUs so far are > mobile and client types. > > Also given the fact that we get more and more 'reduced' hardware exposed > via ACPI and we already dealt with quite some fallout with various related > issues due to that, I fear we need to bite this bullet anyway anytime soon. Thanks for the explanation here! My experience in this area is basically limited to the clock-related issues that I've sent your way recently, so I don't have deep wisdom to draw upon, but what you wrote here makes sense to me. If you can outline a testing procedure, we can test upcoming patches on Coffee Lake and Kaby Lake consumer laptops. Thanks, Daniel
Re: Early EFI-related boot freeze in parse_setup_data()
On Fri, Aug 16, 2019 at 2:14 PM Daniel Drake wrote:
> Anyway, the system freeze occurs in parse_setup_data(), specifically:
>
>     data = early_memremap(pa_data, sizeof(*data));
>     data_len = data->len + sizeof(struct setup_data);
>
> Dereferencing data->len causes the system to hang. I presume it
> triggers an exception handler due to some kind of invalid memory
> access.
>
> By returning early in that function, boot continues basically fine. So
> I could then log the details: pa_data has value 0x892bb018 and
> early_memremap returns address 0xff200018. Accessing just a
> single byte at that address causes the system hang.

I noticed a complaint about NX in the logs, right where it does the
early_memremap of this data (which is now at address 0x893c0018):

Notice: NX (Execute Disable) protection missing in CPU!
e820: update [mem 0x893c0018-0x893cec57] usable ==> usable
e820: update [mem 0x893c0018-0x893cec57] usable ==> usable
e820: update [mem 0x893b3018-0x893bf057] usable ==> usable
e820: update [mem 0x893b3018-0x893bf057] usable ==> usable

Indeed, in the BIOS setup menu, "NX Mode" was Disabled. Setting it to
Enabled avoids the hang and Linux boots as normal. Weird!

Daniel
Re: [PATCH v2 1/2] ASoC: es8316: fix headphone mixer volume table
On Mon, Aug 26, 2019 at 11:39 PM Katsuhiro Suzuki wrote:
>
> This patch fix setting table of Headphone mixer volume.
> Current code uses 4 ... 7 values but these values are prohibited.
>
> Correct settings are the following:
> 0000  -12dB
> 0001  -10.5dB
> 0010  -9dB
> 0011  -7.5dB
> 0100  -6dB
> 1000  -4.5dB
> 1001  -3dB
> 1010  -1.5dB
> 1011  0dB
>
> Signed-off-by: Katsuhiro Suzuki

Reviewed-by: Daniel Drake
Re: [PATCH v2 2/2] ASoC: es8316: fix inverted L/R of headphone mixer volume
On Mon, Aug 26, 2019 at 11:39 PM Katsuhiro Suzuki wrote:
>
> This patch fixes inverted Left-Right channel of headphone mixer
> volume by wrong shift_left, shift_right values.
>
> Signed-off-by: Katsuhiro Suzuki

Agrees with the spec.

Reviewed-by: Daniel Drake
Re: [PATCH] ASoC: es8316: limit headphone mixer volume
On Mon, Aug 26, 2019 at 1:38 AM Hans de Goede wrote:
> On 24-08-19 23:04, Katsuhiro Suzuki wrote:
> > This patch limits Headphone mixer volume to 4 from 7.
> > Because output sound suddenly becomes very loudly with many noise if
> > set volume over 4.

That sounds like something that should be limited in UCM.

> Higher then 4 not working matches my experience, see this comment from
> the UCM file: alsa-lib/src/conf/ucm/codecs/es8316/EnableSeq.conf :
>
> # Set HP mixer vol to -6 dB (4/7) louder does not work
> cset "name='Headphone Mixer Volume' 4"

What does "does not work" mean more precisely?

I checked the spec, and there is indeed something wrong in the kernel
driver here. The dB scale is not the simple scale that the kernel source
suggests. Instead it is:

0000 – -12dB
0001 – -10.5dB
0010 – -9dB
0011 – -7.5dB
0100 – -6dB
1000 – -4.5dB
1001 – -3dB
1010 – -1.5dB
1011 – 0dB

So perhaps we can fix the kernel to follow this table and then use UCM
to limit the volume if it's too high on a given platform?

Thanks
Daniel
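The table above amounts to two 1.5 dB-step ranges with a hole at register values 5-7. As a standalone sketch of that mapping (illustration only, with a made-up helper name - an actual kernel fix would express this as an ASoC TLV range table rather than open-coded C):

```c
#include <assert.h>

/*
 * Sketch of the ES8316 headphone mixer gain table discussed above.
 * Maps the 4-bit register value to a gain in hundredths of a dB
 * (centi-dB) via *cdb. Returns 0 on success, -1 for the prohibited
 * values 5..7 (and anything above 11).
 */
static int es8316_hpmixer_cdb(unsigned int reg, int *cdb)
{
	if (reg <= 4) {
		/* 0000..0100: -12 dB up to -6 dB in 1.5 dB steps */
		*cdb = -1200 + (int)reg * 150;
		return 0;
	}
	if (reg >= 8 && reg <= 11) {
		/* 1000..1011: -4.5 dB up to 0 dB in 1.5 dB steps */
		*cdb = -450 + (int)(reg - 8) * 150;
		return 0;
	}
	return -1; /* 0101..0111 are prohibited per the table */
}
```

This makes the discontinuity explicit: 4 (-6 dB) is followed by 8 (-4.5 dB), which is why a driver that treats 0..7 as a linear scale misbehaves above 4.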
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On Fri, Aug 23, 2019 at 9:54 AM ndrw wrote: > That's obviously a lot better than hard freezes but I wouldn't call such > system lock-ups an excellent result. PSI-triggered OOM killer would have > indeed been very useful as an emergency brake, and IMHO such mechanism > should be built in the kernel and enabled by default. But in my > experience it does a very poor job at detecting imminent freezes on > systems without swap or with very fast swap (zram). Perhaps you could share your precise test environment and the PSI condition you are expecting to hit (that is not being hit). Except for the single failure report mentioned, it's been working fine here in all setups, including with zram which is shipped out of the box. The nice thing about psi is that it's based on how much real-world time the kernel is spending doing memory management. So it's very well poised to handle differences in swap speed etc. You effectively just set the threshold for how much time you view as excessive for the kernel to be busy doing MM, and psi tells you when that's hit. > > There's just one issue we've seen so far: a single report of psi reporting > > memory pressure on a desktop system with 4GB RAM which is only running > > the normal desktop components plus a single gmail tab in the web browser. > > psi occasionally reports high memory pressure, so then psi-monitor steps in > > and > > kills the browser tab, which seems erroneous. > > Is it Chrome/Chromium? If so, that's a known bug > (https://bugs.chromium.org/p/chromium/issues/detail?id=333617) The issue does not concern which process is being killed. The issue is that in the single report we have of this, psi is apparently reporting high memory pressure even though the system has plenty of free memory. Daniel
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Hi, Artem S. Tashkinov wrote: > Once you hit a situation when opening a new tab requires more RAM than > is currently available, the system will stall hard. You will barely be > able to move the mouse pointer. Your disk LED will be flashing > incessantly (I'm not entirely sure why). You will not be able to run new > applications or close currently running ones. > > This little crisis may continue for minutes or even longer. I think > that's not how the system should behave in this situation. I believe > something must be done about that to avoid this stall. Thanks for reviving this discussion. Indeed, this is a real pain point in the Linux experience. For Endless, we sunk some time into this and emerged with psi being the best solution we could find. The way it works on a time basis seems very appropriate when what we're ultimately interested in is maintaining desktop UI interactivity. With psi enabled in the kernel, we add a small userspace daemon to kill a process when psi reports that *all* userspace tasks are being blocked on kernel memory management work for (at least) 1 second in a 10 second period. https://github.com/endlessm/eos-boot-helper/blob/master/psi-monitor/psi-monitor.c To share our results so far, despite this daemon being a quick initial implementation, we find that it is bringing excellent results, no more memory pressure hangs. The system recovers in less than 30 seconds, usually in more like 10-15 seconds. Sadly a process got killed along the way, but that's a lot better than the user having no option other than pulling the plug. The system may not always recover to a totally smooth state, but the responsiveness to mouse movements and clicks is still decent, so at that point the user can close some more windows to restore full UI performance again. 
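To illustrate what such a daemon watches for (a self-contained sketch with made-up helper names, not the actual psi-monitor code linked above): /proc/pressure/memory exposes a cumulative "full ... total=<usec>" stall counter, and the daemon acts when that counter advances by at least 1 second of stall within a 10 second window:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the "full" line of /proc/pressure/memory content and return
 * the cumulative stall time in microseconds via *total_us.
 * Returns 0 on success, -1 if no "full" line was found.
 */
static int psi_full_total_us(const char *buf, unsigned long long *total_us)
{
	const char *line = strstr(buf, "full ");
	const char *tot;

	if (!line)
		return -1;
	tot = strstr(line, "total=");
	if (!tot || sscanf(tot, "total=%llu", total_us) != 1)
		return -1;
	return 0;
}

/*
 * Given two samples of the counter taken 10 seconds apart, did all
 * tasks spend at least 1 second blocked on memory management?
 */
static int psi_over_threshold(unsigned long long prev_us,
			      unsigned long long now_us)
{
	return (now_us - prev_us) >= 1000000ULL;
}
```

The real daemon additionally uses the kernel's psi trigger interface so it is woken by the kernel instead of polling, and then picks a victim process to kill; this sketch only shows the threshold logic.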
There's just one issue we've seen so far: a single report of psi reporting memory pressure on a desktop system with 4GB RAM which is only running the normal desktop components plus a single gmail tab in the web browser. psi occasionally reports high memory pressure, so then psi-monitor steps in and kills the browser tab, which seems erroneous. We haven't had a chance to look at this in detail yet. Here's a log from the kernel OOM killer showing the memory and process state at this point. https://gist.github.com/dsd/b338bab0206dcce78263f6bb87de7d4a > I'm almost sure some sysctl parameters could be changed to avoid this > situation but something tells me this could be done for everyone and > made default because some non tech-savvy users will just give up on > Linux if they ever get in a situation like this and they won't be keen > or even be able to Google for solutions. As you anticipated, myself and others already jumped in with solutions appropriate for tech-savvy users. Getting solutions widely deployed is indeed another important aspect to tackle. If you're curious to see how this can look from a "just works" standpoint, you might be interested in downloading Endless (www.endlessos.com) and running your tests there; we have the above solution running and active out of the box. Bastien Nocera has recently adapted and extended our solution, presumably with an eye towards getting this more widely deployed as a standard part of the Linux desktop. https://gitlab.freedesktop.org/hadess/low-memory-monitor/ And if there is a meaningful way to make the kernel behave better, that would obviously be of huge value too. Thanks Daniel
Re: [PATCH] x86/apic: Handle missing global clockevent gracefully
On Mon, Aug 12, 2019 at 2:16 PM Daniel Drake wrote:
> I can do a bit of testing on other platforms too. Are there any
> specific tests I should run, other than checking that the system boots
> and doesn't have any timer watchdog complaints in the log?

Tested this on 2 AMD platforms that were not affected by the crash here.
In addition to confirming that they boot fine without timer complaints
in the logs, I checked the calibrate_APIC_clock() result before and
after this patch. I repeated each test twice.

Asus E402YA (AMD E2-7015)
Before: 99811, 99811
After: 99812, 99812

Acer Aspire A315-21G (AMD A9-9420e)
Before: 99811, 99811
After: 99807, 99820

Those new numbers seem very close to the previous ones and I didn't
observe any problems.

Thanks
Daniel
Re: [RFC PATCH v7] rtl8xxxu: Improve TX performance of RTL8723BU on rtl8xxxu driver
On Mon, Aug 12, 2019 at 11:21 PM Jes Sorensen wrote: > On 8/12/19 10:32 AM, Kalle Valo wrote: > > This is marked as RFC so I'm not sure what's the plan. Should I apply > > this? > > I think it's at a point where it's worth applying - I kinda wish I had > had time to test it, but I won't be near my stash of USB dongles for a > little while. The last remaining reason it was RFC was pending feedback from Jes, to check that his earlier comments had been adequately addressed. So yes I think it's good to apply now. Thanks, Daniel
Re: [PATCH] x86/apic: Handle missing global clockevent gracefully
On Fri, Aug 9, 2019 at 8:54 PM Thomas Gleixner wrote: > Some newer machines do not advertise legacy timers. The kernel can handle > that situation if the TSC and the CPU frequency are enumerated by CPUID or > MSRs and the CPU supports TSC deadline timer. If the CPU does not support > TSC deadline timer the local APIC timer frequency has to be known as well. > > Some Ryzens machines do not advertize legacy timers, but there is no > reliable way to determine the bus frequency which feeds the local APIC > timer when the machine allows overclocking of that frequency. > > As there is no legacy timer the local APIC timer calibration crashes due to > a NULL pointer dereference when accessing the not installed global clock > event device. > > Switch the calibration loop to a non interrupt based one, which polls > either TSC (frequency known) or jiffies. The latter requires a global > clockevent. As the machines which do not have a global clockevent installed > have a known TSC frequency this is a non issue. For older machines where > TSC frequency is not known, there is no known case where the legacy timers > do not exist as that would have been reported long ago. This solves the problem I described in the thread: setup_boot_APIC_clock() NULL dereference during early boot on reduced hardware platforms Thanks for your quick support! > Note: Only lightly tested, but of course not on an affected machine. > > If that works reliably, then this needs some exhaustive testing > on a wide range of systems so we can risk backports to stable > kernels. I can do a bit of testing on other platforms too. Are there any specific tests I should run, other than checking that the system boots and doesn't have any timer watchdog complaints in the log? Thanks Daniel
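The arithmetic behind the polled calibration is straightforward: sample the LAPIC timer count alongside a reference of known frequency (TSC or jiffies) over the same window, then scale. A standalone sketch with a hypothetical helper name - the real calibrate_APIC_clock() must also cope with the timer divisor, wraparound, and measurement error:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the delta of the LAPIC timer counter and the delta of a
 * reference clock of known frequency measured over the same window,
 * compute the LAPIC timer frequency in kHz.
 */
static uint64_t lapic_khz_from_ref(uint64_t lapic_delta,
				   uint64_t ref_delta,
				   uint64_t ref_khz)
{
	if (ref_delta == 0)
		return 0; /* no usable reference */
	return lapic_delta * ref_khz / ref_delta;
}
```

For example, a 2 GHz TSC advancing by 2e9 ticks (one second) while the LAPIC counter advances by 99,811,000 yields 99811 kHz - the same order of number as the calibration results reported in the test above.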
Re: setup_boot_APIC_clock() NULL dereference during early boot on reduced hardware platforms
On Thu, Aug 1, 2019 at 3:16 PM Aubrey Li wrote:
> No, the platform needs a global clock event, can you turn on some other
> clock source on your platform, like HPET?

Thanks Aubrey and Thomas for the quick hints!

I double checked under Windows - it seems to be using a HPET there. Also
there is the HPET ACPI table. So I think this is the right angle to look
at.

Under Linux, hpet_legacy_clockevent_register() is the function where
global_clock_event can be set to HPET. However, the only way this can be
called is from hpet_enable(). hpet_enable() is called from 2 places:

1. From hpet_time_init(). This is the default x86 timer_init that
acpi_generic_reduced_hw_init() took out of action here.

2. From hpet_late_init(). However that function is only called late,
after calibrate_APIC_clock() has already crashed the kernel. Also, even
if moved earlier it would not call hpet_enable() here, because the ACPI
HPET table parsing has already populated hpet_address.

I tried slotting in a call to hpet_enable() at an earlier point
regardless, but I still end up with the kernel hanging later during
boot, probably because irq0 fails to be set up and this error is hit:

    if (setup_irq(0, &irq0))
        pr_info("Failed to register legacy timer interrupt\n");

I'll go deeper into that; further hints welcome too.

Thanks
Daniel
setup_boot_APIC_clock() NULL dereference during early boot on reduced hardware platforms
Hi,

Working with a new consumer laptop based on AMD R7-3700U, we are seeing
a kernel panic during early boot (before the display initializes). It's
a new product and there is no previous known working kernel version
(tested 5.0, 5.2 and current linus master).

We may have also seen this problem on a MiniPC based on AMD APU 7010
from another vendor, but we don't have it in hands right now to confirm
that it's the exact same crash.

earlycon shows the details: a NULL dereference under
setup_boot_APIC_clock(), which actually happens in
calibrate_APIC_clock():

    /* Replace the global interrupt handler */
    real_handler = global_clock_event->event_handler;
    global_clock_event->event_handler = lapic_cal_handler;

global_clock_event is NULL here. This is a "reduced hardware" ACPI
platform, so acpi_generic_reduced_hw_init() has set timer_init to NULL,
avoiding the usual codepaths that would set up global_clock_event.

I tried the obvious:

    if (!global_clock_event)
        return -1;

However I'm probably missing part of the big picture here, as this only
makes boot fail later on. It continues until the next point that
something leads to schedule(), such as a driver calling msleep() or
mark_readonly() calling rcu_barrier(), etc. Then it hangs.

Is something missing in terms of timer setup here? Suggestions
appreciated...

Thanks
Daniel
Re: [PATCH AUTOSEL 5.1 013/219] x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
Hi, On Mon, Jul 15, 2019 at 9:38 PM Sasha Levin wrote: > From: Daniel Drake > > [ Upstream commit 604dc9170f2435d27da5039a3efd757dceadc684 ] In my opinion this is not stable kernel material. It alone does not solve a particular bug. It's part of a larger effort to decrease usage of the 8254 PIT on modern platforms, which solves a real problem (see "x86: skip PIT initialization on modern chipsets"), but I'd conservatively recommend just leaving these patches out of the stable tree. The problem has existed for a while without a particularly wide impact, and there is a somewhat documented workaround of changing BIOS settings. Daniel
Re: [PATCH v2] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
On Fri, Jul 5, 2019 at 10:27 AM Chris Chiu wrote:
> Per the code before REG_S0S1_PATH_SWITCH setting, the driver has told
> the co-processor the antenna is inverse.
>
>     memset(&h2c, 0, sizeof(struct h2c_cmd));
>     h2c.ant_sel_rsv.cmd = H2C_8723B_ANT_SEL_RSV;
>     h2c.ant_sel_rsv.ant_inverse = 1;
>     h2c.ant_sel_rsv.int_switch_type = 0;
>     rtl8xxxu_gen2_h2c_cmd(priv, &h2c, sizeof(h2c.ant_sel_rsv));
>
> At least the current modification is consistent with the antenna
> inverse setting.
> I'll verify on vendor driver about when/how the inverse be determined.

I checked this out. The codepath hit hardcodes it to the AUX port, i.e.
the "inverted" setup:

EXhalbtc8723b1ant_PowerOnSetting():

    if (pBtCoexist->chipInterface == BTC_INTF_USB) {
        // fixed at S0 for USB interface
        pBtCoexist->fBtcWrite4Byte(pBtCoexist, 0x948, 0x0);
        u1Tmp |= 0x1; // antenna inverse
        pBtCoexist->fBtcWriteLocalReg1Byte(pBtCoexist, 0xfe08, u1Tmp);
        pBoardInfo->btdmAntPos = BTC_ANTENNA_AT_AUX_PORT;
    }

So I'm further convinced that these performance-enhancing changes are
increasing consistency with the vendor driver.

Daniel
Re: [PATCH] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
On Wed, Jul 3, 2019 at 8:59 PM Jes Sorensen wrote: > My point is this seems to be very dongle dependent :( We have to be > careful not breaking it for some users while fixing it for others. Do you still have your device? Once we get to the point when you are happy with Chris's two patches here on a code review level, we'll reach out to other driver contributors plus people who previously complained about these types of problems, and see if we can get some wider testing. Larry, do you have these devices, can you help with testing too? Thanks Daniel
Re: [PATCH v2] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
On Thu, Jul 4, 2019 at 6:55 PM Chris Chiu wrote:
> The WiFi tx power of RTL8723BU is extremely low after booting. So
> the WiFi scan gives very limited AP list and it always fails to
> connect to the selected AP. This module only supports 1x1 antenna
> and the antenna is switched to bluetooth due to some incorrect
> register settings.
>
> Compare with the vendor driver https://github.com/lwfinger/rtl8723bu,
> we realized that the 8723bu's enable_rf() does the same thing as
> rtw_btcoex_HAL_Initialize() in vendor driver. And it by default
> sets the antenna path to BTC_ANT_PATH_BT which we verified it's
> the cause of the wifi weak tx power. The vendor driver will set
> the antenna path to BTC_ANT_PATH_PTA in the consequent btcoexist
> mechanism, by the function halbtc8723b1ant_PsTdma.

Checking these details in the vendor driver:

EXhalbtc8723b1ant_PowerOnSetting sets:

    pBoardInfo->btdmAntPos = BTC_ANTENNA_AT_AUX_PORT;

Following the code flow from rtw_btcoex_HAL_Initialize(), this has a
bWifiOnly parameter which will ultimately influence the final register
value. Continuing the flow, we reach halbtc8723b1ant_SetAntPath() with
bInitHwCfg=TRUE, bWifiOff=FALSE. antPosType is set to WIFI in the
bWifiOnly case, and BT otherwise. I'm assuming that bUseExtSwitch =
FALSE (existing rtl8xxxu code also seems to make the same assumption).

For the bWifiOnly=FALSE case, it uses BT:

    pBtCoexist->fBtcWrite4Byte(pBtCoexist, 0x948, 0x0);

and rtl8xxxu seems to do the same - seemingly routing the antenna path
for BT only.

As for halbtc8723b1ant_PsTdma() then being called in a way that causes
it to switch to the PTA path a little later, it's more difficult to
point out how that happens in an email, but I think we can trust you on
that :) There are certainly many callsites that would pass those
parameters.

> +* Different settings per different antenna position.
> +* Antenna Position:       | Normal   Inverse
> +* ---------------------------------------------
> +* Antenna switch to BT:   | 0x280,   0x00
> +* Antenna switch to WiFi: | 0x0,     0x280
> +* Antenna switch to PTA:  | 0x200,   0x80
> */
> -	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x00);
> +	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x80);

I don't really understand what we mean by an "inverse" antenna, and my
reading of the vendor driver leads me to another interpretation.

The logic is based around an antenna position - btdmAntPos. It takes one
of two values:

    BTC_ANTENNA_AT_MAIN_PORT = 0x1,
    BTC_ANTENNA_AT_AUX_PORT  = 0x2,

So the chip has pins to support two antennas - a "main" antenna and an
"aux" one. We know we're dealing with a single antenna, so the actual
module is only going to be using one of those antenna interfaces. If the
chip tries to use the other antenna interface, it's effectively not
using the antenna. So it's rather important to tell the chip to use the
right interface, and that's exactly what happens here: btdmAntPos is
hardcoded to say that the antenna is on the aux port for these devices,
and this code is telling the chip that this is how things are wired up.

The alternative way of calling this an antenna inverse (which the vendor
driver also does in another section), i.e. "the antenna is not connected
to the main port but instead to the other one", seems less clear to me.

Even if we don't fully understand what's going on here, I'm convinced
that your code change is fixing an inconsistency with the vendor driver
and, most significantly, making the signal level actually usable on our
devices. But if you agree with my interpretation of these values then
maybe you could update the comment here!

Daniel
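Written as a lookup, my reading of the comment table is (illustration only, hypothetical names, not proposed driver code):

```c
#include <assert.h>

enum ant_port   { ANT_MAIN, ANT_AUX /* the "inverse" wiring */ };
enum ant_target { ANT_TO_BT, ANT_TO_WIFI, ANT_TO_PTA };

/*
 * REG_S0S1_PATH_SWITCH values from the comment table quoted above:
 * the value depends on both where the antenna signal should go
 * (BT / WiFi / PTA) and which port the single antenna is wired to.
 */
static unsigned int s0s1_path_switch_val(enum ant_target t, enum ant_port p)
{
	static const unsigned int tbl[3][2] = {
		/*               main    aux ("inverse") */
		[ANT_TO_BT]   = { 0x280, 0x000 },
		[ANT_TO_WIFI] = { 0x000, 0x280 },
		[ANT_TO_PTA]  = { 0x200, 0x080 },
	};
	return tbl[t][p];
}
```

With btdmAntPos hardcoded to the aux port, handing control to the PTA selects 0x80, matching the value the patch writes.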
Re: [PATCH] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
On Tue, Jul 2, 2019 at 4:01 PM Chris Chiu wrote:
> When the vendor driver invokes rtw_btcoex_HAL_Initialize, which will then
> call halbtc8723b1ant_SetAntPath to configure the registers in this patch.
> From the code, the registers will have different register settings per the
> antenna position and the phase. If the driver is in the InitHwConfig phase,
> the register value is identical to what rtl8xxxu driver does in enable_rf().
> However, the vendor driver will do halbtc8723b1ant_PsTdma() twice by
> halbtc8723b1ant_ActionWifiNotConnected() with the type argument 8 for
> PTA control about 200ms after InitHwConfig. The _ActionWifiNotConnected
> is invoked by the BTCOEXIST. I keep seeing the halbtc8723b1ant_PsTdma
> with type 8 been called every 2 seconds.

I see. So this is a measured step towards consistency with the vendor
driver. Maybe you can mention these details in the commit message.

> Yes, it ends up with 0x0c not matter what antenna position type is. Unless
> it's configured wifi only.

Also worth mentioning in the commit message then, that the 0xc
ACT_CONTROL value is effectively what the working vendor driver uses.

> > > -* 0x280, 0x00, 0x200, 0x80 - not clear
> > > +* Different settings per different antenna position.
> > > +* Antenna switch to BT: 0x280, 0x00 (inverse)
> > > +* Antenna switch to WiFi: 0x0, 0x280 (inverse)
> > > +* Antenna controlled by PTA: 0x200, 0x80 (inverse)
> > > */
> > > -	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x00);
> > > +	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x80);
> >
> > I don't quite follow the comment here. Why are there 2 values listed
> > for each possibility, what do you mean by inverse? You say the
> > register settings were incorrect, but the previous value was 0x00
> > which you now document as "antenna switch to wifi" which sounds like
> > it was already correct?
> >
> > Which value does the vendor driver use?
> > > The first column means the value for normal antenna installation, wifi > on the main port. The second column is the value for inverse antenna > installation. So if I want to manually switch the antenna for BT use, > and the antenna installation is inverse, I need to set to 0x280. So 0x80 > means I want to switch to PTA and the antenna installation in inverse. Still not quite clear what you mean by "inverse" here, but maybe I just don't know anything about antennas. Is it that an antenna connector has two pins and this one swaps the meaning of each pin? Does the new value of 0x80 reflect what the vendor driver does in practice? Thanks Daniel
Re: [PATCH] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
On Tue, Jul 2, 2019 at 8:42 PM Jes Sorensen wrote:
> We definitely don't want to bring over the vendor code, since it's a
> pile of spaghetti, but we probably need to get something sorted. This
> went down the drain when the bluetooth driver was added without taking
> it into account - long after this driver was merged.

Yeah, I didn't mean bring over quite so literally. Chris is studying it
and figuring out the neatest way to reimplement the required bits.

As for the relationship with bluetooth: actually the bug that Chris is
working on here is that the rtl8xxxu wifi signal is totally unusable
*until* the bluetooth driver is loaded. Once the bluetooth driver is
loaded, at the point of bluetooth firmware upload, the rtl8xxxu signal
strength magically becomes good. I think this is consistent with other
rtl8xxxu problem reports that we saw lying around, although they had not
been diagnosed in so much detail.

The rtl8723bu vendor driver does not suffer this problem; it works fine
with or without the bluetooth driver in place.

Daniel
Re: [PATCH] rtl8xxxu: Fix wifi low signal strength issue of RTL8723BU
Hi Chris,

On Thu, Jun 27, 2019 at 5:53 PM Chris Chiu wrote:
> The WiFi tx power of RTL8723BU is extremely low after booting. So
> the WiFi scan gives very limited AP list and it always fails to
> connect to the selected AP. This module only supports 1x1 antenna
> and the antenna is switched to bluetooth due to some incorrect
> register settings.
>
> This commit hand over the antenna control to PTA, the wifi signal
> will be back to normal and the bluetooth scan can also work at the
> same time. However, the btcoexist still needs to be handled under
> different circumstances. If there's a BT connection established,
> the wifi still fails to connect until disconneting the BT.
>
> Signed-off-by: Chris Chiu

Really nice work finding this!

I know that after this change, you plan to bring over the btcoexist code
from the vendor driver (or at least the minimum required code) for a
more complete fix, but I'm curious how you found these magic register
values and how they compare to the values used by the vendor driver with
btcoexist?

What's PTA? A type of firmware-implemented btcoexist that works for
scanning but doesn't work when a BT connection is actually established?
> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
> b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
> index 3adb1d3d47ac..6c3c70d93ac1 100644
> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
> @@ -1525,7 +1525,7 @@ static void rtl8723b_enable_rf(struct rtl8xxxu_priv *priv)
> 	/*
> 	 * WLAN action by PTA
> 	 */
> -	rtl8xxxu_write8(priv, REG_WLAN_ACT_CONTROL_8723B, 0x04);
> +	rtl8xxxu_write8(priv, REG_WLAN_ACT_CONTROL_8723B, 0x0c);

The comment above this still says "WLAN action by PTA" and the vendor
driver has:

    //set wlan_act control by PTA
    pBtCoexist->fBtcWrite1Byte(pBtCoexist, 0x76e, 0x4);

but then also:

    //set wlan_act control by PTA
    pBtCoexist->fBtcWrite1Byte(pBtCoexist, 0x76e, 0xc);

So this change seems to be at least consistent with the ambiguity of the
vendor driver. Do you have any understanding of the extra bit that is
now set here?

It's not easy to follow the code flow of the vendor driver to see what
actually happens. Have you checked that - does it end up using the 0xc
value?

> -* 0x280, 0x00, 0x200, 0x80 - not clear
> +* Different settings per different antenna position.
> +* Antenna switch to BT: 0x280, 0x00 (inverse)
> +* Antenna switch to WiFi: 0x0, 0x280 (inverse)
> +* Antenna controlled by PTA: 0x200, 0x80 (inverse)
> */
> -	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x00);
> +	rtl8xxxu_write32(priv, REG_S0S1_PATH_SWITCH, 0x80);

I don't quite follow the comment here. Why are there 2 values listed for
each possibility, and what do you mean by inverse? You say the register
settings were incorrect, but the previous value was 0x00, which you now
document as "antenna switch to wifi" - that sounds like it was already
correct?

Which value does the vendor driver use?
> /*
>  * Software control, antenna at WiFi side
> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> index 8136e268b4e6..87b2179a769e 100644
> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> @@ -3891,12 +3891,13 @@ static int rtl8xxxu_init_device(struct ieee80211_hw *hw)
>
> 	/* Check if MAC is already powered on */
> 	val8 = rtl8xxxu_read8(priv, REG_CR);
> +	val16 = rtl8xxxu_read16(priv, REG_SYS_CLKR);
>
> 	/*
> 	 * Fix 92DU-VC S3 hang with the reason is that secondary mac is not
> 	 * initialized. First MAC returns 0xea, second MAC returns 0x00
> 	 */
> -	if (val8 == 0xea)
> +	if (val8 == 0xea || !(val16 & BIT(11)))
> 		macpower = false;
> 	else
> 		macpower = true;

At a glance I can't see which code this corresponds to in the vendor
driver, can you point that out?

Thanks
Daniel
Re: [RFC PATCH v5] rtl8xxxu: Improve TX performance of RTL8723BU on rtl8xxxu driver
On Mon, Jun 17, 2019 at 2:56 PM Chris Chiu wrote:
> With this commit, the tx rate of each data and qos data packet will
> be 39Mbps (MCS4) with the 0xF0 as the tx rate mask. The 20th bit
> to 23th bit means MCS4 to MCS7. It means that the firmware still picks
> the lowest rate from the rate mask and explains why the tx rate of
> data and qos data is always lowest 1Mbps because the default rate mask
> passed is always 0xFFF ranges from the basic CCK rate, OFDM rate,
> and MCS rate. However, with Realtek's driver, the tx rate observed from
> wireshark under the same condition is almost 65Mbps or 72Mbps

Suggestion: append ", indicating that rtl8xxxu could still be further
improved." here.

Then remove this paragraph; I think we're in agreement on the approach:

> I believe the firmware of RTL8723BU may need fix. And I think we
> can still bring in the dm_watchdog as rtlwifi to improve from the
> driver side. Please leave precious comments for my commits and
> suggest what I can do better. Or suggest if there's any better idea
> to fix this. Thanks.

> Signed-off-by: Chris Chiu

Reviewed-by: Daniel Drake

> +* is supported and no iface_combinations are providec.

Typo: provided
Re: [PATCH v3] Bluetooth: btrtl: HCI reset on close for Realtek BT chip
On Tue, Jun 25, 2019 at 4:32 PM Jian-Hong Pan wrote: > Realtek RTL8822BE BT chip on ASUS X420FA cannot be turned on correctly > after on-off several times. Bluetooth daemon sets BT mode failed when > this issue happens. Scanning must be active while turning off for this > bug to be hit. > > bluetoothd[1576]: Failed to set mode: Failed (0x03) > > If BT is turned off, then turned on again, it works correctly again. > > According to the vendor driver, the HCI_QUIRK_RESET_ON_CLOSE flag is set > during probing. So, this patch makes Realtek's BT reset on close to fix > this issue. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=203429 > Signed-off-by: Jian-Hong Pan Reviewed-by: Daniel Drake
Re: [PATCH] x86: skip PIT initialization on modern chipsets
Oops, please also add a reference to the main thread: On Fri, Jun 28, 2019 at 3:23 PM Daniel Drake wrote: > Detect modern setups where we have no need for the PIT (e.g. where > we already know the TSC and LAPIC timer frequencies, so no need > to calibrate them against the PIT), and skip initialization PIT in > such cases. > > Skip the IO-APIC timer-checking code when we don't have a PIT > to test against (this was causing the panic). Link: https://lore.kernel.org/lkml/CAD8Lp45fedoPLnK=umuhhtkjy5u2h04sykrx3u+m04u6fpv...@mail.gmail.com > Tested-by: Daniel Drake
[PATCH] x86: skip PIT initialization on modern chipsets
From: Thomas Gleixner Recent Intel chipsets including Skylake and ApolloLake have a special ITSSPRC register which allows the 8254 PIT to be gated. When gated, the 8254 registers can still be programmed as normal, but there are no IRQ0 timer interrupts. Some products such as the Connex L1430 and exone go Rugged E11 use this register to ship with the PIT gated by default. This causes Linux to fail to boot: Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. The panic happens before the framebuffer is initialized, so to the user, it appears as an early boot hang on a black screen. Affected products typically have a BIOS option that can be used to enable the 8254 and make Linux work (Chipset -> South Cluster Configuration -> Miscellaneous Configuration -> 8254 Clock Gating), however it would be best to make Linux support the no-8254 case. Detect modern setups where we have no need for the PIT (e.g. where we already know the TSC and LAPIC timer frequencies, so no need to calibrate them against the PIT), and skip PIT initialization in such cases. Skip the IO-APIC timer-checking code when we don't have a PIT to test against (this was causing the panic). Tested-by: Daniel Drake --- arch/x86/include/asm/apic.h| 2 ++ arch/x86/include/asm/time.h| 1 + arch/x86/kernel/apic/apic.c| 27 +++ arch/x86/kernel/apic/io_apic.c | 4 arch/x86/kernel/i8253.c| 25 - arch/x86/kernel/time.c | 7 +-- 6 files changed, 63 insertions(+), 3 deletions(-) Thomas, please add your sign off before pushing. 
Depends on commits in tip.git x86/apic x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period' x86/tsc: Set LAPIC timer period to crystal clock frequency diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 71026eba9ba4..3e66b82454ce 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -175,6 +175,7 @@ extern void lapic_assign_system_vectors(void); extern void lapic_assign_legacy_vector(unsigned int isairq, bool replace); extern void lapic_online(void); extern void lapic_offline(void); +extern bool apic_needs_pit(void); #else /* !CONFIG_X86_LOCAL_APIC */ static inline void lapic_shutdown(void) { } @@ -188,6 +189,7 @@ static inline void init_bsp_APIC(void) { } static inline void apic_intr_mode_init(void) { } static inline void lapic_assign_system_vectors(void) { } static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { } +static inline bool apic_needs_pit(void) { return true; } #endif /* !CONFIG_X86_LOCAL_APIC */ #ifdef CONFIG_X86_X2APIC diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h index cef818b16045..8ac563abb567 100644 --- a/arch/x86/include/asm/time.h +++ b/arch/x86/include/asm/time.h @@ -7,6 +7,7 @@ extern void hpet_time_init(void); extern void time_init(void); +extern bool pit_timer_init(void); extern struct clock_event_device *global_clock_event; diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 8956072f677d..abe67c3ae071 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -821,6 +821,33 @@ static int __init lapic_init_clockevent(void) return 0; } +bool __init apic_needs_pit(void) +{ + /* +* If the frequencies are not known, PIT is required for both TSC +* and apic timer calibration. +*/ + if (!tsc_khz || !cpu_khz) + return true; + + /* Is there an APIC at all? 
*/ + if (!boot_cpu_has(X86_FEATURE_APIC)) + return true; + + /* Deadline timer is based on TSC so no further PIT action required */ + if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) + return false; + + /* APIC timer disabled? */ + if (disable_apic_timer) + return true; + /* +* The APIC timer frequency is known already, no PIT calibration +* required. If unknown, let the PIT be initialized. +*/ + return lapic_timer_period == 0; +} + static int __init calibrate_APIC_clock(void) { struct clock_event_device *levt = this_cpu_ptr(_events); diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 53aa234a6803..1bb864798800 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -58,6 +58,7 @@ #include #include #include +#include #include #include #include @@ -2083,6 +2084,9 @@ static inline void __init check_timer(void) unsigned long flags; int no_pin1 = 0; + if (!global_clock_event) + return; + local_irq_save(flags); /* diff --git a/arch/x86/kernel/i8253.c b/arch/x86/kernel/i8253.c index 0d307a657abb..2b7999a1a50a 100644 --- a/arch/x86/kernel/i8253.c +++ b/arch/x86/kernel/i8253.
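For quick reference, the decision order implemented by apic_needs_pit() in the patch above condenses to the following standalone sketch, with the kernel globals (tsc_khz, cpu_khz, lapic_timer_period) and CPU feature tests turned into parameters so the logic can be read and exercised in isolation:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Condensed sketch mirroring the decision order of apic_needs_pit()
 * above.  Kernel globals and boot_cpu_has() feature bits are passed
 * in as parameters; this is a standalone illustration, not the patch.
 */
static bool needs_pit(uint32_t tsc_khz, uint32_t cpu_khz,
		      bool has_apic, bool has_tsc_deadline,
		      bool apic_timer_disabled,
		      uint32_t lapic_timer_period)
{
	/* Unknown frequencies: PIT needed for TSC/APIC calibration */
	if (!tsc_khz || !cpu_khz)
		return true;

	/* No local APIC at all: the PIT is the only timer left */
	if (!has_apic)
		return true;

	/* Deadline timer is based on the TSC, no PIT required */
	if (has_tsc_deadline)
		return false;

	/* APIC timer disabled on the command line */
	if (apic_timer_disabled)
		return true;

	/* Known APIC timer frequency means no PIT calibration needed */
	return lapic_timer_period == 0;
}
```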
Re: No 8254 PIT & no HPET on new Intel N3350 platforms causes kernel panic during early boot
On Thu, Jun 27, 2019 at 10:07 PM Thomas Gleixner wrote: > Nah. That extra timer works thing is just another bandaid. > > What I had in mind is something like the below. That's on top of > >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic > > Be warned. It's neither compiled nor tested, so keep a fire extinguisher > handy. If it explodes you own the pieces. Thanks, it works, and makes sense! I'll add a commit message and send it as a patch, just one quick function naming question... did you mean apic_and_tsc_needs_pit() or apic_needs_pit()? That's the only compile fix needed. Daniel
Re: No 8254 PIT & no HPET on new Intel N3350 platforms causes kernel panic during early boot
Hi Thomas, Picking up this issue again after a break! We made some progress last time on reducing PIT usage in the TSC calibration code, but we still have the bigger issue to resolve: IO-APIC code panicking when the PIT isn't ticking. On Wed, Apr 3, 2019 at 7:21 PM Thomas Gleixner wrote: > For newer CPUs we might assume that: > > 1) The TSC and APIC timer are actually usable > > 2) The frequencies can be retrieved from CPUID or MSRs > > If #1 and #2 are reliable we can avoid the whole calibration and interrupt > delivery mess. > > That means we need the following decision logic: > > 1) If HPET is available in ACPI, boot normal. > > 2) If HPET is not available, verify that the PIT actually counts. If it > does, boot normal. > > If it does not either: > > 2A) Verify that this is a PCH 300/C240 and fiddle with that ISST bit. > > But that means that we need to chase PCH ids forever... > > 2B) Shrug and just avoid the whole PIT/HPET magic all over the place: > > - Avoid the interrupt delivery check in the IOAPIC code as it's >uninteresting in that case. Trivial to do. I tried to explore this idea here: https://lore.kernel.org/patchwork/patch/1064972/ https://lore.kernel.org/patchwork/patch/1064971/ But I can't say I really knew what I was doing there, and you pointed out some problems. Any new ideas that I can experiment with? Being more conservative, how about something like this? --- arch/x86/kernel/apic/io_apic.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 53aa234a6803..36b1e7d5b657 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -2073,7 +2073,7 @@ static int mp_alloc_timer_irq(int ioapic, int pin) * * FIXME: really need to revamp this for all platforms. 
*/ -static inline void __init check_timer(void) +static inline void __init check_timer(int timer_was_working) { struct irq_data *irq_data = irq_get_irq_data(0); struct mp_chip_data *data = irq_data->chip_data; @@ -2216,8 +2216,15 @@ static inline void __init check_timer(void) apic_printk(APIC_QUIET, KERN_INFO "Perhaps problem with the pre-enabled x2apic mode\n" "Try booting with x2apic and interrupt-remapping disabled in the bios.\n"); - panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a " - "report. Then try booting with the 'noapic' option.\n"); + + if (timer_was_working) + panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a " + "report. Then try booting with the 'noapic' option.\n"); + else + apic_printk(APIC_QUIET, KERN_INFO + "Continuing anyway with no 8254 timer, as it was not working even before IO-APIC setup was attempted.\n" + "Boot will fail unless another working clocksource is found.\n"); + out: local_irq_restore(flags); } @@ -2304,12 +2311,20 @@ static void ioapic_destroy_irqdomain(int idx) void __init setup_IO_APIC(void) { int ioapic; + int timer_was_working = 0; if (skip_ioapic_setup || !nr_ioapics) return; io_apic_irqs = nr_legacy_irqs() ? ~PIC_IRQS : ~0UL; + /* +* Record if the timer was in working state before we do any +* IO-APIC setup. +*/ + if (nr_legacy_irqs()) + timer_was_working = timer_irq_works(); + apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n"); for_each_ioapic(ioapic) BUG_ON(mp_irqdomain_create(ioapic)); @@ -2323,7 +2338,7 @@ void __init setup_IO_APIC(void) setup_IO_APIC_irqs(); init_IO_APIC_traps(); if (nr_legacy_irqs()) - check_timer(); + check_timer(timer_was_working); ioapic_initialized = 1; } -- 2.20.1
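As an aside, Thomas's step 2 above ("verify that the PIT actually counts") amounts to reading the 8254 counter a few times and checking whether it moves. A hedged sketch of that check follows, with the port 0x43/0x40 access abstracted behind a callback so no real port I/O is involved; in practice the kernel's pit_expect_msb()/timer_irq_works() do this job:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of "verify that the PIT actually counts".  Real code would
 * latch and read the 8254 counter through ports 0x43/0x40; here the
 * hardware access sits behind a callback so only the decision logic
 * is shown.  Illustrative only.
 */
static bool pit_is_counting(uint16_t (*read_counter)(void))
{
	uint16_t first = read_counter();
	int i;

	/*
	 * A running 8254 decrements quickly, so the value should change
	 * within a handful of reads.  A clock-gated PIT still accepts
	 * programming but reads back a frozen count forever.
	 */
	for (i = 0; i < 100; i++)
		if (read_counter() != first)
			return true;
	return false;
}

/* Two stand-in counter sources, for illustration: */
static uint16_t gated_pit_read(void)
{
	return 0x1234;		/* frozen: never ticks */
}

static uint16_t demo_counter;
static uint16_t running_pit_read(void)
{
	return demo_counter--;	/* decrements like a live 8254 */
}
```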
Re: [PATCH v2] Bluetooth: btrtl: HCI reset on close for Realtek BT chip
On Mon, Jun 24, 2019 at 2:24 PM Jian-Hong Pan wrote: > Realtek RTL8822BE BT chip on ASUS X420FA cannot be turned on correctly > after on-off several times. Bluetooth daemon sets BT mode failed when > this issue happens. You could also mention that scanning must be active while turning off for this bug to be hit. > bluetoothd[1576]: Failed to set mode: Failed (0x03) > > If BT is tunred off, then turned on again, it works correctly again. Typo: turned > According to the vendor driver, the HCI_QUIRK_RESET_ON_CLOSE flag is set > during probing. So, this patch makes Realtek's BT reset on close to fix > this issue. Checked the vendor driver - I see what you are referring to, so the change seems correct. #if HCI_VERSION_CODE >= KERNEL_VERSION(3, 7, 1) if (!reset) set_bit(HCI_QUIRK_RESET_ON_CLOSE, >quirks); RTKBT_DBG("set_bit(HCI_QUIRK_RESET_ON_CLOSE, >quirks);"); #endif However I'm pretty sure this is not saying that kernel 3.7.0 did not need the reset. I think it just means that the flag did not exist before Linux-3.7.1, so they added the ifdef to add some level of compatibility with older kernel versions. I think you can remove "since kernel v3.7.1." from the comment. After those changes you can add: Link: https://bugzilla.kernel.org/show_bug.cgi?id=203429 Reviewed-by: Daniel Drake > Signed-off-by: Jian-Hong Pan > --- > v2: > - According to the vendor driver, it makes "all" Realtek's BT reset on >close. So, this version makes it the same. > - Change to the new subject for all Realtek BT chips. 
> > drivers/bluetooth/btrtl.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/bluetooth/btrtl.c b/drivers/bluetooth/btrtl.c > index 208feef63de4..be6d5f7e1e44 100644 > --- a/drivers/bluetooth/btrtl.c > +++ b/drivers/bluetooth/btrtl.c > @@ -630,6 +630,10 @@ int btrtl_setup_realtek(struct hci_dev *hdev) > return PTR_ERR(btrtl_dev); > > ret = btrtl_download_firmware(hdev, btrtl_dev); > + /* According to the vendor driver, BT must be reset on close to avoid > +* firmware crash since kernel v3.7.1. > +*/ > + set_bit(HCI_QUIRK_RESET_ON_CLOSE, >quirks); > > btrtl_free(btrtl_dev); > > -- > 2.22.0 >
Re: [PATCH v2 2/5] nvme: rename "pci" operations to "mmio"
On Mon, Jun 24, 2019 at 2:16 PM Christoph Hellwig wrote: > IFF we want to support it it has to be done at the PCIe layer. But > even that will require actual documentation and support from Intel. > > If Intel still believes this scheme is their magic secret to control > the NVMe market and give themselves and unfair advantage over their > competitors there is not much we can do. Since the 2016 discussion, more documentation has been published: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf Chapter 15 is entirely new, and section 15.2 provides a nice clarity improvement of the magic regs in the AHCI BAR, which I have used in these patches to clean up the code and add documentation in the header (see patch 1 in this series, ahci-remap.h). I believe there's room for further improvement in the docs here, but it would be nice to know what you see as the blocking questions or documentation gaps that would prevent us from continuing to develop the fake PCI bridge approach (https://marc.info/?l=linux-pci=156015271021614=2). We are going to try and push Intel on this via other channels to see if we can get a contact to help us, so it would be useful if I can include a concrete list of what we need. Bearing in mind that we've already been told that the NVMe device config space is inaccessible, and the new docs show exactly how the BIOS enforces such inaccessibility during early boot, the remaining points you mentioned recently were: b) reset handling, including the PCI device removal as the last escalation step c) SR-IOV VFs and their management d) power management Are there other blocking questions you would require answers to? Thanks, Daniel
Re: [PATCH] Bluetooth: btrtl: HCI reset on close for RTL8822BE
Hi Jian-Hong, On Fri, Jun 21, 2019 at 4:59 PM Jian-Hong Pan wrote: > Realtek RTL8822BE BT chip on ASUS X420FA cannot be turned on correctly > after on-off several times. Bluetooth daemon sets BT mode failed when > this issue happens. > > bluetoothd[1576]: Failed to set mode: Failed (0x03) > > If BT is tunred off, then turned on again, it works correctly again. > This patch makes RTL8822BE BT reset on close to fix this issue. I know we've been trying to understand why Realtek's own bluetooth driver avoids this bug, but is this solution based upon code in the vendor driver? At a glance I can't see the flag (or equivalent) being set there. Daniel
Re: [PATCH v2 2/5] nvme: rename "pci" operations to "mmio"
On Thu, Jun 20, 2019 at 2:11 PM Christoph Hellwig wrote: > The Linux NVMe driver will deal with NVMe as specified plus whatever > minor tweaks we'll need for small bugs. Hiding it behind an AHCI > device is completely out of scope and will not be accepted. Do you have any new suggestions for alternative ways we can implement support for this storage configuration? I tried to follow your earlier suggestions regarding faking a PCI bus here: (or let me know if you had something different in mind...) https://marc.info/?l=linux-pci=156015271021614=2 but it looks like that is not going to fly either :( Daniel
[PATCH v2 4/5] nvme: move common definitions to pci.h
From: Dan Williams A platform-driver for nvme resources needs access to struct nvme_dev and other definitions that are currently local to pci.c. Signed-off-by: Dan Williams Signed-off-by: Daniel Drake --- drivers/nvme/host/pci.c | 125 +--- drivers/nvme/host/pci.h | 136 2 files changed, 137 insertions(+), 124 deletions(-) create mode 100644 drivers/nvme/host/pci.h diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 23bda524f16b..bed6c91b6b7c 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -26,6 +26,7 @@ #include "trace.h" #include "nvme.h" +#include "pci.h" #define SQ_SIZE(depth) (depth * sizeof(struct nvme_command)) #define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion)) @@ -83,97 +84,9 @@ static int poll_queues = 0; module_param_cb(poll_queues, _count_ops, _queues, 0644); MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO."); -struct nvme_dev; -struct nvme_queue; - static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown); static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode); -struct nvme_dev_ops { - /* Enable device (required) */ - int (*enable)(struct nvme_dev *dev); - - /* Disable device (required) */ - void (*disable)(struct nvme_dev *dev); - - /* Allocate IRQ vectors for given number of io queues (required) */ - int (*setup_irqs)(struct nvme_dev *dev, int nr_io_queues); - - /* Get the IRQ vector for a specific queue */ - int (*q_irq)(struct nvme_queue *q); - - /* Allocate device-specific SQ command buffer (optional) */ - int (*cmb_alloc_sq_cmds)(struct nvme_queue *nvmeq, size_t size, -struct nvme_command **sq_cmds, -dma_addr_t *sq_dma_addr); - - /* Free device-specific SQ command buffer (optional) */ - void (*cmb_free_sq_cmds)(struct nvme_queue *nvmeq, -struct nvme_command *sq_cmds, size_t size); - - /* Device-specific mapping of blk queues to CPUs (optional) */ - int (*map_queues)(struct nvme_dev *dev, struct blk_mq_queue_map *map, - int offset); - - /* Check if device 
is enabled on the bus (required) */ - int (*is_enabled)(struct nvme_dev *dev); - - /* Check if channel is in running state (required) */ - int (*is_offline)(struct nvme_dev *dev); - - /* Check if device is present and responding (optional) */ - bool (*is_present)(struct nvme_dev *dev); - - /* Check & log device state before it gets reset (optional) */ - void (*warn_reset)(struct nvme_dev *dev); -}; - -/* - * Represents an NVM Express device. Each nvme_dev is a PCI function. - */ -struct nvme_dev { - const struct resource *res; - const struct nvme_dev_ops *ops; - struct nvme_queue *queues; - struct blk_mq_tag_set tagset; - struct blk_mq_tag_set admin_tagset; - u32 __iomem *dbs; - struct device *dev; - struct dma_pool *prp_page_pool; - struct dma_pool *prp_small_pool; - unsigned online_queues; - unsigned max_qid; - unsigned io_queues[HCTX_MAX_TYPES]; - unsigned int num_vecs; - int q_depth; - u32 db_stride; - void __iomem *bar; - unsigned long bar_mapped_size; - struct work_struct remove_work; - struct mutex shutdown_lock; - bool subsystem; - u64 cmb_size; - bool cmb_use_sqes; - u32 cmbsz; - u32 cmbloc; - struct nvme_ctrl ctrl; - - mempool_t *iod_mempool; - - /* shadow doorbell buffer support: */ - u32 *dbbuf_dbs; - dma_addr_t dbbuf_dbs_dma_addr; - u32 *dbbuf_eis; - dma_addr_t dbbuf_eis_dma_addr; - - /* host memory buffer support: */ - u64 host_mem_size; - u32 nr_host_mem_descs; - dma_addr_t host_mem_descs_dma; - struct nvme_host_mem_buf_desc *host_mem_descs; - void **host_mem_desc_bufs; -}; - static int io_queue_depth_set(const char *val, const struct kernel_param *kp) { int n = 0, ret; @@ -213,42 +126,6 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl) return container_of(ctrl, struct nvme_dev, ctrl); } -/* - * An NVM Express queue. Each device has at least two (one for admin - * commands and one for I/O commands). 
- */ -struct nvme_queue { - struct nvme_dev *dev; - char irqname[24]; /* nvme4294967295-65535\0 */ - spinlock_t sq_lock; - struct nvme_command *sq_cmds; -/* only used for poll queues: */ - spinlock_t cq_poll_lock cacheline_aligned_in_smp; - volatile struct nvme_completion *cqes; - struct blk_mq_tags **tags; - dma_addr_t sq_dma_addr; - dma_addr_t cq_dma_addr; - u32 __iomem *q_db; - u16
[PATCH v2 5/5] nvme: Intel AHCI remap support
Provide a platform driver for the nvme resources that may be remapped behind an ahci bar on common Intel platforms. The implementation relies on the standard nvme driver, but reimplements the nvme_dev_ops accordingly. As the original NVMe PCI device is inaccessible, this driver is somewhat limited: we always assume the device is present & online, can't detect PCI errors, can't reset, power management is limited, etc. A single shared legacy interrupt is used, although there is some hope that MSI-X support could be added later. Based on previous code by Dan Williams. Signed-off-by: Daniel Drake --- drivers/ata/Kconfig | 1 + drivers/nvme/host/Kconfig| 3 + drivers/nvme/host/Makefile | 3 + drivers/nvme/host/intel-ahci-remap.c | 185 +++ drivers/nvme/host/pci.c | 21 +-- drivers/nvme/host/pci.h | 9 ++ 6 files changed, 214 insertions(+), 8 deletions(-) create mode 100644 drivers/nvme/host/intel-ahci-remap.c diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig index 6e82d66d7516..fb64e690d325 100644 --- a/drivers/ata/Kconfig +++ b/drivers/ata/Kconfig @@ -113,6 +113,7 @@ config SATA_AHCI_INTEL_NVME_REMAP bool "AHCI: Intel Remapped NVMe device support" depends on SATA_AHCI depends on BLK_DEV_NVME + select NVME_INTEL_AHCI_REMAP help Support access to remapped NVMe devices that appear in AHCI PCI memory space. 
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index ec43ac9199e2..a8aefb18eb15 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -26,6 +26,9 @@ config NVME_MULTIPATH config NVME_FABRICS tristate +config NVME_INTEL_AHCI_REMAP + tristate + config NVME_RDMA tristate "NVM Express over Fabrics RDMA host driver" depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile index 8a4b671c5f0c..2010169880b7 100644 --- a/drivers/nvme/host/Makefile +++ b/drivers/nvme/host/Makefile @@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o obj-$(CONFIG_NVME_RDMA)+= nvme-rdma.o obj-$(CONFIG_NVME_FC) += nvme-fc.o obj-$(CONFIG_NVME_TCP) += nvme-tcp.o +obj-$(CONFIG_NVME_INTEL_AHCI_REMAP)+= nvme-intel-ahci-remap.o nvme-core-y:= core.o nvme-core-$(CONFIG_TRACING)+= trace.o @@ -24,3 +25,5 @@ nvme-rdma-y += rdma.o nvme-fc-y += fc.o nvme-tcp-y += tcp.o + +nvme-intel-ahci-remap-y+= intel-ahci-remap.o diff --git a/drivers/nvme/host/intel-ahci-remap.c b/drivers/nvme/host/intel-ahci-remap.c new file mode 100644 index ..7194d9dd0016 --- /dev/null +++ b/drivers/nvme/host/intel-ahci-remap.c @@ -0,0 +1,185 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Intel AHCI remapped NVMe platform driver + * + * Copyright (c) 2011-2016, Intel Corporation. + * Copyright (c) 2019, Endless Mobile, Inc. + * + * Support platform devices created by the ahci driver, corresponding to + * NVMe devices that have been remapped into the ahci device memory space. + * + * This scheme is rather peculiar, as NVMe is inherently based on PCIe, + * however we only have access to the NVMe device MMIO space and an + * interrupt. Without access to the pci_device, many features are + * unavailable; this driver only intends to offer basic functionality. 
+ */ + +#include +#include +#include +#include +#include "pci.h" + +struct ahci_remap_data { + atomic_t enabled; +}; + +static struct ahci_remap_data *to_ahci_remap_data(struct nvme_dev *dev) +{ + return dev->dev->platform_data; +} + +static int ahci_remap_enable(struct nvme_dev *dev) +{ + int rc; + struct resource *res; + struct device *ddev = dev->dev; + struct device *parent = ddev->parent; + struct ahci_remap_data *adata = to_ahci_remap_data(dev); + struct platform_device *pdev = to_platform_device(ddev); + + res = platform_get_resource(pdev, IORESOURCE_IRQ, 0); + if (!res) + return -ENXIO; + + /* parent ahci device determines the dma mask */ + if (dma_supported(parent, DMA_BIT_MASK(64))) + rc = dma_coerce_mask_and_coherent(ddev, DMA_BIT_MASK(64)); + else if (dma_supported(parent, DMA_BIT_MASK(32))) + rc = dma_coerce_mask_and_coherent(ddev, DMA_BIT_MASK(32)); + else + rc = -ENXIO; + if (rc) + return rc; + + rc = nvme_enable(dev); + if (rc) + return rc; + + atomic_inc(>enabled); + + return 0; +} + +static int ahci_remap_is_enabled(struct nvme_dev *dev) +{ + struct ahci_remap_data *adata = to_ahci_remap_da
[PATCH v2 2/5] nvme: rename "pci" operations to "mmio"
From: Dan Williams In preparation for adding a platform_device nvme host, rename to a more generic "mmio" prefix. Signed-off-by: Dan Williams Signed-off-by: Daniel Drake --- drivers/nvme/host/pci.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 524d6bd6d095..42990b93349d 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1108,7 +1108,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx) return found; } -static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl) +static void nvme_mmio_submit_async_event(struct nvme_ctrl *ctrl) { struct nvme_dev *dev = to_nvme_dev(ctrl); struct nvme_queue *nvmeq = >queues[0]; @@ -2448,7 +2448,7 @@ static void nvme_release_prp_pools(struct nvme_dev *dev) dma_pool_destroy(dev->prp_small_pool); } -static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) +static void nvme_mmio_free_ctrl(struct nvme_ctrl *ctrl) { struct nvme_dev *dev = to_nvme_dev(ctrl); @@ -2610,42 +2610,42 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work) nvme_put_ctrl(>ctrl); } -static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) +static int nvme_mmio_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val) { *val = readl(to_nvme_dev(ctrl)->bar + off); return 0; } -static int nvme_pci_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) +static int nvme_mmio_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) { writel(val, to_nvme_dev(ctrl)->bar + off); return 0; } -static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) +static int nvme_mmio_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) { *val = readq(to_nvme_dev(ctrl)->bar + off); return 0; } -static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size) +static int nvme_mmio_get_address(struct nvme_ctrl *ctrl, char *buf, int size) { struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev); return snprintf(buf, size, "%s", 
dev_name(>dev)); } -static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = { +static const struct nvme_ctrl_ops nvme_mmio_ctrl_ops = { .name = "pcie", .module = THIS_MODULE, .flags = NVME_F_METADATA_SUPPORTED | NVME_F_PCI_P2PDMA, - .reg_read32 = nvme_pci_reg_read32, - .reg_write32= nvme_pci_reg_write32, - .reg_read64 = nvme_pci_reg_read64, - .free_ctrl = nvme_pci_free_ctrl, - .submit_async_event = nvme_pci_submit_async_event, - .get_address= nvme_pci_get_address, + .reg_read32 = nvme_mmio_reg_read32, + .reg_write32= nvme_mmio_reg_write32, + .reg_read64 = nvme_mmio_reg_read64, + .free_ctrl = nvme_mmio_free_ctrl, + .submit_async_event = nvme_mmio_submit_async_event, + .get_address= nvme_mmio_get_address, }; static int nvme_dev_map(struct nvme_dev *dev) @@ -2758,7 +2758,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) goto release_pools; } - result = nvme_init_ctrl(>ctrl, >dev, _pci_ctrl_ops, + result = nvme_init_ctrl(>ctrl, >dev, _mmio_ctrl_ops, quirks); if (result) goto release_mempool; -- 2.20.1
[PATCH v2 1/5] ahci: Discover Intel remapped NVMe devices
Intel SATA AHCI controllers support a strange mode where NVMe devices disappear from the PCI bus, and instead are remapped into AHCI PCI memory space. Many current and upcoming consumer products ship with the AHCI controller in this "RAID" or "Intel RST Premium with Intel Optane System Acceleration" mode by default. Without Linux support for this remapped mode, the default out-of-the-box experience is that the NVMe storage device is inaccessible (which in many cases is the only internal storage device). Using partial information provided by Intel in datasheets, emails, and previous patches, extend the AHCI driver to detect the remapped NVMe devices and create corresponding platform devices, to be picked up by the nvme driver. Our knowledge of the design and workings of this remapping scheme has been collected in ahci-remap.h, which can be considered the best spec we have at the moment. Based on earlier work by Dan Williams. Signed-off-by: Daniel Drake --- drivers/ata/Kconfig| 32 +++ drivers/ata/ahci.c | 173 - drivers/ata/ahci.h | 14 +++ include/linux/ahci-remap.h | 140 +++--- 4 files changed, 329 insertions(+), 30 deletions(-) diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig index a6beb2c5a692..6e82d66d7516 100644 --- a/drivers/ata/Kconfig +++ b/drivers/ata/Kconfig @@ -109,6 +109,38 @@ config SATA_MOBILE_LPM_POLICY Note "Minimum power" is known to cause issues, including disk corruption, with some disks and should not be used. +config SATA_AHCI_INTEL_NVME_REMAP + bool "AHCI: Intel Remapped NVMe device support" + depends on SATA_AHCI + depends on BLK_DEV_NVME + help + Support access to remapped NVMe devices that appear in AHCI PCI + memory space. + + You'll need this in order to access your NVMe storage if you are + running an Intel AHCI controller in "RAID" or "Intel RST Premium + with Intel Optane System Acceleration" mode. This is the default + configuration of many consumer products. 
If you have storage devices + being affected by this, you'll have noticed that such devices are + absent, and you'll see a warning in your kernel logs about remapped + NVMe devices. + + Instead of enabling this option, it is recommended to go into the + BIOS menu and change the SATA device into "AHCI" mode in order to + gain access to the affected devices, while also enjoying all + available NVMe features and performance. + + However, if you do want to access the NVMe devices in remapped + mode, say Y. Negative consequences of remapped device access + include: + - No NVMe device power management + - No NVMe reset support + - No NVMe quirks based on PCI ID + - No SR-IOV VFs + - Reduced performance through a shared, legacy interrupt + + If unsure, say N. + config SATA_AHCI_PLATFORM tristate "Platform AHCI SATA support" help diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index f7652baa6337..b58316347539 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -1499,11 +1500,11 @@ static irqreturn_t ahci_thunderx_irq_handler(int irq, void *dev_instance) } #endif -static void ahci_remap_check(struct pci_dev *pdev, int bar, +static int ahci_remap_init(struct pci_dev *pdev, int bar, struct ahci_host_priv *hpriv) { int i, count = 0; - u32 cap; + u32 supported_devs; /* * Check if this device might have remapped nvme devices. 
@@ -1511,33 +1512,68 @@ static void ahci_remap_check(struct pci_dev *pdev, int bar, if (pdev->vendor != PCI_VENDOR_ID_INTEL || pci_resource_len(pdev, bar) < SZ_512K || bar != AHCI_PCI_BAR_STANDARD || - !(readl(hpriv->mmio + AHCI_VSCAP) & 1)) - return; + !(readl(hpriv->mmio + AHCI_VS_CAP) & AHCI_VS_CAP_NRMBE)) + return -ENODEV; - cap = readq(hpriv->mmio + AHCI_REMAP_CAP); - for (i = 0; i < AHCI_MAX_REMAP; i++) { - if ((cap & (1 << i)) == 0) + supported_devs = readl(hpriv->mmio + AHCI_REMAP_RCR_L) +& AHCI_REMAP_RCR_L_NRS_MASK; + for_each_set_bit(i, (unsigned long *)_devs, AHCI_MAX_REMAP) { + struct ahci_remap *rdev; + u32 dcc; + + /* Check that the remapped device is NVMe */ + dcc = readl(hpriv->mmio + ahci_remap_dcc(i)); + if ((dcc & AHCI_REMAP_DCC_DT) != AHCI_REMAP_DCC_DT_NVME) continue; - if (readl(hpriv->mm
[PATCH v2 3/5] nvme: introduce nvme_dev_ops
In preparation for a platform device nvme driver, move the bus-specific
portions of nvme to nvme_dev_ops, or otherwise rewrite routines to use a
generic 'struct device' instead of 'struct pci_dev'.

Based on earlier work by Dan Williams.

Signed-off-by: Daniel Drake
---
 drivers/nvme/host/pci.c | 410 +++-
 1 file changed, 275 insertions(+), 135 deletions(-)

I took Dan Williams' earlier patch here and refreshed it for the latest
nvme driver, which has gained a few more places where it uses the PCI
device, so nvme_dev_ops grew a bit more.

Is this a suitable way of handling this case? It feels a little unclean to
have both the NVMe host layer and the PCI-specific dev ops in the same
file. Maybe it makes sense because NVMe is inherently a PCI thing under
normal circumstances? Or would it be cleaner for me to rename "pci.c" to
"mmio.c" and then separate the pci dev ops into a new "pci.c"?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 42990b93349d..23bda524f16b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -89,10 +89,51 @@ struct nvme_queue;
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+struct nvme_dev_ops {
+	/* Enable device (required) */
+	int (*enable)(struct nvme_dev *dev);
+
+	/* Disable device (required) */
+	void (*disable)(struct nvme_dev *dev);
+
+	/* Allocate IRQ vectors for given number of io queues (required) */
+	int (*setup_irqs)(struct nvme_dev *dev, int nr_io_queues);
+
+	/* Get the IRQ vector for a specific queue */
+	int (*q_irq)(struct nvme_queue *q);
+
+	/* Allocate device-specific SQ command buffer (optional) */
+	int (*cmb_alloc_sq_cmds)(struct nvme_queue *nvmeq, size_t size,
+				 struct nvme_command **sq_cmds,
+				 dma_addr_t *sq_dma_addr);
+
+	/* Free device-specific SQ command buffer (optional) */
+	void (*cmb_free_sq_cmds)(struct nvme_queue *nvmeq,
+				 struct nvme_command *sq_cmds, size_t size);
+
+	/* Device-specific mapping of blk queues to CPUs (optional) */
+	int (*map_queues)(struct nvme_dev *dev, struct blk_mq_queue_map *map,
+			  int offset);
+
+	/* Check if device is enabled on the bus (required) */
+	int (*is_enabled)(struct nvme_dev *dev);
+
+	/* Check if channel is in running state (required) */
+	int (*is_offline)(struct nvme_dev *dev);
+
+	/* Check if device is present and responding (optional) */
+	bool (*is_present)(struct nvme_dev *dev);
+
+	/* Check & log device state before it gets reset (optional) */
+	void (*warn_reset)(struct nvme_dev *dev);
+};
+
 /*
  * Represents an NVM Express device. Each nvme_dev is a PCI function.
  */
 struct nvme_dev {
+	const struct resource *res;
+	const struct nvme_dev_ops *ops;
 	struct nvme_queue *queues;
 	struct blk_mq_tag_set tagset;
 	struct blk_mq_tag_set admin_tagset;
@@ -178,6 +219,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
  */
 struct nvme_queue {
 	struct nvme_dev *dev;
+	char irqname[24];	/* nvme4294967295-65535\0 */
 	spinlock_t sq_lock;
 	struct nvme_command *sq_cmds;
 	/* only used for poll queues: */
@@ -384,6 +426,11 @@ static unsigned int nvme_pci_iod_alloc_size(struct nvme_dev *dev,
 	return alloc_size + sizeof(struct scatterlist) * nseg;
 }
 
+static int nvme_pci_q_irq(struct nvme_queue *nvmeq)
+{
+	return pci_irq_vector(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector);
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 				unsigned int hctx_idx)
 {
@@ -444,7 +491,14 @@ static int queue_irq_offset(struct nvme_dev *dev)
 	return 0;
 }
 
-static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
+static int nvme_pci_map_queues(struct nvme_dev *dev,
+			       struct blk_mq_queue_map *map,
+			       int offset)
+{
+	return blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+}
+
+static int nvme_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
 	int i, qoff, offset;
@@ -464,8 +518,8 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		 * affinity), so use the regular blk-mq cpu mapping
 		 */
 		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL && offset)
-			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		if (i != HCTX_TYPE_POLL && offset && dev->ops->map_queues)
+			dev->ops->map_queues(dev, map, offset);
[PATCH v2 0/5] Support Intel AHCI remapped NVMe devices
Intel SATA AHCI controllers support a strange mode where NVMe devices
disappear from the PCI bus, and instead are remapped into AHCI PCI memory
space. Many current and upcoming consumer products ship with the AHCI
controller in this "RAID" or "Intel RST Premium with Intel Optane System
Acceleration" mode by default. Without Linux support for this remapped
mode, the default out-of-the-box experience is that the NVMe storage
device is inaccessible (which in many cases is the only internal storage
device).

In most cases, the SATA configuration can be changed in the BIOS menu to
"AHCI", resulting in the AHCI & NVMe devices appearing as separate devices
as you would ordinarily expect. Changing this configuration is the
recommendation for power users, because there are several limitations of
the remapped mode (now documented in Kconfig help text).

However, it's also important to support the remapped mode given that it is
an increasingly common product default. We cannot expect ordinary users of
consumer PCs to find out about this situation and then confidently go into
the BIOS menu to change options.

This patch set implements support for the remapped mode. v1 of these
patches was originally posted by Dan Williams in 2016.
https://marc.info/?l=linux-ide&m=147709610621480&w=2

Since then:

- Intel stopped developing these patches & hasn't been responding to my
  emails on this topic.

- More register documentation appeared in
  https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf

- I tried Christoph's suggestion of exposing the devices on a fake PCI
  bus, hence not requiring NVMe driver changes, but Bjorn Helgaas does not
  think it's the right approach and instead recommends the approach taken
  here. https://marc.info/?l=linux-pci&m=156034736822205&w=2

- More consumer devices have appeared with this setting as the default,
  and with the decreasing cost of NVMe storage, it appears that a whole
  bunch more consumer PC products currently in development are going to
  ship in RAID/remapped mode, with only a single NVMe disk, which Linux
  will otherwise be unable to access by default.

- We heard from hardware vendors that this Linux incompatibility is
  causing them to consider discontinuing Linux support on affected
  products. Changing the BIOS setting is too much of a logistics
  challenge.

- I updated Dan's patches for current kernels. I added docs and references
  and incorporated the new register info. I incorporated feedback to push
  the recommendation that the user goes back to AHCI mode via the BIOS
  setting (in kernel logs and Kconfig help). And made some misc minor
  changes that I think are sensible.

- I investigated MSI-X support. Can't quite get it working, but I'm
  hopeful that we can figure it out and add it later. With these patches
  shared I'll follow up with more details about that. With the focus on
  compatibility with default configuration of common consumer products,
  I'm hoping we could land an initial version without MSI support before
  tending to those complications.

Dan Williams (2):
  nvme: rename "pci" operations to "mmio"
  nvme: move common definitions to pci.h

Daniel Drake (3):
  ahci: Discover Intel remapped NVMe devices
  nvme: introduce nvme_dev_ops
  nvme: Intel AHCI remap support

 drivers/ata/Kconfig                  |  33 ++
 drivers/ata/ahci.c                   | 173 --
 drivers/ata/ahci.h                   |  14 +
 drivers/nvme/host/Kconfig            |   3 +
 drivers/nvme/host/Makefile           |   3 +
 drivers/nvme/host/intel-ahci-remap.c | 185 ++
 drivers/nvme/host/pci.c              | 490 ++-
 drivers/nvme/host/pci.h              | 145 
 include/linux/ahci-remap.h           | 140 +++-
 9 files changed, 922 insertions(+), 264 deletions(-)
 create mode 100644 drivers/nvme/host/intel-ahci-remap.c
 create mode 100644 drivers/nvme/host/pci.h

--
2.20.1
Re: [RFC PATCH v3] rtl8xxxu: Improve TX performance of RTL8723BU on rtl8xxxu driver
On Wed, May 29, 2019 at 10:52 PM Chris Chiu wrote:
> You mean moving the ratr_index to be the 4th function parameter of
> update_rate_mask which has 2 candidates rtl8xxxu_update_rate_mask
> and rtl8xxxu_gen2_update_rate_mask? I was planning to keep the
> rtl8xxxu_update_rate_mask the same because old chips seems don't
> need the rate index when invoking H2C command to change rate mask.
> And rate index is not a common phrase/term for rate adaptive. Theoretically
> we just need packet error rate, sgi and other factors to determine the rate
> mask. This rate index seems to be only specific to newer Realtek drivers
> or firmware for rate adaptive algorithm. I'd like to keep this for gen2 but
> I admit it's ugly to put it in the priv structure. Any suggestion is
> appreciated.

I think it's cleaner to have it as a function parameter, even if the old
chips don't use it. The rest of the implementation is in the core, so we
aren't exactly dealing with chip-specific code here, at least the way it's
currently done. The vendor driver also has it as a function parameter from
what I can see.

Daniel
Re: [RFC PATCH v3] rtl8xxxu: Improve TX performance of RTL8723BU on rtl8xxxu driver
Hi Chris,

On Tue, May 28, 2019 at 11:03 PM Chris Chiu wrote:
> +	/*
> +	 * Single virtual interface permitted since the driver supports STATION
> +	 * mode only.

I think you can be a bit more explicit by saying e.g.:
Only one virtual interface permitted because only STA mode is supported
and no iface_combinations are provided.

> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> index 039e5ca9d2e4..2d612c2df5b2 100644
> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> @@ -4345,7 +4345,7 @@ void rtl8xxxu_gen2_update_rate_mask(struct rtl8xxxu_priv *priv,
>  	h2c.b_macid_cfg.ramask3 = (ramask >> 24) & 0xff;
>  
>  	h2c.ramask.arg = 0x80;
> -	h2c.b_macid_cfg.data1 = 0;
> +	h2c.b_macid_cfg.data1 = priv->ratr_index;

I think ratr_index can be moved to be a function parameter of the
update_rate_mask function. It looks like all callsites already know which
value they want to set. Then you don't have to store it in the priv
structure.

> @@ -5471,6 +5509,10 @@ static int rtl8xxxu_add_interface(struct ieee80211_hw *hw,
>  
>  	switch (vif->type) {
>  	case NL80211_IFTYPE_STATION:
> +		if (!priv->vif)
> +			priv->vif = vif;
> +		else
> +			return -EOPNOTSUPP;
>  		rtl8xxxu_stop_tx_beacon(priv);

rtl8xxxu_remove_interface should also set priv->vif back to NULL.

> @@ -6183,6 +6259,8 @@ static void rtl8xxxu_disconnect(struct usb_interface *interface)
>  	mutex_destroy(&priv->usb_buf_mutex);
>  	mutex_destroy(&priv->h2c_mutex);
>  
> +	cancel_delayed_work_sync(&priv->ra_watchdog);

Given that the work was started in rtl8xxxu_start, I think it should be
cancelled in rtl8xxxu_stop() instead.

Daniel
Re: [RFC PATCH 2/2] rtl8xxxu: Add watchdog to update rate mask by signal strength
On Mon, May 27, 2019 at 12:38 AM Chris Chiu wrote:
> The -EBUSY is returned by the ieee80211_check_combinations() in the
> ieee80211_check_concurrent_iface() function which is invoked each time
> doing ieee80211_open().
> The ieee80211_check_combinations() returns the -EBUSY because of
> cfg80211_check_combinations() will iterate all interfaces of different types
> then checks the combination is valid or not, which in this case the number
> of interface combination accumulated by cfg80211_iter_sum_ifcombos is 0
> when I'm trying to bring up the second station interface.

Thanks for clarifying. I see, rtl8xxxu does not provide any
iface_combinations so the default is to reject parallel interfaces.

Given that we can now confidently say that multiple interfaces are not
supported, I think that inside rtl8xxxu_add_interface() you could store a
pointer to the vif inside the rtl8xxxu_priv structure (and clear it in
rtl8xxxu_remove_interface). Plus, for clarity, add a comment pointing out
that only a single interface is supported, and make
rtl8xxxu_add_interface() actually fail if we already had a non-NULL
pointer stored in the priv structure.

Then you can simplify the patch to remove the ugly storing of vif inside
rtl8xxxu_watchdog. You can store the delayed_work in the main priv struct
too, dropping rtl8xxxu_watchdog altogether.

Daniel
Re: [PATCH v4 04/13] platform/x86: wmi: Add function to get _UID of WMI device
On Tue, May 14, 2019 at 12:59 PM Yurii Pavlovskyi wrote:
>
> Add a new function to acpi.h / wmi.c that returns _UID of the ACPI WMI
> device. For example, it returns "ATK" for the following declaration in
> DSDT:
> Device (ATKD)
> {
>     Name (_HID, "PNP0C14" /* Windows Management Instrumentation Device */)
>     // _HID: Hardware ID
>     Name (_UID, "ATK")  // _UID: Unique ID
> ..
>
> Generally, it is possible that multiple PNP0C14 ACPI devices are present in
> the system as mentioned in the commit message of commit bff431e49ff5
> ("ACPI: WMI: Add ACPI-WMI mapping driver").
>
> Therefore the _UID is returned for a specific ACPI device that declares the
> given GUID, to which it is also mapped by other methods of wmi module.
>
> Signed-off-by: Yurii Pavlovskyi

Some extra background may be useful when reviewing this.

As researched in https://marc.info/?l=linux-kernel&m=155498017207933&w=2,
we are dealing with a tricky situation. asus-wmi currently serves two
different classes of device: eeepci-wmi and asus-nb-wmi.

The eeepci devices have:
_WDG : includes a METHOD block with GUID ASUS_WMI_MGMT_GUID, and an
       EVENT block with GUID EEEPC_WMI_EVENT_GUID
_UID : ASUSWMI

The asus-nb-wmi devices have:
_WDG : includes a METHOD block with GUID ASUS_WMI_MGMT_GUID (same as
       eeepc), and an EVENT block with GUID ASUS_NB_WMI_EVENT_GUID
_UID : ATK

To support new devices we now need to start concretely identifying which
of these we are working with. But complications include:
- The main MGMT_GUID used for matching at the moment is shared over both
  device types
- Some Asus products have both of these (via two separate PNP0C14 WMI
  devices).

Currently eeepci-wmi and asus-nb-wmi register themselves with asus-wmi,
which registers a platform device for each one. The platform device probe
then succeeds when it finds any _WDG entry for the main MGMT_GUID and the
_WDG entry for the corresponding event GUID (not necessarily as part of
the same underlying ACPI Device). In the case of both devices being
present with duplicate MGMT, the first one that is parsed wins, and the
other is ignored (see guid_already_parsed()).

Sticking with the current approach, which is imperfect for devices that
have both devices, adding a method to detect the _UID seems reasonable.
Although actually I just realised you could probably also detect the
difference by using wmi_has_guid() on the EVENT UUID without having to add
a new function.

I'll keep thinking about how to improve the situation around two devices
present (but don't want to needlessly conflate this patch series with
that).

Daniel
Re: [PATCH v4 03/13] platform/x86: asus-wmi: Increase input buffer size of WMI methods
On Tue, May 14, 2019 at 12:54 PM Yurii Pavlovskyi wrote:
>
> The asus-nb-wmi driver is matched by WMI alias but fails to load on TUF
> Gaming series laptops producing multiple ACPI errors in the kernel log.
>
> The input buffer for WMI method invocation size is 2 dwords, whereas
> 3 are expected by this model.
>
> FX505GM:
> ..
> Method (WMNB, 3, Serialized)
> {
>     P8XH (Zero, 0x11)
>     CreateDWordField (Arg2, Zero, IIA0)
>     CreateDWordField (Arg2, 0x04, IIA1)
>     CreateDWordField (Arg2, 0x08, IIA2)
>     Local0 = (Arg1 & 0x)
> ...
>
> Compare with older K54C:
> ...
> Method (WMNB, 3, NotSerialized)
> {
>     CreateDWordField (Arg2, 0x00, IIA0)
>     CreateDWordField (Arg2, 0x04, IIA1)
>     Local0 = (Arg1 & 0x)
> ...
>
> Increase buffer size to 3 dwords. No negative consequences of this change
> are expected, as the input buffer size is not verified. The original
> function is replaced by a wrapper for a new method passing value 0 for the
> last parameter. The new function will be used to control RGB keyboard
> backlight.
>
> Signed-off-by: Yurii Pavlovskyi

Reviewed-by: Daniel Drake
Re: [PATCH v4 02/13] platform/x86: asus-wmi: Fix preserving keyboard backlight intensity on load
On Tue, May 14, 2019 at 12:51 PM Yurii Pavlovskyi wrote: > > The error code and return value are mixed up. The intensity is always set > to 0 on load as kbd_led_read returns either 0 or negative value. To > reproduce set backlight to maximum, reload driver and try to increase it > using keyboard hotkey, the intensity will drop as a result. Correct the > implementation. > > Signed-off-by: Yurii Pavlovskyi Reviewed-by: Daniel Drake
Re: [PATCH v4 01/13] platform/x86: asus-wmi: Fix hwmon device cleanup
On Tue, May 14, 2019 at 12:50 PM Yurii Pavlovskyi wrote: > > The driver does not clean up the hwmon device on exit or error. To > reproduce the bug, repeat rmmod, insmod to verify that device number > /sys/devices/platform/asus-nb-wmi/hwmon/hwmon?? grows every time. Replace > call for registering device with devm_* version that unregisters it > automatically. > > Signed-off-by: Yurii Pavlovskyi Reviewed-by: Daniel Drake
Re: [RFC PATCH 2/2] rtl8xxxu: Add watchdog to update rate mask by signal strength
On Fri, May 10, 2019 at 2:37 AM Chris Chiu wrote:
> I've verified that multiple virtual interface can not work simultaneously in
> STA mode. I assigned different mac address for different vifs, I can only
> bring only one interface up. If I want to bring the second vif up, it always
> complains "SIOCSIFFLAGS: Device or resource busy".

Interesting. Can you go deeper into that so that we can be more confident
of this limitation? ieee80211_open() is the starting point.
ieee80211_check_concurrent_iface() is one candidate to generate -EBUSY,
but from inspection I don't think that's happening in this case; perhaps
you can keep following through in order to figure out which part of the
code is not allowing the 2nd STA interface to come up.

Daniel
Re: [RFC PATCH 2/2] rtl8xxxu: Add watchdog to update rate mask by signal strength
On Thu, May 9, 2019 at 5:17 PM Chris Chiu wrote:
> I need the vif because there's seems no easy way to get RSSI. Please
> suggest if there's any better idea for this. I believe multiple vifs is for AP
> mode (with more than 1 virtual AP/SSIDs) and the Station+AP coexist
> mode. But the rtl8xxxu driver basically supports only Station mode.

Yes, the driver only lets you create station interfaces, but it lets you
create several of them. I'm not sure if that is intentional (and
meaningful), or if it's a bug. Maybe you can experiment with multiple
station interfaces and see if it works in a meaningful way?

Daniel
Re: [PATCH] i915: disable framebuffer compression on GeminiLake
Hi,

On Thu, Apr 25, 2019 at 4:27 AM Paulo Zanoni wrote:
>
> Em qua, 2019-04-24 às 20:58 +0100, Chris Wilson escreveu:
> > Quoting Jian-Hong Pan (2019-04-23 10:28:10)
> > > From: Daniel Drake
> > >
> > > On many (all?) the Gemini Lake systems we work with, there is frequent
> > > momentary graphical corruption at the top of the screen, and it seems
> > > that disabling framebuffer compression can avoid this.
> > >
> > > The ticket was reported 6 months ago and has already affected a
> > > multitude of users, without any real progress being made. So, lets
> > > disable framebuffer compression on GeminiLake until a solution is found.
> > >
> > > Buglink: https://bugs.freedesktop.org/show_bug.cgi?id=108085
> > > Signed-off-by: Daniel Drake
> > > Signed-off-by: Jian-Hong Pan
> >
> > Fixes: fd7d6c5c8f3e ("drm/i915: enable FBC on gen9+ too") ?
> > Cc: Paulo Zanoni
> > Cc: Daniel Vetter
> > Cc: Jani Nikula
> > Cc:  # v4.11+
> >
> > glk landed 1 month before, so that seems the earliest broken point.
>
> The bug is well reported, the bug author is helpful and it even has a
> description of "steps to reproduce" that looks very easy (although I
> didn't try it). Everything suggests this is a bug the display team
> could actually solve with not-so-many hours of debugging.
>
> In the meantime, unbreak the systems:
> Reviewed-by: Paulo Zanoni

Quick ping here. Any further comments on this patch? Can it be applied?

Thanks
Daniel
[tip:x86/apic] x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
Commit-ID:  604dc9170f2435d27da5039a3efd757dceadc684
Gitweb:     https://git.kernel.org/tip/604dc9170f2435d27da5039a3efd757dceadc684
Author:     Daniel Drake
AuthorDate: Thu, 9 May 2019 13:54:15 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 9 May 2019 11:06:48 +0200

x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency

native_calibrate_tsc() had a data mapping Intel CPU families and crystal
clock speed, but hardcoded tables are not ideal, and this approach was
already problematic at least in the Skylake X case, as seen in commit:

  b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon")

By examining CPUID data from http://instlatx64.atw.hu/ and units in the
lab, we have found that 3 different scenarios need to be dealt with, and
we can eliminate most of the hardcoded data using an approach a little
more advanced than before:

 1. ApolloLake, GeminiLake, CannonLake (and presumably all new chipsets
    from this point) report the crystal frequency directly via CPUID.0x15.
    That's definitive data that we can rely upon.

 2. Skylake, Kabylake and all variants of those two chipsets report a
    crystal frequency of zero, however we can calculate the crystal clock
    speed by considering data from CPUID.0x16.

    This method correctly distinguishes between the two crystal clock
    frequencies present on different Skylake X variants that caused
    headaches before.

    As the calculations do not quite match the previously-hardcoded values
    in some cases (e.g. 23913043Hz instead of 24MHz), TSC refinement is
    enabled on all platforms where we had to calculate the crystal
    frequency in this way.

 3. Denverton (GOLDMONT_X) reports a crystal frequency of zero and does
    not support CPUID.0x16, so we leave this entry hardcoded.

Suggested-by: Thomas Gleixner
Signed-off-by: Daniel Drake
Reviewed-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: len.br...@intel.com
Cc: li...@endlessm.com
Cc: rafael.j.wyso...@intel.com
Link: http://lkml.kernel.org/r/20190509055417.13152-1-dr...@endlessm.com
Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/tsc.c | 47 +++
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 15b5e98a86f9..6e6d933fb99c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -631,31 +631,38 @@ unsigned long native_calibrate_tsc(void)
 	crystal_khz = ecx_hz / 1000;
 
-	if (crystal_khz == 0) {
-		switch (boot_cpu_data.x86_model) {
-		case INTEL_FAM6_SKYLAKE_MOBILE:
-		case INTEL_FAM6_SKYLAKE_DESKTOP:
-		case INTEL_FAM6_KABYLAKE_MOBILE:
-		case INTEL_FAM6_KABYLAKE_DESKTOP:
-			crystal_khz = 24000;	/* 24.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT_X:
-			crystal_khz = 25000;	/* 25.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT:
-			crystal_khz = 19200;	/* 19.2 MHz */
-			break;
-		}
-	}
+	/*
+	 * Denverton SoCs don't report crystal clock, and also don't support
+	 * CPUID.0x16 for the calculation below, so hardcode the 25MHz crystal
+	 * clock.
+	 */
+	if (crystal_khz == 0 &&
+	    boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT_X)
+		crystal_khz = 25000;
 
-	if (crystal_khz == 0)
-		return 0;
 	/*
-	 * TSC frequency determined by CPUID is a "hardware reported"
+	 * TSC frequency reported directly by CPUID is a "hardware reported"
 	 * frequency and is the most accurate one so far we have. This
 	 * is considered a known frequency.
 	 */
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	if (crystal_khz != 0)
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+	/*
+	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
+	 * clock, but we can easily calculate it to a high degree of accuracy
+	 * by considering the crystal ratio and the CPU speed.
+	 */
+	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= 0x16) {
+		unsigned int eax_base_mhz, ebx, ecx, edx;
+
+		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
+		crystal_khz = eax_base_mhz * 1000 *
+			eax_denominator / ebx_numerator;
+	}
+
+	if (crystal_khz == 0)
+		return 0;
 
 	/*
 	 * For Atom SoCs TSC is the only reliable clocksource.
[tip:x86/apic] x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
Commit-ID:  52ae346bd26c7a8b17ea82e9a09671e98c5402b7
Gitweb:     https://git.kernel.org/tip/52ae346bd26c7a8b17ea82e9a09671e98c5402b7
Author:     Daniel Drake
AuthorDate: Thu, 9 May 2019 13:54:16 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 9 May 2019 11:06:49 +0200

x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'

This variable is a period unit (number of clock cycles per jiffy), not a
frequency (which is number of cycles per second). Give it a more
appropriate name.

Suggested-by: Ingo Molnar
Signed-off-by: Daniel Drake
Reviewed-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: len.br...@intel.com
Cc: li...@endlessm.com
Cc: rafael.j.wyso...@intel.com
Link: http://lkml.kernel.org/r/20190509055417.13152-2-dr...@endlessm.com
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/apic.h    |  2 +-
 arch/x86/kernel/apic/apic.c    | 20 ++--
 arch/x86/kernel/cpu/mshyperv.c |  4 ++--
 arch/x86/kernel/cpu/vmware.c   |  2 +-
 arch/x86/kernel/jailhouse.c    |  2 +-
 arch/x86/kernel/tsc_msr.c      |  4 ++--
 6 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 130e81e10fc7..fc505a84aa93 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -52,7 +52,7 @@ extern unsigned int apic_verbosity;
 extern int local_apic_timer_c2_ok;
 
 extern int disable_apic;
-extern unsigned int lapic_timer_frequency;
+extern unsigned int lapic_timer_period;
 
 extern enum apic_intr_mode_id apic_intr_mode;
 enum apic_intr_mode_id {
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ab6af775f06c..93de7862eef8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -194,7 +194,7 @@ static struct resource lapic_resource = {
 	.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
 };
 
-unsigned int lapic_timer_frequency = 0;
+unsigned int lapic_timer_period = 0;
 
 static void apic_pm_activate(void);
 
@@ -500,7 +500,7 @@ lapic_timer_set_periodic_oneshot(struct clock_event_device *evt, bool oneshot)
 	if (evt->features & CLOCK_EVT_FEAT_DUMMY)
 		return 0;
 
-	__setup_APIC_LVTT(lapic_timer_frequency, oneshot, 1);
+	__setup_APIC_LVTT(lapic_timer_period, oneshot, 1);
 	return 0;
 }
 
@@ -804,11 +804,11 @@ calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc)
 static int __init lapic_init_clockevent(void)
 {
-	if (!lapic_timer_frequency)
+	if (!lapic_timer_period)
 		return -1;
 
 	/* Calculate the scaled math multiplication factor */
-	lapic_clockevent.mult = div_sc(lapic_timer_frequency/APIC_DIVISOR,
+	lapic_clockevent.mult = div_sc(lapic_timer_period/APIC_DIVISOR,
 					TICK_NSEC, lapic_clockevent.shift);
 	lapic_clockevent.max_delta_ns =
 		clockevent_delta2ns(0x7FFFFFFF, &lapic_clockevent);
@@ -838,7 +838,7 @@ static int __init calibrate_APIC_clock(void)
 	 */
 	if (!lapic_init_clockevent()) {
 		apic_printk(APIC_VERBOSE, "lapic timer already calibrated %d\n",
-			    lapic_timer_frequency);
+			    lapic_timer_period);
 		/*
 		 * Direct calibration methods must have an always running
 		 * local APIC timer, no need for broadcast timer.
@@ -883,13 +883,13 @@ static int __init calibrate_APIC_clock(void)
 	pm_referenced = !calibrate_by_pmtimer(lapic_cal_pm2 - lapic_cal_pm1,
 					      &delta, &deltatsc);
 
-	lapic_timer_frequency = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
+	lapic_timer_period = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
 	lapic_init_clockevent();
 
 	apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta);
 	apic_printk(APIC_VERBOSE, "..... mult: %u\n", lapic_clockevent.mult);
 	apic_printk(APIC_VERBOSE, "..... calibration result: %u\n",
-		    lapic_timer_frequency);
+		    lapic_timer_period);
 
 	if (boot_cpu_has(X86_FEATURE_TSC)) {
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
@@ -900,13 +900,13 @@ static int __init calibrate_APIC_clock(void)
 	apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
 		    "%u.%04u MHz.\n",
-		    lapic_timer_frequency / (1000000 / HZ),
-		    lapic_timer_frequency % (1000000 / HZ));
+		    lapic_timer_period / (1000000 / HZ),
+		    lapic_timer_period % (1000000 / HZ));
 
 	/*
 	 * Do a sanity check on the APIC calibration result
 	 */
-	if (lapic_timer_frequency < (1000000 / HZ)) {
+	if (lapic_timer_period < (1000000 / HZ)) {
 		local_irq_enable();
 		pr_warning("
[tip:x86/apic] x86/tsc: Set LAPIC timer period to crystal clock frequency
Commit-ID:  2420a0b1798d7a78d1f9b395f09f3c80d92cc588
Gitweb:     https://git.kernel.org/tip/2420a0b1798d7a78d1f9b395f09f3c80d92cc588
Author:     Daniel Drake
AuthorDate: Thu, 9 May 2019 13:54:17 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 9 May 2019 11:06:49 +0200

x86/tsc: Set LAPIC timer period to crystal clock frequency

The APIC timer calibration (calibrate_APIC_timer()) can be skipped in
cases where we know the APIC timer frequency. On Intel SoCs, we believe
that the APIC is fed by the crystal clock; this would make sense, and the
crystal clock frequency has been verified against the APIC timer
calibration result on ApolloLake, GeminiLake, Kabylake, CoffeeLake,
WhiskeyLake and AmberLake.

Set lapic_timer_period based on the crystal clock frequency accordingly.

APIC timer calibration would normally be skipped on modern CPUs by nature
of the TSC deadline timer being used instead, however this change is still
potentially useful, e.g. if the TSC deadline timer has been disabled with
a kernel parameter. calibrate_APIC_timer() uses the legacy timer, but we
are seeing new platforms that omit such legacy functionality, so avoiding
such codepaths is becoming more important.

Suggested-by: Thomas Gleixner
Signed-off-by: Daniel Drake
Reviewed-by: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: len.br...@intel.com
Cc: li...@endlessm.com
Cc: rafael.j.wyso...@intel.com
Link: http://lkml.kernel.org/r/20190509055417.13152-3-dr...@endlessm.com
Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com
Link: https://lkml.kernel.org/r/alpine.deb.2.21.1904031206440.1...@nanos.tec.linutronix.de
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/tsc.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6e6d933fb99c..8f47c4862c56 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -671,6 +671,16 @@ unsigned long native_calibrate_tsc(void)
 	if (boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
+#ifdef CONFIG_X86_LOCAL_APIC
+	/*
+	 * The local APIC appears to be fed by the core crystal clock
+	 * (which sounds entirely sensible). We can set the global
+	 * lapic_timer_period here to avoid having to calibrate the APIC
+	 * timer later.
+	 */
+	lapic_timer_period = crystal_khz * 1000 / HZ;
+#endif
+
 	return crystal_khz * ebx_numerator / eax_denominator;
 }
Re: [RFC PATCH 1/2] rtl8xxxu: Add rate adaptive related data
Hi Chris,

Thanks for this! Some suggestions below, although let me know if any don't
make sense.

On Fri, May 3, 2019 at 3:22 PM Chris Chiu wrote:
>
> Add wireless mode, signal strength level, and rate table index
> to tell the firmware that we need to adjust the tx rate bitmap
> accordingly.
> ---
>  .../net/wireless/realtek/rtl8xxxu/rtl8xxxu.h  | 45 +++
>  .../wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 45 ++-
>  2 files changed, 89 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
> index 8828baf26e7b..771f58aa7cae 100644
> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
> @@ -1195,6 +1195,50 @@ struct rtl8723bu_c2h {
>
>  struct rtl8xxxu_fileops;
>
> +/*mlme related.*/
> +enum wireless_mode {
> +	WIRELESS_MODE_UNKNOWN = 0,
> +	//Sub-Element

Run these patches through checkpatch.pl, it'll have some suggestions to
bring the coding style in line, for example not using // style comments.

> +	WIRELESS_MODE_B = BIT(0), // tx: cck only , rx: cck only, hw: cck
> +	WIRELESS_MODE_G = BIT(1), // tx: ofdm only, rx: ofdm & cck, hw: cck & ofdm
> +	WIRELESS_MODE_A = BIT(2), // tx: ofdm only, rx: ofdm only, hw: ofdm only
> +	WIRELESS_MODE_N_24G = BIT(3), // tx: MCS only, rx: MCS & cck, hw: MCS & cck
> +	WIRELESS_MODE_N_5G = BIT(4), // tx: MCS only, rx: MCS & ofdm, hw: ofdm only
> +	WIRELESS_AUTO = BIT(5),
> +	WIRELESS_MODE_AC = BIT(6),
> +	WIRELESS_MODE_MAX = (WIRELESS_MODE_B|WIRELESS_MODE_G|WIRELESS_MODE_A|WIRELESS_MODE_N_24G|WIRELESS_MODE_N_5G|WIRELESS_MODE_AC),
> +};
> +
> +/* from rtlwifi/wifi.h */
> +enum ratr_table_mode_new {
> +	RATEID_IDX_BGN_40M_2SS = 0,
> +	RATEID_IDX_BGN_40M_1SS = 1,
> +	RATEID_IDX_BGN_20M_2SS_BN = 2,
> +	RATEID_IDX_BGN_20M_1SS_BN = 3,
> +	RATEID_IDX_GN_N2SS = 4,
> +	RATEID_IDX_GN_N1SS = 5,
> +	RATEID_IDX_BG = 6,
> +	RATEID_IDX_G = 7,
> +	RATEID_IDX_B = 8,
> +	RATEID_IDX_VHT_2SS = 9,
> +	RATEID_IDX_VHT_1SS = 10,
> +	RATEID_IDX_MIX1 = 11,
> +	RATEID_IDX_MIX2 = 12,
> +	RATEID_IDX_VHT_3SS = 13,
> +	RATEID_IDX_BGN_3SS = 14,
> +};
> +
> +#define RTL8XXXU_RATR_STA_INIT 0
> +#define RTL8XXXU_RATR_STA_HIGH 1
> +#define RTL8XXXU_RATR_STA_MID  2
> +#define RTL8XXXU_RATR_STA_LOW  3
> +
> +struct rtl8xxxu_rate_adaptive {
> +	u16 wireless_mode;
> +	u8 ratr_index;
> +	u8 rssi_level; /* INIT, HIGH, MIDDLE, LOW */
> +} __packed;

It would be better/cleaner to avoid storing data in per-device structures
if at all possible.

For wireless_mode, I think you should just call rtl8xxxu_wireless_mode()
every time from rtl8723b_refresh_rate_mask(). The work done there is
simple (i.e. it's not expensive to call) and then we avoid having to store
data (which might become stale etc).

For ratr_index, I believe you can just make it a parameter passed to
rtl8xxxu_gen2_update_rate_mask which is the only consumer of this
variable. The two callsites (rtl8xxxu_bss_info_changed and
rtl8723b_refresh_rate_mask) already know which value they want to be used.

rssi_level is the one that you probably do want to store, since I see the
logic in patch 2 - if the rssi level hasn't changed then you don't need to
set the rate mask again, and that's a good idea since it reduces bus
traffic. You could move this into rtl8xxxu_priv rather than having its own
separate structure.

After making those changes it might even make sense to collapse this all
into a single patch; you can judge!
Re: [PATCH v3 04/11] platform/x86: asus-wmi: Improve DSTS WMI method ID detection
> > -#define ASUS_WMI_METHODID_DSTS  0x53544344 /* Device STatuS */
> > -#define ASUS_WMI_METHODID_DSTS2 0x53545344 /* Device STatuS #2 */
> > +#define ASUS_WMI_METHODID_DSTS  0x53544344 /* Device STatuS (DCTS) */
>
> It's not clear from the description what 'C' stands for.
> Is there any specification which describes the difference and actual
> abbreviations?

The (recent) spec I have here doesn't mention 0x53544344 (DCTS). However,
the spec changelog does mention that EEEPC stuff was removed from the spec
a while ago.

The spec does mention 0x53545344 (DSTS), labelled as "Get device status".

For clarity I think the constants could be renamed as
ASUS_WMI_METHODID_DCTS and ASUS_WMI_METHODID_DSTS.

Daniel
[PATCH v2 1/3] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency
native_calibrate_tsc() had a hardcoded table mapping Intel CPU families to
crystal clock speeds, but hardcoded tables are not ideal, and this approach
was already problematic at least in the Skylake X case, as seen in commit
b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon").

By examining CPUID data from http://instlatx64.atw.hu/ and units in the
lab, we have found that 3 different scenarios need to be dealt with, and
we can eliminate most of the hardcoded data using an approach a little
more advanced than before:

1. ApolloLake, GeminiLake, CannonLake (and presumably all new chipsets
   from this point) report the crystal frequency directly via CPUID.0x15.
   That's definitive data that we can rely upon.

2. Skylake, Kabylake and all variants of those two chipsets report a
   crystal frequency of zero, however we can calculate the crystal clock
   speed by considering data from CPUID.0x16.

   This method correctly distinguishes between the two crystal clock
   frequencies present on different Skylake X variants that caused
   headaches before.

   As the calculations do not quite match the previously-hardcoded values
   in some cases (e.g. 23913043Hz instead of 24MHz), TSC refinement is
   enabled on all platforms where we had to calculate the crystal
   frequency in this way.

3. Denverton (GOLDMONT_X) reports a crystal frequency of zero and does
   not support CPUID.0x16, so we leave this entry hardcoded.

Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com
Suggested-by: Thomas Gleixner
Signed-off-by: Daniel Drake
---

Notes:
    v2:
    - Clarify the situation around Skylake X better.
    - Enable TSC refinement when we had to calculate the crystal clock,
      in case slight differences in the calculation result cause problems
      similar to those reported earlier on Skylake X.
 arch/x86/kernel/tsc.c | 47 +--
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 15b5e98a86f9..6e6d933fb99c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -631,31 +631,38 @@ unsigned long native_calibrate_tsc(void)
 	crystal_khz = ecx_hz / 1000;

-	if (crystal_khz == 0) {
-		switch (boot_cpu_data.x86_model) {
-		case INTEL_FAM6_SKYLAKE_MOBILE:
-		case INTEL_FAM6_SKYLAKE_DESKTOP:
-		case INTEL_FAM6_KABYLAKE_MOBILE:
-		case INTEL_FAM6_KABYLAKE_DESKTOP:
-			crystal_khz = 24000;	/* 24.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT_X:
-			crystal_khz = 25000;	/* 25.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT:
-			crystal_khz = 19200;	/* 19.2 MHz */
-			break;
-		}
-	}
+	/*
+	 * Denverton SoCs don't report crystal clock, and also don't support
+	 * CPUID.0x16 for the calculation below, so hardcode the 25MHz
+	 * crystal clock.
+	 */
+	if (crystal_khz == 0 &&
+	    boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT_X)
+		crystal_khz = 25000;

-	if (crystal_khz == 0)
-		return 0;
 	/*
-	 * TSC frequency determined by CPUID is a "hardware reported"
+	 * TSC frequency reported directly by CPUID is a "hardware reported"
	 * frequency and is the most accurate one so far we have. This
	 * is considered a known frequency.
	 */
-	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	if (crystal_khz != 0)
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+	/*
+	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
+	 * clock, but we can easily calculate it to a high degree of accuracy
+	 * by considering the crystal ratio and the CPU speed.
+	 */
+	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= 0x16) {
+		unsigned int eax_base_mhz, ebx, ecx, edx;
+
+		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
+		crystal_khz = eax_base_mhz * 1000 *
+			eax_denominator / ebx_numerator;
+	}
+
+	if (crystal_khz == 0)
+		return 0;

 	/*
	 * For Atom SoCs TSC is the only reliable clocksource.
--
2.20.1
[PATCH v2 2/3] x86/apic: rename lapic_timer_frequency to lapic_timer_period
This variable is a period unit (number of clock cycles per jiffy), not a frequency (which is number of cycles per second). Give it a more appropriate name. Suggested-by: Ingo Molnar Signed-off-by: Daniel Drake --- arch/x86/include/asm/apic.h| 2 +- arch/x86/kernel/apic/apic.c| 20 ++-- arch/x86/kernel/cpu/mshyperv.c | 4 ++-- arch/x86/kernel/cpu/vmware.c | 2 +- arch/x86/kernel/jailhouse.c| 2 +- arch/x86/kernel/tsc_msr.c | 4 ++-- 6 files changed, 17 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 130e81e10fc7..fc505a84aa93 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -52,7 +52,7 @@ extern unsigned int apic_verbosity; extern int local_apic_timer_c2_ok; extern int disable_apic; -extern unsigned int lapic_timer_frequency; +extern unsigned int lapic_timer_period; extern enum apic_intr_mode_id apic_intr_mode; enum apic_intr_mode_id { diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ab6af775f06c..93de7862eef8 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -194,7 +194,7 @@ static struct resource lapic_resource = { .flags = IORESOURCE_MEM | IORESOURCE_BUSY, }; -unsigned int lapic_timer_frequency = 0; +unsigned int lapic_timer_period = 0; static void apic_pm_activate(void); @@ -500,7 +500,7 @@ lapic_timer_set_periodic_oneshot(struct clock_event_device *evt, bool oneshot) if (evt->features & CLOCK_EVT_FEAT_DUMMY) return 0; - __setup_APIC_LVTT(lapic_timer_frequency, oneshot, 1); + __setup_APIC_LVTT(lapic_timer_period, oneshot, 1); return 0; } @@ -804,11 +804,11 @@ calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc) static int __init lapic_init_clockevent(void) { - if (!lapic_timer_frequency) + if (!lapic_timer_period) return -1; /* Calculate the scaled math multiplication factor */ - lapic_clockevent.mult = div_sc(lapic_timer_frequency/APIC_DIVISOR, + lapic_clockevent.mult = div_sc(lapic_timer_period/APIC_DIVISOR, 
 				TICK_NSEC, lapic_clockevent.shift);
 	lapic_clockevent.max_delta_ns =
 		clockevent_delta2ns(0x7FFFFFFF, &lapic_clockevent);
@@ -838,7 +838,7 @@ static int __init calibrate_APIC_clock(void)
 	 */
 	if (!lapic_init_clockevent()) {
 		apic_printk(APIC_VERBOSE, "lapic timer already calibrated %d\n",
-			    lapic_timer_frequency);
+			    lapic_timer_period);
 		/*
 		 * Direct calibration methods must have an always running
 		 * local APIC timer, no need for broadcast timer.
@@ -883,13 +883,13 @@ static int __init calibrate_APIC_clock(void)
 	pm_referenced = !calibrate_by_pmtimer(lapic_cal_pm2 - lapic_cal_pm1,
 					      &delta, &deltatsc);

-	lapic_timer_frequency = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
+	lapic_timer_period = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
 	lapic_init_clockevent();

 	apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta);
 	apic_printk(APIC_VERBOSE, "..... mult: %u\n", lapic_clockevent.mult);
 	apic_printk(APIC_VERBOSE, "..... calibration result: %u\n",
-		    lapic_timer_frequency);
+		    lapic_timer_period);

 	if (boot_cpu_has(X86_FEATURE_TSC)) {
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
@@ -900,13 +900,13 @@ static int __init calibrate_APIC_clock(void)
 	apic_printk(APIC_VERBOSE, "..... 
host bus clock speed is "
 		    "%u.%04u MHz.\n",
-		    lapic_timer_frequency / (1000000 / HZ),
-		    lapic_timer_frequency % (1000000 / HZ));
+		    lapic_timer_period / (1000000 / HZ),
+		    lapic_timer_period % (1000000 / HZ));

 	/*
 	 * Do a sanity check on the APIC calibration result
 	 */
-	if (lapic_timer_frequency < (1000000 / HZ)) {
+	if (lapic_timer_period < (1000000 / HZ)) {
 		local_irq_enable();
 		pr_warning("APIC frequency too slow, disabling apic timer\n");
 		return -1;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 3fa238a137d2..faae6115ddef 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -270,9 +270,9 @@ static void __init ms_hyperv_init_platform(void)
 		rdmsrl(HV_X64_MSR_APIC_FREQUENCY, hv_lapic_frequency);
 		hv_lapic_frequency = div_u64(hv_lapic_frequency, HZ);
-		lapic_timer_frequency = hv_lapic_frequency;
+		lapic_timer_period = hv_lapic_frequency;
 		p
[PATCH v2 3/3] x86/tsc: set LAPIC timer period to crystal clock frequency
The APIC timer calibration (calibrate_APIC_timer()) can be skipped in cases where we know the APIC timer frequency. On Intel SoCs, we believe that the APIC is fed by the crystal clock; this would make sense, and the crystal clock frequency has been verified against the APIC timer calibration result on ApolloLake, GeminiLake, Kabylake, CoffeeLake, WhiskeyLake and AmberLake. Set lapic_timer_period based on the crystal clock frequency accordingly. APIC timer calibration would normally be skipped on modern CPUs by nature of the TSC deadline timer being used instead, however this change is still potentially useful, e.g. if the TSC deadline timer has been disabled with a kernel parameter. calibrate_APIC_timer() uses the legacy timer, but we are seeing new platforms that omit such legacy functionality, so avoiding such codepaths is becoming more important. Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com Link: https://lkml.kernel.org/r/alpine.deb.2.21.1904031206440.1...@nanos.tec.linutronix.de Suggested-by: Thomas Gleixner Signed-off-by: Daniel Drake --- Notes: v2: - remove unnecessary braces - use newly renamed variable lapic_timer_period arch/x86/kernel/tsc.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 6e6d933fb99c..8f47c4862c56 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -671,6 +671,16 @@ unsigned long native_calibrate_tsc(void) if (boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT) setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); +#ifdef CONFIG_X86_LOCAL_APIC + /* +* The local APIC appears to be fed by the core crystal clock +* (which sounds entirely sensible). We can set the global +* lapic_timer_period here to avoid having to calibrate the APIC +* timer later. +*/ + lapic_timer_period = crystal_khz * 1000 / HZ; +#endif + return crystal_khz * ebx_numerator / eax_denominator; } -- 2.20.1
Re: sg_dma_page_iter offset & length considerations
On Wed, Apr 24, 2019 at 7:22 PM Jason Gunthorpe wrote: > A driver that simply wants a SGL with a capped max size (ie 4k?) > should use the dma_set_max_seg_size() API and just never get a SGE > with a larger length. That sounds like exactly what we want here. However I tried that function and it seems to have no effect. Larger SGEs still appear just as readily as before. I tried to find the code that honours the value set here, but didn't find anything. Sure there are some non-x86 archs and some iommu drivers that appear to do something with it, but in this case (standard Intel x86_64) it seems like nothing observes/honours this value. That did lead me back to the similarly-named mmc_host.max_seg_size though, which is passed through to blk_queue_max_segment_size(). That is documented as "Enables a low level driver to set an upper limit on the size of a coalesced segment", but perhaps I don't fully grasp what it means by "coalesced segment". The alcor driver currently has: mmc->max_segs = 64; mmc->max_seg_size = 240 * 512; which I had interpreted to mean "we can accept a sglist with maximum 64 entries, and if you sum the size of all entries in that sglist, the total must not exceed 240 sectors". i.e. with "coalesced" referring to the overall sum; the description does not suggest to me that we can influence the size of an individual segment here. Other mmc drivers may also share a similar interpretation, e.g. see atmel-mci.c: mmc->max_seg_size = mmc->max_blk_size * mmc->max_segs; but I just experimented with the driver again, and setting mmc_host.max_seg_size to 4096 does seem to do what we want here. Now we get multi-entry sg lists but each sge is max 4096 bytes in size. I'll do some more verification and then send a patch along these lines. Thanks for your comments! Daniel
sg_dma_page_iter offset & length considerations
Hi,

In drivers/mmc/alcor.c we're working with a MMC controller which supports
DMA transfers split up into page-sized chunks. A DMA transfer consists of
multiple pages; after each page is transferred there is an interrupt so
that the driver can program the DMA address of the next page in the
sequence. All pages must be complete in length, only the last one can be
a partial transfer.

I thought that the sg_dma_page_iter API looked like a great fit here: the
driver can accept any old sglist, and then use this new API to collapse
it into a list of pages that can be easily iterated over, and fed to the
hardware one at a time.

But looking closer I think I may have made some bad assumptions, and I'm
left with some fundamental questions about this API.

Specifically I can see userspace generates requests which present a
sglist such as:
 - first entry with offset=1536 length=2560
 - 7 entries with offset=0 length=4096
 - last entry with offset=0 length=1536

I gather that dma_map_sg() will take care of the offsets, i.e. any
physical address I get with sg_page_iter_dma_address() will already have
the offset applied, so I don't have to worry about tracking that.

But what about the length? For every page returned by the iterator, I
can't assume that I am being asked to work with the full page, right?
Such as the first and last page in the above example. I need to go back
to the sglist to check the corresponding length variable, and having to
go back to check the sglist seems to defeat the convenience of having the
sglist collapsed into a list of pages by the iterator.

Any comments?

Thanks
Daniel
Re: [PATCH 1/2] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency
On Tue, Apr 23, 2019 at 5:08 PM Peter Zijlstra wrote:
> Well, SKX isn't exactly 'missing'; it would be very good if we can
> confirm the new code is still correct under the below mentioned
> conditions.
>
> commit b511203093489eb1829cb4de86e8214752205ac6
> Author: Len Brown
> Date:   Fri Dec 22 00:27:55 2017 -0500
>
>     x86/tsc: Fix erroneous TSC rate on Skylake Xeon
>
>     The INTEL_FAM6_SKYLAKE_X hardcoded crystal_khz value of 25MHZ is
>     problematic:
>
>     - SKX workstations (with same model # as server variants) use a 24 MHz
>       crystal. This results in a -4.0% time drift rate on SKX workstations.

Checking http://instlatx64.atw.hu/ there are 11 platforms listed as
00050654 (skylake X), doing the calculations manually:

18-Core Intel Xeon Gold 6154, 3000 MHz: 25000000
2x 16-Core Intel Xeon Gold 6130, 2100 MHz: 25000000
2x 18-Core Intel Xeon Gold 6154, 3000 MHz: 25000000
2x 28-Core Intel Xeon Platinum 8180, 2500 MHz: 25000000
2x HexaCore Intel Xeon Bronze 3106: 25000000
2x OctalCore Intel Xeon Silver 4108, 3000 MHz: 25000000
10-Core Xeon W-2155: 23913043
HexaCore Intel Core i7-7800X: 23972602
10-Core Intel Core i9-7900X, 4000 MHz: 23913043
18-Core Intel Core i9-7980XE: 24074074
28-Core Intel Xeon W-3175X: 25000000

So given that the results include 24MHz and 25MHz crystal clocks
calculated on different products then it looks like this variance is
correctly accounted for.

>     - SKX servers subject the crystal to an EMI reduction circuit that
>       reduces its actual frequency by (approximately) -0.25%. This results
>       in -1 second per 10 minute time drift as compared to network time.
>
>     This issue can also trigger a timer and power problem, on configurations
>     that use the LAPIC timer (versus the TSC deadline timer). Clock ticks
>     scheduled with the LAPIC timer arrive a few usec before the time they are
>     expected (according to the slow TSC). This causes Linux to poll-idle, when
>     it should be in an idle power saving state. 
The idle and clock code do not
>     graciously recover from this error, sometimes resulting in significant
>     polling and measurable power impact.
>
>     Stop using native_calibrate_tsc() for INTEL_FAM6_SKYLAKE_X.
>     native_calibrate_tsc() will return 0, boot will run with tsc_khz = cpu_khz,
>     and the TSC refined calibration will update tsc_khz to correct for the
>     difference.

It sounds like I should add a condition:

	if (boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

so that refined TSC calibration will run on SKYLAKE_X.

That does raise another question though. Taking other platforms such as
KABYLAKE_MOBILE, the previous code hardcoded a precise 24MHz value, but
my new code calculates 23893805Hz. Is that small discrepancy small enough
to trigger the timer and power problem as described for SKYLAKE_X above?

If so, maybe the logic should instead be:

	if (CPUID.0x15 provided Crystal Hz, i.e. we didn't have to calculate it)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

This would enable TSC refinement on all variants of skylake and kabylake.

Daniel
[PATCH 1/2] x86/time: check usability of IRQ0 PIT timer
Modern Intel SoCs now include a special ITSSPRC register that can be used to "gate" the PIT such that IRQ0 interrupts do not fire. With Intel Apollo Lake we are starting to see consumer products that have a BIOS option to apply this (defaulting to gated). Some such products also lack the HPET ACPI table, so there is no HPET either. At this point, Linux needs to stop assuming that the IRQ0 timer is available. Move APIC code to check IRQ0 to time.c, then check and record the IRQ0 PIT timer usability after it is set up. If it does not produce any interrupts, unregister the clock event source. Signed-off-by: Daniel Drake Link: https://lkml.kernel.org/r/CAD8Lp45fedoPLnK=umuhhtkjy5u2h04sykrx3u+m04u6fpv...@mail.gmail.com --- arch/x86/include/asm/time.h| 2 + arch/x86/kernel/apic/io_apic.c | 101 --- arch/x86/kernel/i8253.c| 6 ++ arch/x86/kernel/time.c | 106 - drivers/clocksource/i8253.c| 6 ++ include/linux/clockchips.h | 3 + include/linux/i8253.h | 2 + kernel/time/tick-internal.h| 2 - 8 files changed, 134 insertions(+), 94 deletions(-) diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h index cef818b16045..e6e00d18b39f 100644 --- a/arch/x86/include/asm/time.h +++ b/arch/x86/include/asm/time.h @@ -10,4 +10,6 @@ extern void time_init(void); extern struct clock_event_device *global_clock_event; +extern bool irq0_timer_works(void); + #endif /* _ASM_X86_TIME_H */ diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 53aa234a6803..ae46da48c07b 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include #include @@ -1573,92 +1574,6 @@ void __init setup_ioapic_ids_from_mpc(void) } #endif -int no_timer_check __initdata; - -static int __init notimercheck(char *s) -{ - no_timer_check = 1; - return 1; -} -__setup("no_timer_check", notimercheck); - -static void __init delay_with_tsc(void) -{ - unsigned long long start, now; - unsigned long end = 
jiffies + 4;
-
-	start = rdtsc();
-
-	/*
-	 * We don't know the TSC frequency yet, but waiting for
-	 * 40000000000/HZ TSC cycles is safe:
-	 * 4 GHz == 10 jiffies
-	 * 1 GHz == 40 jiffies
-	 */
-	do {
-		rep_nop();
-		now = rdtsc();
-	} while ((now - start) < 40000000000ULL / HZ &&
-		time_before_eq(jiffies, end));
-}
-
-static void __init delay_without_tsc(void)
-{
-	unsigned long end = jiffies + 4;
-	int band = 1;
-
-	/*
-	 * We don't know any frequency yet, but waiting for
-	 * 40940000000/HZ cycles is safe:
-	 * 4 GHz == 10 jiffies
-	 * 1 GHz == 40 jiffies
-	 * 1 << 1 + 1 << 2 +...+ 1 << 11 = 4094
-	 */
-	do {
-		__delay(((1U << band++) * 10000000UL) / HZ);
-	} while (band < 12 && time_before_eq(jiffies, end));
-}
-
-/*
- * There is a nasty bug in some older SMP boards, their mptable lies
- * about the timer IRQ. We do the following to work around the situation:
- *
- *	- timer IRQ defaults to IO-APIC IRQ
- *	- if this function detects that timer IRQs are defunct, then we fall
- *	  back to ISA timer IRQs
- */
-static int __init timer_irq_works(void)
-{
-	unsigned long t1 = jiffies;
-	unsigned long flags;
-
-	if (no_timer_check)
-		return 1;
-
-	local_save_flags(flags);
-	local_irq_enable();
-
-	if (boot_cpu_has(X86_FEATURE_TSC))
-		delay_with_tsc();
-	else
-		delay_without_tsc();
-
-	local_irq_restore(flags);
-
-	/*
-	 * Expect a few ticks at least, to be sure some possible
-	 * glue logic does not lock up after one or two first
-	 * ticks in a non-ExtINT mode. Also the local APIC
-	 * might have cached one ExtINT interrupt. Finally, at
-	 * least one tick may be lost due to delays.
-	 */
-
-	/* jiffies wrap? */
-	if (time_after(jiffies, t1 + 4))
-		return 1;
-	return 0;
-}
-
 /*
  * In the SMP+IOAPIC case it might happen that there are an unspecified
  * number of pending IRQ events unhandled. These cases are very rare,
@@ -2066,6 +1981,12 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
 }

 /*
+ * In the SMP+IOAPIC case it might happen that there are an unspecified
+ * number of pending IRQ events unhandled. 
These cases are very rare, + * so we 'resend' these IRQs via IPIs, to the same CPU. It's much + * better to do it this way as thus we do not have to be aware of + * 'pending' interrupts in the IRQ path, except at this point. + * * This code may look a bit paranoid, but it's supposed to cooperate with * a wide range of boards and BIOS bugs.
[PATCH 2/2] x86/ioapic: avoid timer manipulation when IRQ0 timer is unavailable
New products based on Intel Apollo Lake are appearing where the HPET is
not present in ACPI, and the legacy 8254 PIT is "gated" by default in the
BIOS setup menu. This leads to an early boot "IO-APIC + timer doesn't
work!" kernel panic on a black screen (before the framebuffer is
initialized).

Avoid the IO-APIC IRQ0 timer manipulation & verification on platforms
where the legacy IRQ0 timer has been determined as unavailable. This
fixes boot on Connex L1430 and Scope SN116PYA with default BIOS settings.

Signed-off-by: Daniel Drake
Link: https://lkml.kernel.org/r/CAD8Lp45fedoPLnK=umuhhtkjy5u2h04sykrx3u+m04u6fpv...@mail.gmail.com
---
 arch/x86/kernel/apic/io_apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index ae46da48c07b..2d29c62abbcb 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2243,7 +2243,7 @@ void __init setup_IO_APIC(void)
 	sync_Arb_IDs();
 	setup_IO_APIC_irqs();
 	init_IO_APIC_traps();
-	if (nr_legacy_irqs())
+	if (global_clock_event && nr_legacy_irqs())
 		check_timer();

 	ioapic_initialized = 1;
--
2.19.1
Re: [PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency
On Mon, Apr 22, 2019 at 8:04 PM Ingo Molnar wrote:
> Minor style nit: the parentheses are unnecessary, integer expressions
> like this are evaluated left to right and multiplication and division has
> the same precedence.

Fair point, although the same could be said for cpu_khz_from_msr().

> But it might also make sense to actually store crystal_mhz instead of
> crystal_khz, because both CPUID 15H and 16H provides MHz values.
>
> That way the above expression would simplify to:
>
>	lapic_timer_frequency = crystal_mhz / HZ;

Wouldn't it be

	lapic_timer_frequency = crystal_mhz * 1000000 / HZ;

?

Thanks
Daniel
[PATCH 2/2] x86/tsc: set LAPIC timer frequency to crystal clock frequency
The APIC timer calibration (calibrate_APIC_timer()) can be skipped in cases where we know the APIC timer frequency. On Intel SoCs, we believe that the APIC is fed by the crystal clock; this would make sense, and the crystal clock frequency has been verified against the APIC timer calibration result on ApolloLake, GeminiLake, Kabylake, CoffeeLake, WhiskeyLake and AmberLake. Set lapic_timer_frequency based on the crystal clock frequency accordingly. APIC timer calibration would normally be skipped on modern CPUs by nature of the TSC deadline timer being used instead, however this change is still potentially useful, e.g. if the TSC deadline timer has been disabled with a kernel parameter. calibrate_APIC_timer() uses the legacy timer, but we are seeing new platforms that omit such legacy functionality, so avoiding such codepaths is becoming more important. Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com Link: https://lkml.kernel.org/r/alpine.deb.2.21.1904031206440.1...@nanos.tec.linutronix.de Suggested-by: Thomas Gleixner Signed-off-by: Daniel Drake --- arch/x86/kernel/tsc.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 3971c837584a..8750543287fc 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -665,6 +665,16 @@ unsigned long native_calibrate_tsc(void) if (boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT) setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); +#ifdef CONFIG_X86_LOCAL_APIC + /* +* The local APIC appears to be fed by the core crystal clock +* (which sounds entirely sensible). We can set the global +* lapic_timer_frequency here to avoid having to calibrate the APIC +* timer later. +*/ + lapic_timer_frequency = (crystal_khz * 1000) / HZ; +#endif + return crystal_khz * ebx_numerator / eax_denominator; } -- 2.19.1
[PATCH 1/2] x86/tsc: use CPUID.0x16 to calculate missing crystal frequency
native_calibrate_tsc() had a hardcoded table of Intel CPU families and
crystal clock speeds, but we have found that it is possible to calculate
the crystal clock speed, and this is preferred to a hardcoded table.

Where crystal clock frequency was not reported by CPUID.0x15, use
CPUID.0x16 data to calculate the crystal clock. Using CPUID dump data
from http://instlatx64.atw.hu/, the calculation results can be seen to be
sufficiently close to the previously hardcoded values:

SKYLAKE_MOBILE: 24074074 Hz
SKYLAKE_DESKTOP: 23913043 Hz
KABYLAKE_MOBILE: 23893805 Hz
KABYLAKE_DESKTOP: 24050632 Hz
GOLDMONT: 19.2MHz crystal clock correctly reported by CPUID.0x15

Additionally, crystal clock frequency for platforms that were missing
from the list (e.g. SKYLAKE_X) will now be provided.

GOLDMONT_X was left as a hardcoded value, as the CPUID data on that site
indicates that the hardware does not report crystal clock nor CPUID.0x16
data.

Going forward, Intel have hopefully now started providing crystal
frequency in CPUID.0x15. At least ApolloLake, GeminiLake and CannonLake
CPUs all provide the relevant data directly. 
Link: https://lkml.kernel.org/r/20190419083533.32388-1-dr...@endlessm.com
Suggested-by: Thomas Gleixner
Signed-off-by: Daniel Drake
---
 arch/x86/kernel/tsc.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 3fae23834069..3971c837584a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -629,23 +629,26 @@ unsigned long native_calibrate_tsc(void)
 	crystal_khz = ecx_hz / 1000;

-	if (crystal_khz == 0) {
-		switch (boot_cpu_data.x86_model) {
-		case INTEL_FAM6_SKYLAKE_MOBILE:
-		case INTEL_FAM6_SKYLAKE_DESKTOP:
-		case INTEL_FAM6_KABYLAKE_MOBILE:
-		case INTEL_FAM6_KABYLAKE_DESKTOP:
-			crystal_khz = 24000;	/* 24.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT_X:
-			crystal_khz = 25000;	/* 25.0 MHz */
-			break;
-		case INTEL_FAM6_ATOM_GOLDMONT:
-			crystal_khz = 19200;	/* 19.2 MHz */
-			break;
-		}
-	}
+	/*
+	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
+	 * clock, but we can easily calculate it by considering the crystal
+	 * ratio and the CPU speed.
+	 */
+	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= 0x16) {
+		unsigned int eax_base_mhz, ebx, ecx, edx;
+		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
+		crystal_khz = eax_base_mhz * 1000 * \
+			eax_denominator / ebx_numerator;
+	}

+	/*
+	 * Denverton SoCs don't report crystal clock, and also don't support
+	 * CPUID.0x16, so hardcode the 25MHz crystal clock.
+	 */
+	if (crystal_khz == 0 &&
+	    boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT_X)
+		crystal_khz = 25000;
+
 	if (crystal_khz == 0)
 		return 0;

 	/*
--
2.19.1
Re: [PATCH] platform/x86: asus-wmi: Add fn-lock mode switch support
On Thu, Apr 18, 2019 at 2:47 PM Chris Chiu wrote: > > Some of latest ASUS laptops support new fn-lock mode switching. > This commit detect whether if the fn-lock option is enabled in > BIOS setting, and toggle the fn-lock mode via a new WMI DEVID > 0x00100023 when the corresponding notify code captured. > > The ASUS fn-lock mode switch is activated by pressing Fn+Esc. > When on, keys F1 to F12 behave as applicable, with meanings > defined by the application being used at the time. When off, > F1 to F12 directly triggers hardware features, well known audio > volume up/down, brightness up/down...etc, which were triggered > by holding down Fn key and F-keys. > > Because there's no way to retrieve the fn-lock mode via existing > WMI methods per ASUS spec, driver need to initialize and keep the > fn-lock mode by itself. > > Signed-off-by: Chris Chiu Reviewed-by: Daniel Drake
Re: Detecting x86 LAPIC timer frequency from CPUID data
On Fri, Apr 19, 2019 at 6:30 AM Thomas Gleixner wrote:
> Time Stamp Counter/Core Crystal Clock Information (0x15):
>	TSC/clock ratio = 168/2
>	nominal core crystal clock = 0 Hz
>
> Processor Frequency Information (0x16):
>	Core Base Frequency (MHz) = 0x834 (2100)
>	Core Maximum Frequency (MHz) = 0xed8 (3800)
>	Bus (Reference) Frequency (MHz) = 0x64 (100)
>
> Assuming that TSC and local APIC timer run from the same frequency on these
> modern machines.
>
>	2100MHz * 2 / 168 = 25MHz
>
> and disabling the tsc deadline timer tells me:
>
>	..... calibration result: 24999
>
> Close enough.

I tested all the Intel SoC generations we have on hand. The assumption
that the core crystal clock feeds the APIC seems to be consistently true.

(Please note that all the following results are done with CONFIG_HZ=250,
which is why the "calibration result" is 4x higher than HZ=1000 as used
in previous mails)

In the easy case, the low cost platforms do not support CPUID.0x16 (so no
CPU frequency reporting), but they do tell us the core crystal clock,
which is consistent with the APIC calibration result:

N5000 (Gemini Lake)
[0.122948] ... lapic delta = 119999
[0.122948] ... PM-Timer delta = 357950
[0.122948] ... PM-Timer result ok
[0.122948] ..... delta 119999
[0.122948] ..... mult: 5153917
[0.122948] ..... calibration result: 76799
[0.122948] ..... CPU clock speed is 1094.1542 MHz.
[0.122948] ..... host bus clock speed is 19.0799 MHz.

Time Stamp Counter/Core Crystal Clock Information (0x15):
	TSC/clock ratio = 171/3
	nominal core crystal clock = 19200000 Hz

Processor Frequency Information (0x16):
	Core Base Frequency (MHz) = 0x0 (0)
	Core Maximum Frequency (MHz) = 0x0 (0)
	Bus (Reference) Frequency (MHz) = 0x0 (0)

N3350 (Apollo Lake)
[0.248894] ... lapic delta = 119999
[0.248894] ... PM-Timer delta = 357949
[0.248894] ... PM-Timer result ok
[0.248894] ..... delta 119999
[0.248894] ..... mult: 5153917
[0.248894] ..... calibration result: 76799
[0.248894] ..... CPU clock speed is 1094.1540 MHz.
[0.248894] ..... host bus clock speed is 19.0799 MHz.
Time Stamp Counter/Core Crystal Clock Information (0x15):
	TSC/clock ratio = 171/3
	nominal core crystal clock = 19200000 Hz

(CPUID 0x16 not supported at all)

And the 4 higher-end SoCs that we have available for testing all report
crystal clock 0Hz from CPUID 0x15, but by combining the CPUID.0x16 base
frequency with the CPUID.0x15 TSC/clock ratio, the crystal frequency can
be calculated as you describe, and it consistently matches the APIC timer
calibration result.

i9-9980HK (Coffee Lake HR)
[0.379421] ... lapic delta = 149998
[0.379421] ... PM-Timer delta = 357950
[0.379421] ... PM-Timer result ok
[0.379421] ..... delta 149998
[0.379421] ..... mult: 6442365
[0.379421] ..... calibration result: 95998
[0.379421] ..... CPU clock speed is 2399.3902 MHz.
[0.379421] ..... host bus clock speed is 23.3998 MHz.

Time Stamp Counter/Core Crystal Clock Information (0x15):
	TSC/clock ratio = 200/2
	nominal core crystal clock = 0 Hz

Processor Frequency Information (0x16):
	Core Base Frequency (MHz) = 0x960 (2400)
	Core Maximum Frequency (MHz) = 0x1388 (5000)
	Bus (Reference) Frequency (MHz) = 0x64 (100)

i7-8565U (Whiskey Lake)
[0.173776] ... lapic delta = 149998
[0.173776] ... PM-Timer delta = 357950
[0.173776] ... PM-Timer result ok
[0.173776] ..... delta 149998
[0.173776] ..... mult: 6442365
[0.173776] ..... calibration result: 95998
[0.173776] ..... CPU clock speed is 1991.3903 MHz.
[0.173776] ..... host bus clock speed is 23.3998 MHz.

Time Stamp Counter/Core Crystal Clock Information (0x15):
	TSC/clock ratio = 166/2
	nominal core crystal clock = 0 Hz

Processor Frequency Information (0x16):
	Core Base Frequency (MHz) = 0x7d0 (2000)
	Core Maximum Frequency (MHz) = 0x11f8 (4600)
	Bus (Reference) Frequency (MHz) = 0x64 (100)

i5-7200U (Kabylake)
[0.219142] ... lapic delta = 149998
[0.219142] ... PM-Timer delta = 357951
[0.219142] ... PM-Timer result ok
[0.219142] ..... delta 149998
[0.219142] ..... mult: 6442365
[0.219142] ..... calibration result: 95998
[0.219142] ..... CPU clock speed is 2711.3880 MHz.
[0.219142] ..... 
host bus clock speed is 23.3998 MHz.

Time Stamp Counter/Core Crystal Clock Information (0x15):
   TSC/clock ratio = 226/2
   nominal core crystal clock = 0 Hz

Processor Frequency Information (0x16):
   Core Base Frequency (MHz) = 0xa8c (2700)
   Core Maximum Frequency (MHz) = 0xc1c (3100)
   Bus (Reference) Frequency (MHz) = 0x64 (100)

m3-8110Y (Amber Lake)

[0.102289] ... lapic delta = 149998
[0.102289] ... PM-Timer delta = 357951
[0.102289] ... PM-Timer result ok
[0.102289] . delta 149998
[0.102289] . mult: 6442365
[
Detecting x86 LAPIC timer frequency from CPUID data
The CPUID.0x16 leaf provides "Bus (Reference) Frequency (in MHz)".

In the thread "No 8254 PIT & no HPET on new Intel N3350 platforms causes
kernel panic during early boot" we are exploring ways to have the kernel
avoid using the PIT/HPET IRQ0 timer in more cases, and Thomas Gleixner
suggested that we could use this CPUID data to set lapic_timer_frequency,
avoiding the need for calibrate_APIC_clock() to measure the APIC clock
against the IRQ0 timer.

I'm thinking of the following code change, however I get unexpected
results on Intel i7-8565U (Whiskey Lake).

When booting without this change, and with apic=notscdeadline (so that the
APIC clock gets calibrated and used), the bus speed is detected as 23MHz:

... lapic delta = 149994
... PM-Timer delta = 357939
... PM-Timer result ok
. delta 149994
. mult: 6442193
. calibration result: 23999
. CPU clock speed is 1991.0916 MHz.
. host bus clock speed is 23.0999 MHz.

However the CPUID.0x16 ECX reports a 100MHz bus speed on this device, so
this code change would produce a significantly different calibration.

Am I doing anything obviously wrong?

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 3fae23834069..6c51ce842f86 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -679,6 +679,16 @@ static unsigned long cpu_khz_from_cpuid(void)
 	cpuid(0x16, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
 
+#ifdef CONFIG_X86_LOCAL_APIC
+	/*
+	 * If the bus frequency is provided in CPUID data, set the
+	 * global lapic_timer_frequency to bus_clock_cycles/jiffy.
+	 * This avoids having to calibrate the APIC timer later.
+	 */
+	if (ecx_bus_mhz)
+		lapic_timer_frequency = (ecx_bus_mhz * 1000000) / HZ;
+#endif
+
 	return eax_base_mhz * 1000;
 }
-- 
2.19.1
Re: No 8254 PIT & no HPET on new Intel N3350 platforms causes kernel panic during early boot
On Wed, Apr 10, 2019 at 8:54 PM Thomas Gleixner wrote:
> On Tue, 9 Apr 2019, Daniel Drake wrote:
> > On Wed, Apr 3, 2019 at 7:21 PM Thomas Gleixner wrote:
> > > - Prevent the TSC calibration code from touching PIT/HPET. It
> > >   should do that already when the TSC frequency can be retrieved
> > >   via CPUID or MSR. Should work, emphasis on should ...
> >
> > From above, this seems to be working acceptably already. It does touch
> > the PIT, but ultimately ignores the information that it provided.
>
> Yes, but we might actually be smarter than that.

Do you have anything specific in mind?

You originally laid out this idea in the context of doing this if the
PIT/HPET is not working. However, I can't immediately see how to judge
that because:

- According to the analysis in my last mail, the PIT is actually ticking
  even when it is gated in the BIOS. The BIOS setting just seems to make
  it tick 4 times slower and not generate any IRQ0 interrupts.

- TSC calibration code runs really early during boot. To make it detect
  this situation we could make it check if IRQ0 is working, however
  setup_default_timer_irq() only happens a lot later, so I'm not sure how
  this could be checked at such an early stage.

I think I'm now fairly clear on the other suggestions you have made so
I'll see if I can come up with some patches.

Thanks!
Daniel
Re: [PATCH v2 10/11] platform/x86: asus-wmi: Switch fan boost mode
On Thu, Apr 11, 2019 at 1:47 PM Yurii Pavlovskyi wrote:
> * 0x00 - is normal,
> * 0x01 - is obviously turbo by the amount of noise, might be useful to
>   avoid CPU frequency throttling on high load,
> * 0x02 - the meaning is unknown at the time as modes are not named
>   in the vendor documentation, but it does look like a quiet mode as CPU
>   temperature does increase about 10 degrees on maximum load.

I'm curious which vendor documentation you're working with here.

From the spec,
  0 = normal
  1 = overboost
  2 = silent

Also you can use DSTS on the 0x00110018 device to check the exact
capabilities supported, which you should use to refine which modes can be
cycled through.
  Bit 0 = overboost supported
  Bit 1 = silent supported

Thanks
Daniel
Re: [PATCH v2 05/11] platform/x86: asus-wmi: Support queued WMI event codes
On Thu, Apr 11, 2019 at 1:44 PM Yurii Pavlovskyi wrote:
> Event codes are expected to be polled from a queue on at least some
> models.

Maybe avoid the word "poll" since it suggests something else (at least to
me).

> The fix flushes the old key codes out of the queue on load and after
> receiving an event the queue is read until either 0 or 1 is encountered.
>
> It might be considered a minor issue and no normal user is likely to
> observe this (there is little reason to unload the driver), but it does
> significantly frustrate a developer who is unlucky enough to encounter
> this.
>
> Introduce functionality for flushing and processing queued codes, which
> is enabled via a quirk flag for ASUS7000. It might be considered if it is
> reasonable to enable it everywhere (might introduce regressions) or
> always try to flush the queue on module load and try to detect if this
> quirk is present in the future.
>
> This patch limits the effect to the specific hardware defined by the
> ASUS7000 device that is used for driver detection by the vendor driver of
> Fx505. The fallback is also implemented in case the initial flush fails.

Which vendor driver are you referring to here?

Researching more: In our database of 144 Asus models, 142 of them have
GANQ. Of those 142, 9 of them return One in the empty-queue case. The
other 133 match your FX505GM device exactly. So it seems valid to
interpret both 0x0 and 0x1 as queue-end markers.

The 2 devices that do not have GANQ are not laptops. They also do not have
the _UID "ATK" WMI device, they only have "ASUSWMI", and their WMI _WED
method does not use a queue.

There are a few more units that have both ASUSWMI and ATK devices, and the
ASUSWMI device appears to never be queued. Another observation is that the
ASUSWMI device works with notify code 0xD2, and the ATK device works with
0xFF.

Nailing this down a bit further, I found a DSDT for EEEPC X101H: that one
only has ASUSWMI and it is also not queued.
So I think you should apply this queue behaviour more generically, but
either avoid it when the WMI device _UID is not "ATK" (as discussed in the
DCTS/DSTS thread), or avoid it when the notify code is not 0xFF.

Thanks
Daniel
Re: [PATCH 04/11] platform/x86: asus-wmi: Add quirk to force DSTS WMI method detection
On Thu, Apr 11, 2019 at 4:28 AM Yurii Pavlovskyi wrote:
> The DSTS method detection fails, as nothing is returned if the method is
> not defined in WMNB. As a result the control of keyboard backlight is not
> functional for TUF Gaming series laptops (at the time the only
> functionality of the driver on this model implemented with WMI methods).
>
> The patch was tested on a newer TUF Gaming FX505GM and an older K54C
> model.
>
> FX505GM:
> Method (WMNB, 3, Serialized)
> {
>     ...
>     If ((Local0 == 0x53545344))
>     {
>         ...
>         Return (Zero)
>     }
>     ...
>     // No return
> }
>
> K54C:
> Method (WMNB, 3, Serialized)
> {
>     ...
>     If ((Local0 == 0x53545344))
>     {
>         ...
>         Return (0x02)
>     }
>     ...
>     Return (0xFFFE)
> }
>
> The non-existing method ASUS_WMI_METHODID_DSTS=0x53544344 (actually it is
> DCTS in little endian ASCII) is selected in asus->dsts.
>
> One way to fix this would be to call both for every known device ID until
> one answers - this would increase module load time.
>
> Another option is to check some device that is known to exist on every
> model - none known at the time.
>
> The last option, which is implemented, is to check for the presence of
> the ASUS7000 device in the ACPI tree (it is a dummy device), which is the
> condition used for loading the vendor driver for this model. This might
> not fix every affected model ever produced, but it likely does not
> introduce any regressions. The patch introduces a quirk that is enabled
> when ASUS7000 is found.
>
> Scope (_SB)
> {
>     Device (ATK)
>     {
>         Name (_HID, "ASUS7000")  // _HID: Hardware ID
>     }
> }

Hmm, tricky! But it's about time we did something about the DSTS vs DCTS
guessing. While we don't have definitive knowledge of the right thing to
do here, I think I have enough info available to solidify some assumptions
we'd otherwise be afraid to make, and then we can implement a better
approach here.
In our database of 144 Asus DSDTs, 14 of them respond to DCTS:

  AS_D520MT
  D425MC
  D640SA
  D320SF-K
  D415MT
  D830MT
  G11DF
  M32CD4-K
  V221ID
  V272UN_SKU3
  Z240IE
  ZN220IC-K
  ZN241IC
  ZN270IE

Of those 14, they all additionally respond to DSTS, except for D415MT and
AS_D520MT.

In all 14 cases, the DCTS handling is done inside a device with _UID
ASUSWMI. None of the other 130 products have a device with that _UID.

Furthermore, we recently received access to the ASUS spec, which confirms
that the Instance Name for EeePC is "ACPI\PNP0C14\ASUSWMI_0" whereas the
Instance Name for other notebooks is "ACPI\PNP0C14\ATK_0".

The 12 devices that respond to both DCTS and DSTS do it on separate
devices, albeit with the same _WDG UUID. The one with _UID ASUSWMI
responds to DCTS, and the one with _UID ATK responds to DSTS.

So that seems fairly convincing. My thinking is that we can check the _UID
of the underlying device, and use DCTS for ASUSWMI, DSTS otherwise.

To do that, I think you'd have to rework the driver probing so that it
happens through wmi_driver_register(). Then from the probe function
onwards you will get a wmi_device, and hopefully you can get the _UID
through that instance somehow.

There's a separate issue of what happens on those 12 machines where there
are two WMI devs with the same _WDG GUID. drivers/platform/x86/wmi.c drops
duplicates, so one of them is being ignored in that case. We'd ideally
find a way to ignore the ASUSWMI node and go with ATK in that situation.
But I think this can be separated from your work here.

Thanks for these patches and welcome to the world of kernel development!
Daniel
Re: [PATCH 01/11] platform/x86: asus-wmi: Fix hwmon device cleanup
On Thu, Apr 11, 2019 at 4:21 AM Yurii Pavlovskyi wrote: > > The asus-wmi driver does not clean up the hwmon device on exit or error. > To reproduce the bug, repeat rmmod, insmod to verify that device number > /sys/devices/platform/asus-nb-wmi/hwmon/hwmon?? grows every time. Add > pointer to the device in module state and call cleanup on error. I wonder if this can be fixed more cleanly by using devm_hwmon_device_register_with_groups() ? Thanks Daniel
Re: No 8254 PIT & no HPET on new Intel N3350 platforms causes kernel panic during early boot
On Wed, Apr 3, 2019 at 7:21 PM Thomas Gleixner wrote:
> Btw, one of those links you provided
>
>   https://www.manualslib.com/manual/1316475/Ecs-Ed20pa2.html?page=23
>
> claims that you have to disable MWAIT as well. No idea why. Is MWAIT
> disabled on your platform?

I don't have that option in the BIOS. However there's no mention of
"mwait" nor "mwaitx" in /proc/cpuinfo. Checking our more general database
of 202 x86_64 consumer products released over the last few years, only 19
of them have mwait/mwaitx listed there, and they tend to be older
platforms.

> We have early-quirks.c in arch/x86/kernel/ for that.

Nice, we should be able to work around the issue there, but I hope we can
find something better...

> For newer CPUs we might assume that:
>
>   1) The TSC and APIC timer are actually usable
>
>   2) The frequencies can be retrieved from CPUID or MSRs
>
> If #1 and #2 are reliable we can avoid the whole calibration and
> interrupt delivery mess.

Let's take a step back and re-examine the wider sequence of events here
(which I've now done thanks to your pointers).

1. In very early boot, we face the TSC calibration challenge, arriving at
determine_cpu_tsc_frequencies(). This function calculates the CPU
frequency and TSC frequency separately.

For the CPU frequency, native_calibrate_cpu_early() tries to do it via
cpu_khz_from_cpuid() with CPUID leaf 0x16, but this is not available on
the platforms in question, which have max cpuid level 0x15.
cpu_khz_from_msr() is then tried, but that doesn't support this platform
either (looks like it only supports older SoC generations).

So now we arrive in quick_pit_calibrate(), which directly programs the PIT
and measures the TSC rate against the PIT ticks.
When the 8254 is ungated in the BIOS, this function fails early because:

	if (pit_expect_msb(0xff, &tsc, &d1)) {
		/* returned at count=13, d1 is now 32118 */
		for (i = 1; i <= MAX_QUICK_PIT_ITERATIONS; i++) {
			if (!pit_expect_msb(0xff-i, &delta, &d2))
				break;
			/* returned at count=13, d2 is now 48595 */

			delta -= tsc;
			/* delta is now 246741 */

			/*
			 * Extrapolate the error and fail fast if the error will
			 * never be below 500 ppm.
			 */
			if (i == 1 &&
			    d1 + d2 >= (delta * MAX_QUICK_PIT_ITERATIONS) >> 11)
				return 0;
			/* this return statement is hit, the calculation is:
			   32118 + 48595 >= (246741 * 233) >> 11 */

so the error was too high (I'm not sure why) and
determine_cpu_tsc_frequencies() records the CPU frequency as 0.

Then, comparing to when the 8254 is gated via the BIOS option, the
behaviour is surprising too. In that case, the quick_pit_calibrate() loop
runs through up to i=128, at which point the error level is low enough to
be accepted, calculating the CPU frequency as 4448MHz (4x higher than
reality). During each loop iteration, pit_expect_msb() returns when the
value changes at count=63 (compared to 13 in the PIT-ungated case). Does
this suggest the PIT is not actually fully gated, it's just ticking a lot
slower than otherwise?

Anyway, back in determine_cpu_tsc_frequencies() with the CPU frequency
calibration done, we now do TSC calibration. This one succeeds in all
cases via native_calibrate_tsc() using CPUID leaf 0x15 to read the correct
value. The TSC is 1094MHz.

Then, in both cases (8254 gated or not), the CPU frequency calculation is
discarded here, because it's wildly different from the TSC rate:

	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
		cpu_khz = tsc_khz;

So it seems that this code already behaves along the lines you describe:
it gives more trust to the TSC value read in the modern way, and does not
get upset if the CPU frequency calibration against the PIT didn't produce
a meaningful result.

2.
Significantly later during boot, x86_late_time_init() calls
hpet_time_init() which sets up either the PIT or HPET. However, as far as
I can see, there is no checking that the selected clock is actually
ticking. In the case of these affected platforms with the 8254 gated, we
sail right past this point without a working clock source.

3. x86_late_time_init() then calls apic_intr_mode_init() ->
apic_bsp_setup() -> setup_IO_APIC() and at this point we reach
check_timer(), which attempts to verify/fixup IRQ0 delivery via the
IO-APIC. At this point we check that jiffies increments, and if not,
panic.

4. Some time later, native_smp_prepare_cpus() calls
setup_boot_APIC_clock() -> setup_APIC_timer() which registers the local
APIC clocksource, replacing the previous PIT/HPET clocksource. There's no
check to make sure that the new clocksource is ticking, as far as I can
see.

5. Some time later, start_secondary() calls setup_secondary_APIC_clock()
-> setup_APIC_timer(), registering the APIC clocksource (again? or just
for another CPU core?).

Hopefully that analysis helps refine/elaborate the plan a bit more...

> That means we need the following decision logic:
>
>  1) If HPET is
No 8254 PIT & no HPET on new Intel N3350 platforms causes kernel panic during early boot
Hi,

I already wrote about this problem in the thread "APIC timer checked
before it is set up, boot fails on Connex L1430"
https://lkml.org/lkml/2018/12/28/10
However my initial diagnosis was misguided, and I have some new findings
to share now, so I'm starting over in this new thread. Also CCing Hans,
who also often attracts this class of problem on low cost hardware!

The problem is that on affected platforms, all Linux distros (and all
known kernel versions) fail to boot, hanging on a black screen. EFI
earlyprintk can be used to see the panic:

APIC: switch to symmetric I/O mode setup
x2apic: IRQ remapping doesn't support X2APIC mode
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
. (found apic 0 pin 2) ...
... failed.
...trying to set up timer as Virtual Wire IRQ...
. failed.
...trying to set up timer as ExtINT IRQ...
do_IRQ: 0.55 No irq handler for vector
. failed :(.
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
apic=debug and send a report.

After encountering this on the Connex L1430 last time, we have now
encountered another affected product, from a different vendor (SCOPE
SN116PYA). They both have an Intel Apollo Lake N3350 and AMI BIOS.

The code in question is making sure that the IRQ0 timer works, by waiting
for an interrupt. In this case there is no interrupt. The x86 platform
code in hpet_time_init() tries to enable the HPET timer for this, however
that is not available on these affected platforms (no HPET ACPI table).
So it then falls back on the 8253/8254 legacy PIT. The i8253.c driver is
invoked to program the PIT accordingly, however in this case it does not
result in any IRQ0 interrupts being generated --> panic.

I found a relevant setting in the BIOS:

  Chipset -> South Cluster Configuration -> Miscellaneous Configuration
  -> 8254 Clock Gating

This option is set to Enabled by default.
Setting it to Disabled makes the PIT tick and Linux boot finally works.

It's nice to have a workaround but I would hope we could do better -
especially because it seems like this problem is spreading. In addition
to the two products we found here, searching around finds several other
product manuals and discussions that tell you to go into the BIOS and
change this option if you want Linux to boot, some examples:

https://blog.csdn.net/qhtsm/article/details/88600316
https://www.manualslib.com/manual/1316475/Ecs-Ed20pa2.html?page=23
https://tools.exone.de/live/shop/img/produkte/fs_112124_2.pdf page 11

As another data point, Windows 10 boots fine in this no-PIT no-HPET
configuration.

Going deeper, I found the clock_gate_8254 option in the coreboot source
code. This pointed me to the ITSSPRC register, which is documented on
page 1694 of
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf

"8254 Static Clock Gating Enable (CGE8254): When set, the 8254 timer is
disabled statically. This bit shall be set by BIOS if the 8254 feature is
not needed in the system or before BIOS hands off the system that
supports C11. Normal operation of 8254 requires this bit to 0."

(what's C11?)
I verified that the BIOS setting controls this specific bit value, and I
also created and verified a workaround that unsets this bit - now Linux
boots fine regardless of the BIOS setting:

#define INTEL_APL_PSR_BASE	0xd0000000
#define INTEL_APL_PID_ITSS	0xd0
#define INTEL_PCR_PORTID_SHIFT	16
#define INTEL_APL_PCR_ITSSPRC	0x3300

static void quirk_intel_apl_8254(void)
{
	u32 addr = INTEL_APL_PSR_BASE | \
		   (INTEL_APL_PID_ITSS << INTEL_PCR_PORTID_SHIFT) | \
		   INTEL_APL_PCR_ITSSPRC;
	u32 value;
	void __iomem *itssprc = ioremap_nocache(addr, 4);

	if (!itssprc)
		return;

	value = readl(itssprc);
	if (value & 4) {
		value &= ~4;
		writel(value, itssprc);
	}
	iounmap(itssprc);
}

I was hoping I could send a workaround patch here, but I'm not sure of an
appropriate way to detect that we are on an Intel Apollo Lake platform.
This timer stuff happens during early boot, and the early quirks in
pci/quirks.c run too late for this. Suggestions appreciated.

Poking at other angles, I tried taking the HPET ACPI table from another
(working) Intel N3350 system and putting it in the initrd as an override.
This makes the HPET work fine, at which point Linux boots OK without
having to touch the (BIOS-crippled) PIT.

I also spotted that GRUB was previously affected by this BIOS-level
behaviour change.
http://git.savannah.gnu.org/cgit/grub.git/commit/?id=446794de8da4329ea532cbee4ca877bcafd0e534
Apparently GRUB used to rely on the 8254 PIT too, but it now uses the
pmtimer for TSC calibration instead. I guess the originally-affected
platforms only ran into GRUB freezing here (as opposed to both GRUB and
Linux freezing) because those platforms had a working HPET, meaning that
Linux was
Re: [PATCH] pinctrl: intel: save HOSTSW_OWN register over suspend/resume
On Thu, Mar 28, 2019 at 5:17 PM Andy Shevchenko wrote:
> Hmm... Can you confirm that laptop you declared as a fixed case and the
> mentioned here is the same one?

They are definitely not the same exact unit - originally we had a
pre-production sample, and now we briefly diagnosed a real production
unit that was sold to a customer. There could be subtle motherboard
variations as you mention.

> If it's the case, I recommend to ping Asus again and make them check
> and fix.

We'll keep an eye open for any opportunities to go deeper here. However
further investigation on both our side and theirs is blocked by not
having any of the affected hardware (since the models are now so old), so
I'm not very optimistic that we'll be able to make progress there.

> Meanwhile, Mika's proposal sounds feasible and not so intrusive. We may
> implement this later on.

Chris will work on implementing this for your consideration.

Thanks for the quick feedback!
Daniel
Re: [PATCH] pinctrl: intel: save HOSTSW_OWN register over suspend/resume
Hi Mika,

Digging up this old thread again...

On Tue, Nov 21, 2017 at 8:13 PM Mika Westerberg wrote:
> On Tue, Nov 21, 2017 at 07:54:26PM +0800, Chris Chiu wrote:
> > Yup, I checked the value of the corresponding pin. It shows the
> > following before suspend
> > pin 18 (GPIO_18) GPIO 0x40800102 0x00024075
> >
> > Then after resume
> > pin 18 (GPIO_18) GPIO 0x40800102 0x00024075 [ACPI]
>
> OK, so ownership is changed to ACPI.
>
> > What else register do you suggest me to compare? The PADCFG2 is
> > invalid
>
> It's fine, APL does not have PADCFG2.
>
> Hmm, I don't understand how this can work in Windows either. The
> Windows people told me that they don't save and restore anything else
> than padcfg registers + ie. If the ownership is changed to ACPI it
> means you don't get interrupts anymore (only GPEs) and that applies to
> Windows as well.

In the mails after the one quoted above, we reported back to you that the
new BIOS from Asus solved the issue.

However, during the time that has followed, we have had numerous user
reports from Asus E403NA, X540NA, and X541NA laptops (basically the same
models that we originally discussed) where the touchpad stops working
after suspend/resume, even with the latest BIOS.

We managed to get an affected E403NA unit in-hands again, and confirmed
that HOSTSW_OWN was being lost like we had observed before. Unfortunately
as this was a customer laptop we had to return it immediately, before we
could investigate further. We don't have access to any more units since
they are old models now.

However I'm wondering if you have any other ideas or if you think
something like our workaround patch might be acceptable under these
circumstances:
https://github.com/endlessm/linux/commit/f391452299f62a3d0cbe5333be90f69e9895d8ff

Thanks
Daniel
Re: [PATCH 3/3] Bluetooth: btrtl: Skip initialization if firmware is already loaded
On Thu, Jan 24, 2019 at 11:23 PM Kai-Heng Feng wrote:
> Realtek bluetooth may not work after reboot:
> [   12.446130] Bluetooth: hci0: RTL: rtl: unknown IC info, lmp subver
> a99e, hci rev 826c, hci ver 0008
>
> The power is not cut during system reboot, so the firmware is kept in
> the Bluetooth controller.
>
> Realtek bluetooth doesn't have the ability to check firmware loading
> status, but the version queried by HCI_OP_READ_LOCAL_VERSION will be
> different if firmware is already loaded. Realtek's own fork, rtk_btusb,
> also uses this method to detect the loading status.
>
> So let's assume the firmware is already loaded when we can't find
> matching IC info.

This logic was already present in the driver - but it looks like this
regressed at this point:

commit 26503ad25de8c7c93a2037f919c2e49a62cf65f1
Author: Martin Blumenstingl
Date:   Thu Aug 2 16:57:13 2018 +0200

    Bluetooth: btrtl: split the device initialization into smaller parts

After your patch it is effectively there in two places now, since it is
also in btrtl_download_firmware() (although not really effective after
the above commit). I wonder if that can be cleaned up to avoid
duplication.

Regarding the other patches that move away from the style of returning
either a useful value or an error, is this purely a stylistic thing or is
it needed for your 3rd patch? I don't have strong feelings either way,
but I have the impression that the currently implemented approach is a
common style within kernel code and I don't see benefit in splitting off
a separate out parameter.
Daniel

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201921
> Signed-off-by: Kai-Heng Feng
> ---
>  drivers/bluetooth/btrtl.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/bluetooth/btrtl.c b/drivers/bluetooth/btrtl.c
> index c36f500d8313..e2f89d57dd14 100644
> --- a/drivers/bluetooth/btrtl.c
> +++ b/drivers/bluetooth/btrtl.c
> @@ -546,9 +546,10 @@ int btrtl_initialize(struct hci_dev *hdev,
>  					    hdev->bus);
>
>  	if (!btrtl_dev->ic_info) {
> -		rtl_dev_err(hdev, "rtl: unknown IC info, lmp subver %04x, hci rev %04x, hci ver %04x",
> -			    lmp_subver, hci_rev, hci_ver);
> -		return -EINVAL;
> +		rtl_dev_info(hdev, "rtl: unknown IC info, lmp subver %04x, hci rev %04x, hci ver %04x",
> +			     lmp_subver, hci_rev, hci_ver);
> +		rtl_dev_info(hdev, "rtl: firmware may be already loaded, or it's an unsupported IC.");
> +		return 0;
>  	}
>
>  	if (btrtl_dev->ic_info->has_rom_version) {
> @@ -621,7 +622,8 @@ int btrtl_setup_realtek(struct hci_dev *hdev)
>  	if (ret)
>  		return ret;
>
> -	ret = btrtl_download_firmware(hdev, &btrtl_dev);
> +	if (btrtl_dev.ic_info)
> +		ret = btrtl_download_firmware(hdev, &btrtl_dev);
>
>  	return ret;
>  }
> --
> 2.17.1
[tip:x86/urgent] x86/kaslr: Fix incorrect i8254 outb() parameters
Commit-ID:  7e6fc2f50a3197d0e82d1c0e86282976c9e6c8a4
Gitweb:     https://git.kernel.org/tip/7e6fc2f50a3197d0e82d1c0e86282976c9e6c8a4
Author:     Daniel Drake
AuthorDate: Mon, 7 Jan 2019 11:40:24 +0800
Committer:  Thomas Gleixner
CommitDate: Fri, 11 Jan 2019 21:35:47 +0100

x86/kaslr: Fix incorrect i8254 outb() parameters

The outb() function takes parameters value and port, in that order. Fix
the parameters used in the kaslr i8254 fallback code.

Fixes: 5bfce5ef55cb ("x86, kaslr: Provide randomness functions")
Signed-off-by: Daniel Drake
Signed-off-by: Thomas Gleixner
Cc: b...@alien8.de
Cc: h...@zytor.com
Cc: li...@endlessm.com
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190107034024.15005-1-dr...@endlessm.com
---
 arch/x86/lib/kaslr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index 79778ab200e4..a53665116458 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -36,8 +36,8 @@ static inline u16 i8254(void)
 	u16 status, timer;
 
 	do {
-		outb(I8254_PORT_CONTROL,
-		     I8254_CMD_READBACK | I8254_SELECT_COUNTER0);
+		outb(I8254_CMD_READBACK | I8254_SELECT_COUNTER0,
+		     I8254_PORT_CONTROL);
 		status = inb(I8254_PORT_COUNTER0);
 		timer  = inb(I8254_PORT_COUNTER0);
 		timer |= inb(I8254_PORT_COUNTER0) << 8;
Re: APIC timer checked before it is set up, boot fails on Connex L1430
On Fri, Dec 28, 2018 at 2:11 PM Daniel Drake wrote:
> On the Connex L1430 laptop based on Intel Apollo Lake N3350, Linux
> doesn't boot. It hangs early on a blank screen. Reproduced with Linus
> git, 4.18 and 4.19 (there is no previous known working kernel version).
> EFI earlyprintk shows:
>
> APIC: switch to symmetric I/O mode setup
> x2apic: IRQ remapping doesn't support X2APIC mode
> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> ...trying to set up timer (IRQ0) through the 8259A ...
> . (found apic 0 pin 2) ...
> ... failed.
> ...trying to set up timer as Virtual Wire IRQ...
> . failed.
> ...trying to set up timer as ExtINT IRQ...
> do_IRQ: 0.55 No irq handler for vector
> . failed :(.
> Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
> apic=debug and send a report.
>
> Looking closer, check_timer() is observing that the IOAPIC timer
> doesn't tick, so it then tries some other approaches but doesn't manage
> to get them working either.
>
> The strange thing is, I booted with the no_timer_check parameter and
> the system works fine! With this parameter it assumes the IOAPIC timer
> is ticking and just continues the boot sequence anyway. Here is the
> boot log with apic=debug no_timer_check:
> https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-dmesg-txt
>
> /proc/interrupts shows that APIC Local timer interrupts are working
> fine on both CPUs:
> https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-interrupts-txt
>
> So, check_timer() is incorrectly deducing that the IOAPIC timer isn't
> working. The way it checks this is to do a delay loop and then check
> if jiffies has advanced. I presume the expectation here is that during
> this delay, the hardware IRQ will invoke local_apic_timer_interrupt()
> which will then increment jiffies.
> Indeed, during check_timer() execution this interrupt does not fire,
> however by using no_timer_check and adding a log message I can see that
> it fires for the first time quite some time after check_timer() is done:
>
> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> clocksource: tsc-early: mask: 0x max_cycles: 0xfc66f4fc7c, max_idle_ns: 440795224246 ns
> Calibrating delay loop (skipped), value calculated using timer frequency.. 2188.80 BogoMIPS (lpj=1094400)
> pid_max: default: 32768 minimum: 301
> LSM: Security Framework initializing
> SELinux: Initializing.
> Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
> Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
> mce: CPU supports 7 MCE banks
> mce: CPU0: Thermal monitoring enabled (TM1)
> Last level iTLB entries: 4KB 48, 2MB 0, 4MB 0
> Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
> Freeing SMP alternatives memory: 44K
> TSC deadline timer enabled
> smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6, model: 0x5c, stepping: 0x9)
> Performance Events: PEBS fmt3+, Goldmont events, 32-deep LBR, full-width counters, Intel PMU driver.
> ... version:              4
> ... bit width:            48
> ... generic registers:    4
> ... value mask:
> ... max period:           7fff
> ... fixed-purpose events: 3
> ... event mask:           0007000f
> rcu: Hierarchical SRCU implementation.
> smp: Bringing up secondary CPUs ...
> !!! local_apic_timer_interrupt for the first time cpu0 !!!
>
> Experimenting further, I used the same approach of adding delays and
> checking for the interrupt during the delay to figure out at which
> precise point during the boot sequence the timer interrupt starts
> working. It's here:
>
> static void setup_APIC_timer(void)
> {
> 	[...]
> 	if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
> 		[...]
> 		clockevents_config_and_register(levt,
> 					tsc_khz * (1000 / TSC_DIVISOR),
> 					0xF, ~0UL);
> 	}
> }
>
> We reach clockevents_register_device() which does:
> 1. Take a spinlock and disable IRQs
> 2. lapic_set_oneshot() which leads to "TSC deadline timer enabled" message
> 3. lapic_next_deadline()
> 4. Spin unlock & re-enable IRQs
>
> At the exact point where IRQs are re-enabled above, which is at the
> time of return from clockevents_config_and_register(), timer interrupts
> start working.
>
>
> The ov
[PATCH] kaslr: fix incorrect i8254 outb parameters
The outb call takes parameters value and port, in that order. Fix the
parameters used in the kaslr i8254 fallback code.

Signed-off-by: Daniel Drake
---
 arch/x86/lib/kaslr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index 79778ab200e4..a53665116458 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -36,8 +36,8 @@ static inline u16 i8254(void)
 	u16 status, timer;
 
 	do {
-		outb(I8254_PORT_CONTROL,
-		     I8254_CMD_READBACK | I8254_SELECT_COUNTER0);
+		outb(I8254_CMD_READBACK | I8254_SELECT_COUNTER0,
+		     I8254_PORT_CONTROL);
 		status = inb(I8254_PORT_COUNTER0);
 		timer  = inb(I8254_PORT_COUNTER0);
 		timer |= inb(I8254_PORT_COUNTER0) << 8;
-- 
2.19.1
APIC timer checked before it is set up, boot fails on Connex L1430
Hi,

On the Connex L1430 laptop based on Intel Apollo Lake N3350, Linux doesn't boot. It hangs early on a blank screen. Reproduced with Linus git, 4.18 and 4.19 (there is no previous known working kernel version).

EFI earlyprintk shows:

APIC: switch to symmetric I/O mode setup
x2apic: IRQ remapping doesn't support X2APIC mode
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 2) ...
....... failed.
...trying to set up timer as Virtual Wire IRQ...
..... failed.
...trying to set up timer as ExtINT IRQ...
do_IRQ: 0.55 No irq handler for vector
..... failed :(.
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report.

Looking closer, check_timer() is observing that the IOAPIC timer doesn't tick, so it then tries some other approaches but doesn't manage to get them working either.

The strange thing is, I booted with the no_timer_check parameter and the system works fine! With this parameter it assumes the IOAPIC timer is ticking and just continues the boot sequence anyway. Here is the boot log with apic=debug no_timer_check:
https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-dmesg-txt

/proc/interrupts shows that APIC Local timer interrupts are working fine on both CPUs:
https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-interrupts-txt

So, check_timer() is incorrectly deducing that the IOAPIC timer isn't working. The way it checks this is to do a delay loop and then check if jiffies has advanced. I presume the expectation here is that during this delay, the hardware IRQ will invoke local_apic_timer_interrupt() which will then increment jiffies.
Indeed, during check_timer() execution this interrupt does not fire, however by using no_timer_check and adding a log message I can see that it fires for the first time quite some time after check_timer() is done:

..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
clocksource: tsc-early: mask: 0x max_cycles: 0xfc66f4fc7c, max_idle_ns: 440795224246 ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 2188.80 BogoMIPS (lpj=1094400)
pid_max: default: 32768 minimum: 301
LSM: Security Framework initializing
SELinux: Initializing.
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
mce: CPU supports 7 MCE banks
mce: CPU0: Thermal monitoring enabled (TM1)
Last level iTLB entries: 4KB 48, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
Freeing SMP alternatives memory: 44K
TSC deadline timer enabled
smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6, model: 0x5c, stepping: 0x9)
Performance Events: PEBS fmt3+, Goldmont events, 32-deep LBR, full-width counters, Intel PMU driver.
... version:              4
... bit width:            48
... generic registers:    4
... value mask:
... max period:           7fff
... fixed-purpose events: 3
... event mask:           0007000f
rcu: Hierarchical SRCU implementation.
smp: Bringing up secondary CPUs ...
!!! local_apic_timer_interrupt for the first time cpu0 !!!

Experimenting further, I used the same approach of adding delays and checking for the interrupt during the delay to figure out at which precise point during the boot sequence the timer interrupt starts working. It's here:

static void setup_APIC_timer(void)
{
	[...]
	if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
		[...]
		clockevents_config_and_register(levt,
					tsc_khz * (1000 / TSC_DIVISOR),
					0xF, ~0UL);
	}
}

We reach clockevents_register_device() which does:
1. Take a spinlock and disable IRQs
2. lapic_set_oneshot() which leads to "TSC deadline timer enabled" message
3. lapic_next_deadline()
4. Spin unlock & re-enable IRQs

At the exact point where IRQs are re-enabled above, which is at the time of return from clockevents_config_and_register(), timer interrupts start working.

The overall ordering here seems surprising. check_timer() is probing whether the APIC timer works well before setup_APIC_timer() has been called. Shouldn't the timer be checked only after it has been set up? Or is Linux assuming that the BIOS will boot with the APIC timer already running?

Any other debugging suggestions much appreciated.

Thanks
Daniel
Re: [PATCH] ACPI / battery: Fix reporting "Not charging" when capacity is 100%
On Mon, Nov 5, 2018 at 1:19 AM Pavel Machek wrote:
> Plus, I don't think "100% charge" is right test for "battery full". At
> least on thinkpads, there's configuration option, and it is common
> _not_ to charge battery above 95% or so (to increase its lifetime).

Hans also touched on this area in his response:
> As for this kernel-side fix I do not believe that fixing this in
> the kernel is the right thing to do. We try to stay away from
> heuristics using full_charge_capacity in the kernel since that
> is not really reliable / deterministic.

I'm not fully convinced by this argument though. The ACPI spec is not very clear on what conditions you should apply to decide when the battery is full. Instead, ACPI seems to provide a pretty decent amount of data, and the decision about whether to interpret that as "battery full" is left for consumers.

The best that the ACPI spec offers here is the _BIF/_BIX "Last Full Charge Capacity" field (known as full_charge_capacity in this driver), and it does appear to account for the "only charge to 95%" case, although it is only documented as a prediction. There is also a sample calculation (section 3.9.3) of how to calculate the Remaining Battery Percentage which uses "Last Full Charge Capacity" too.

Without a crystal clear definition of battery full, it seems up to the consumers of the data to decide when the battery is full via some form of heuristic. And this is exactly what acpi-battery already does in acpi_battery_is_charged() - and the main aspect is looking at full_charge_capacity. The result of this is then sent to userspace.

So it does not seem fair to say that "you can't use basic heuristics around full_charge_capacity" because the kernel already does (and this is what the ACPI spec encourages), although JP's patch could be improved to more clearly and more completely follow the logic in acpi_battery_is_charged().
The question of whether we would want to interpret the numbers in this way when DISCHARGING is set is a good one though. The ACPI spec doesn't shine much light on this, but the calculations related to battery percentage and battery life in section 3.9.3 at least do not seem to consider charging/discharging status. Thanks Daniel
Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
Hi Suren,

On Fri, Sep 7, 2018 at 11:58 PM, Suren Baghdasaryan wrote:
> Thanks for the new patchset! Backported to 4.9 and retested on ARMv8 8
> core system running Android. Signals behave as expected reacting to
> memory pressure, no jumps in "total" counters that would indicate
> overflow/underflow issues.

Nicely done! Can you share your Linux v4.9 psi backport somewhere?

Thanks
Daniel
Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
On Thu, Sep 6, 2018 at 5:43 AM, Johannes Weiner wrote:
> Peter, do the changes from v3 look sane to you?
>
> If there aren't any further objections, I was hoping we could get this
> lined up for 4.20.

That would be excellent. I just retested the latest version at
http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the
results are great.

Test setup:
Endless OS
GeminiLake N4200 low end laptop
2GB RAM
swap (and zram swap) disabled

Baseline test: open a handful of large-ish apps and several website tabs in Google Chrome.
Results: after a couple of minutes, system is excessively thrashing, mouse cursor can barely be moved, UI is not responding to mouse clicks, so it's impractical to recover from this situation as an ordinary user

Add my simple killer:
https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
Results: when the thrashing causes the UI to become sluggish, the killer steps in and kills something (usually a chrome tab), and the system remains usable. I repeatedly opened more apps and more websites over a 15 minute period but I wasn't able to get the system to a point of UI unresponsiveness.

Thanks,
Daniel
Re: Built in PS2 keyboard in new ASUS/acer laptops can not wake up system after s2idle
On Mon, Aug 6, 2018 at 7:17 PM, Rafael J. Wysocki wrote:
>> 'echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup' can get the
>> keyboard wake up the system as expected. We considered to work out a DMI
>> based quirk for this. But based on the information that EC would not signal SCI
>> event for WiskyLake models, we'd like to know if there's any generic solution?
>> Maybe a 'udev' rule to identify WiskyLake NoteBook then enable the keyboard
>> wake up?
>
> A udev rule sounds like a good idea to me.

What would the udev rule look like though? Match for Intel CPU generation (WhiskyLake) and laptop chassis type and then enable i8042 wakeups? While that seems like the most accurate reflection of the situation which we are seeing across multiple vendors, it doesn't feel right and seems unlikely to be accepted by systemd upstream.

In previous designs, pressing a key while the system was in S3 sleep would cause a SCI interrupt due to the firing of the EC GPE, which effectively meant that keyboard wakeups were on by default and could not be disabled. Also USB keyboards have wakeups on by default (see usbhid_start()). Just these new platforms have this unfortunate/accidental behaviour change...

Would it make sense to turn i8042 wakeups on by default on the kernel side? I don't know if any particular conditions are applied, but that would appear to be the default Win10 behaviour here.

Daniel
Making direct reclaim fail when thrashing
Split from the thread:
[PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
where we were discussing if/how to make the direct reclaim codepath fail if we're excessively thrashing, so that the OOM killer might step in. This is potentially desirable when the thrashing is so bad that the UI stops responding, causing the user to pull the plug.

On Tue, Jul 17, 2018 at 7:23 AM, Michal Hocko wrote:
> mm/workingset.c allows for tracking when an actual page got evicted.
> workingset_refault tells us whether a given filemap fault is a recent
> refault and activates the page if that is the case. So what you need is
> to note how many refaulted pages we have on the active LRU list. If that
> is a large part of the list and if the inactive list is really small
> then we know we are thrashing. This all sounds much easier than it will
> eventually turn out to be of course but I didn't really get to play with
> this much.

Apologies in advance for any silly mistakes or terrible code that follows, as I am not familiar with this part of the kernel.

As mentioned in my last mail, knowing if a page on the active list was refaulted into place appears not trivial, because the eviction information was lost upon refault (it was stored in the page cache shadow entry).

Here I'm experimenting by adding another tag to the page cache radix tree, tagging pages that were activated in the refault path. And then in get_scan_count I'm checking how many active pages have that tag, and also looking at the size of the active and inactive lists. It has a performance cost (probably due to looping over the whole active list and doing lots of locking?) but I figured it might serve as one step forward.

The results are not exactly as I would expect.
Upon launching 20 processes that allocate and memset 100mb RAM each, exhausting all RAM (and no swap available), the kernel starts thrashing and I get numbers like:

get_scan_count lru1 active=422714 inactive=19595 refaulted=0
get_scan_count lru3 active=832 inactive=757 refaulted=21

Lots of active anonymous pages (lru1), and none refaulted, perhaps not surprising because it can't swap them out, no swap available.

But only a few file pages on the lists (lru3), and only a tiny number of refaulted ones, which doesn't line up with your suggestion of detecting when a large part of the active list is made up of refaulted pages.

Any further suggestions appreciated.

Thanks
Daniel

---
 include/linux/fs.h         |  1 +
 include/linux/radix-tree.h |  2 +-
 mm/filemap.c               |  2 ++
 mm/vmscan.c                | 37 +++++++++++++++++++++++++++++++++++++
 4 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index d85ac9d24bb3..45f94ffd1c67 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -462,6 +462,7 @@ struct block_device {
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
 #define PAGECACHE_TAG_TOWRITE	2
+#define PAGECACHE_TAG_REFAULTED	3
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 34149e8b5f73..86eccb71ef7e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -65,7 +65,7 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 3
+#define RADIX_TREE_MAX_TAGS 4
 
 #ifndef RADIX_TREE_MAP_SHIFT
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
diff --git a/mm/filemap.c b/mm/filemap.c
index 250f675dcfb2..9a686570dc75 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -917,6 +917,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 	 */
 	if (!(gfp_mask & __GFP_WRITE) && shadow &&
 	    workingset_refault(shadow)) {
+		radix_tree_tag_set(&mapping->i_pages, page_index(page),
+				   PAGECACHE_TAG_REFAULTED);
 		SetPageActive(page);
 		workingset_activation(page);
 	} else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..79bc810b43bb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2102,6 +2102,30 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+
+static int count_refaulted(struct lruvec *lruvec, enum lru_list lru)
+{
+	int nr_refaulted = 0;
+	struct page *page;
+
+	list_for_each_entry(page, &lruvec->lists[lru], lru) {
+		/* Lookup page cache entry from page following the approach
+		 * taken in __set_page_dirty_nobuffers */
+		unsigned long flags;
+		struct address_space *mapping = page_mapping(page);
+
+		if (!mapping)
+			continue;
+
+		xa_lock_irqsave(&mapping->i_pages, flags);
+		BUG_ON(page_mapping(page) != mapping);
+		nr_refaulted += radix_tree_tag_get(&mapping->i_pages,
+
Re: [PATCH 2/2] pinctrl/amd: use byte access to clear irq/wake status bits
On Fri, Jul 20, 2018 at 6:38 PM, Daniel Kurtz wrote:
> Sounds reasonable. How about:
>
> -			/* Clear interrupt.
> -			 * We must read the pin register again, in case the
> -			 * value was changed while executing
> -			 * generic_handle_irq() above.
> +			/*
> +			 * Write-1-to-clear irq/wake status bits in MSByte.
> +			 * All other bits in this byte are read-only.
> +			 * This avoids modifying the lower 24-bits because they may have
> +			 * changed while executing generic_handle_irq() above.
> 			 */

That looks good.

Thanks
Daniel
Re: [PATCH 2/2] pinctrl/amd: use byte access to clear irq/wake status bits
On Mon, Jul 16, 2018 at 7:57 PM, Daniel Kurtz wrote:
> Commit 6afb10267c1692 ("pinctrl/amd: fix masking of GPIO interrupts")
> changed the clearing of interrupt status bits to a RMW in a critical
> section. This works, but is a bit overkill.
>
> The relevant interrupt/wake status bits are in the Most Significant Byte
> of a 32-bit word. These two are the only write-able bits in this byte.

I don't have the hardware to test this any more, and I also don't have any docs to double-check whether those are really the only writable bits, but looking at the existing driver code it does seem to be the case.

I think you should retain the comment noting that the value of the register may have changed since it was read just a few lines above (and hence explaining more precisely why we make the special effort just to modify the MSB), just in case there is further rework of this code in future and we end up walking into the same trap. It was one of those issues that took a frustratingly long time to figure out...

Thanks
Daniel
Re: [PATCH v2 1/2] platform/x86: asus-wmi: Call led hw_changed API on kbd brightness change
Hi,

On Thu, Jun 14, 2018 at 1:58 AM, Chris Chiu wrote:
>
> On Wed, Jun 13, 2018 at 8:49 PM, Andy Shevchenko wrote:
> > On Mon, Jun 11, 2018 at 10:18 AM, Chris Chiu wrote:
> >> Make asus-wmi notify on hotkey kbd brightness changes, listen for
> >> brightness events and update the brightness directly in the driver.
> >
> >> For this purpose, bound check on brightness in kbd_led_set must be
> >> based on the same data type to prevent illegal value been set.
> >
> >> @@ -497,9 +498,9 @@ static void kbd_led_set(struct led_classdev *led_cdev,
> >>
> >>	asus = container_of(led_cdev, struct asus_wmi, kbd_led);
> >>
> >> -	if (value > asus->kbd_led.max_brightness)
> >> +	if ((int)value > (int)asus->kbd_led.max_brightness)
> >>		value = asus->kbd_led.max_brightness;
> >> -	else if (value < 0)
> >> +	else if ((int)value < 0)
> >>		value = 0;
> >
> > I didn't quite understand this part of the problem.
> > Does it exist in the current code? Can it be split to a separate change?
> >
> > Can we avoid those ugly castings?
>
> I'd like to remove the ugly castings but there's a concern I may need some
> advices. I don't know whether the bound check logic was ever verified before.
> Maybe the value passed via sysfs is already correctly bounded?

The casts come from the underlying need to limit the minimum and maximum brightness within available bounds, plus difficulties doing that when you are dealing with an enum data type.

kbd_led_set is a function pointer (from led_classdev.brightness_set) which has this definition:

  void (*brightness_set)(struct led_classdev *led_cdev,
			 enum led_brightness brightness);

It seems that the compiler has the choice of whether to use a signed or unsigned type for enums. According to your tests, and a quick test app below, it seems like gcc is using unsigned.
#include <stdio.h>

enum led_brightness { LED_OFF = 0 };

int main(void)
{
	enum led_brightness tmp = -1;

	if (tmp < 0)
		printf("less than zero\n");
	else
		printf("gt zero\n");
}

This test app prints "gt zero".

led-class.c:brightness_store() uses kstrtoul() so there is no chance of passing a negative value through this codepath, as you suspected. But we do need to do something with negative bounds for the code that you are adding here. I suggest doing it like this:

static void __kbd_led_set(struct led_classdev *led_cdev, int value)
{
	struct asus_wmi *asus;

	asus = container_of(led_cdev, struct asus_wmi, kbd_led);

	if (value > asus->kbd_led.max_brightness)
		value = asus->kbd_led.max_brightness;
	else if (value < 0)
		value = 0;

	asus->kbd_led_wk = value;
	queue_work(asus->led_workqueue, &asus->kbd_led_work);
}

static void kbd_led_set(struct led_classdev *led_cdev,
			enum led_brightness value)
{
	return __kbd_led_set(led_cdev, value);
}

Now kbd_led_set can continue being a correctly typed function pointer for led_classdev.brightness_set. And from the code you are adding here you can call __kbd_led_set directly with signed integer values, and rely on correct bounds correction without ugly casts.

Andy, what do you think?

Thanks
Daniel
Re: [PATCH 1/2] platform/x86: asus-wmi: Call new led hw_changed API on kbd brightness change
On Mon, Jun 4, 2018 at 7:22 AM, Hans de Goede wrote:
> Is this really a case of the hardware itself processing the
> keypress and then changing the brightness *itself* ?
>
> From the "[PATCH 2/2] platform/x86: asus-wmi: Add keyboard backlight
> toggle support" patch I get the impression that the driver is
> modifying the brightness from within the kernel rather than the
> keyboard controller or ACPI embedded-controller doing it itself.
>
> If that is the case then the right fix is for the driver to stop
> mucking with the brightness itself, it should simply report the
> right keyboard events and export a led interface and then userspace
> will do the right thing (and be able to offer flexible policies
> to the user).

Before this modification, the driver reports the brightness keypresses to
userspace and then userspace can respond by changing the brightness level,
as you describe. You are right in that the hardware doesn't change the
brightness directly itself, which is the normal usage of
LED_BRIGHT_HW_CHANGED. However, this approach was suggested by Benjamin
Berg and Bastien Nocera in the thread:

  Re: [PATCH v2] platform/x86: asus-wmi: Add keyboard backlight toggle support
  https://marc.info/?l=linux-kernel=152639169210655=2

The issue is that we need to support a new "keyboard backlight brightness
cycle" key (in the patch that follows this one) which doesn't fit into any
definitions of keys recognised by the kernel, and likewise there's no
userspace code to handle it.

If preferred, we could leave the standard brightness keys behaving as they
are (input events) and make the new special key type directly handled by
the kernel?

Daniel