Re: [PATCH] x86/pci: fix intel_mid_pci.c build error when ACPI is not enabled
On 8/13/2020 12:58 PM, Randy Dunlap wrote:
> From: Randy Dunlap
>
> Fix a build error when CONFIG_ACPI is not set/enabled by adding the
> header file which contains a stub for the function in the build error.
>
> ../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’:
> ../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
>   acpi_noirq_set();
>
> Signed-off-by: Randy Dunlap
> Cc: Jacob Pan
> Cc: Len Brown
> Cc: Bjorn Helgaas
> Cc: Jesse Barnes
> Cc: Arjan van de Ven
> Cc: linux-...@vger.kernel.org
> ---
> Found in linux-next, but applies to/exists in mainline also.
>
> Alternative 1: make X86_INTEL_MID depend on ACPI
> Alternative 2: drop X86_INTEL_MID support at this point

I'd suggest Alternative 2; the products that needed it (past tense, that technology is no longer needed for any newer products) never shipped in any form where a 4.x or 5.x kernel could work, and they are also all locked down...
Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
On 2/20/2019 7:35 AM, David Laight wrote:
> From: ... Sent: 16 February 2019 12:56 To: Li, Aubrey
> ...
> The above experiment just confirms what I said: the numbers are
> inaccurate and potentially misleading to a large extent when the
> AVX-using task is not scheduled out for a longer time. Not only that,
> they won't detect programs that use AVX-512 but never context switch
> with live AVX-512 registers.

you are completely correct in stating that this approach is basically sampling at a relatively coarse level, and such sampling will give false negatives. the alternative is not sampling, and not knowing anything at all, unless you have a better suggestion on how to help find tasks that use avx512 in a low overhead way (the typical use case is trying to find workloads that use avx512 to help scheduling those workloads in the future in the cloud orchestrator, for example to help them favor machines that support avx512 over machines that don't)
Re: [PATCH] x86/speculation: Add document to describe Spectre and its mitigations
On 1/14/2019 5:06 AM, Jiri Kosina wrote:
> On Mon, 14 Jan 2019, Pavel Machek wrote:
>> Frankly I'd not call it Meltdown, as it works only on data in the
>> cache, so the defense is completely different. Seems more like a l1tf
>> :-).
>
> Meltdown on x86 also seems to work only for data in L1D, but the
> pipeline could be constructed in a way that data are actually fetched
> into L1D before speculation gives up, which is not the case on ppc
> (speculation aborts on L2->L1 propagation IIRC). That's why flushing
> L1D on ppc is sufficient, but on x86 it's not.

assuming L1D is not shared between SMT threads obviously :)
Re: [PATCH] x86/speculation: Add document to describe Spectre and its mitigations
On 12/31/2018 8:22 AM, Ben Greear wrote:
> On 12/21/2018 05:17 PM, Tim Chen wrote:
>> On 12/21/18 1:59 PM, Ben Greear wrote:
>>> On 12/21/18 9:44 AM, Tim Chen wrote:
>>>> Thomas, Andi and I have made an update to our draft of the Spectre
>>>> admin guide. We may be out on Christmas vacation for a while. But we
>>>> want to send it out for everyone to take a look.
>>>
>>> Can you add a section on how to compile out all mitigations that have
>>> anything beyond negligible performance impact, for those running
>>> systems where performance is more important than security?
>>
>> If you don't worry about security and performance is paramount, then
>> boot with "nospectre_v2". That's explained in the document.
>
> There seem to be lots of different variants of this type of problem. It
> was not clear to me that just doing nospectre_v2 would be sufficient to
> get back full performance. And anyway, I would like to compile the
> kernel to not need that command-line option, so I am still interested
> in what compile options need to be set to what values...

the cloud people call this scenario "single tenant". there might be different "users" in the uid sense, but they're all owned by the same folks. it would not be insane to make a CONFIG_SINGLE_TENANT kind of option under which we can group these kinds of things (and likely others)
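A minimal sketch of what such a grouping option might look like in Kconfig terms. To be clear, `CONFIG_SINGLE_TENANT` and everything under it here is hypothetical; no such option exists upstream, this is only what the suggestion above could look like:

```kconfig
config SINGLE_TENANT
	bool "Optimize for single-tenant systems"
	help
	  All workloads on this machine are owned by a single tenant;
	  there may be multiple uids, but no mutually-distrusting users.
	  Cross-task side channel mitigations can then default to off
	  in favor of performance.

config SPECTRE_V2_DEFAULT_OFF
	bool "Default the spectre_v2 mitigation to off"
	depends on SINGLE_TENANT
	help
	  Behave as if booted with "nospectre_v2", without needing the
	  command-line option.
```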
Re: WARNING in __rcu_read_unlock
On 12/17/2018 3:29 AM, Paul E. McKenney wrote:
> As does this sort of report on a line that contains simple integer
> arithmetic and boolean operations. ;-) Any chance of a bisection?

btw this looks like something caused a stack overflow, and thus all the weirdness that then happens
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/11/2018 3:46 PM, Li, Aubrey wrote:
> On 2018/12/12 1:18, Dave Hansen wrote:
>> On 12/10/18 4:24 PM, Aubrey Li wrote:
>>> The tracking turns on the usage flag at the next context switch of
>>> the task, but requires 3 consecutive context switches with no usage
>>> to clear it. This decay is required because well-written AVX-512
>>> applications are expected to clear this state when not actively
>>> using AVX-512 registers.
>>
>> One concern about this: given a HZ=1000 system, this means that the
>> flag needs to get scanned every ~3ms. That's a pretty good amount of
>> scanning on a system with hundreds or thousands of tasks running
>> around. How many tasks does this scale to until you're eating up an
>> entire CPU or two just scanning /proc?
>
> Do we have a real requirement to do this in a practical environment?
> AFAIK, 1s or even 5s is good enough in some customer environments.

maybe instead of a 1/0 bit, it's useful to store the timestamp of the last time we found the task to use avx? (need to find a good time unit)
Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode
>> On processors with enhanced IBRS support, we recommend setting IBRS to
>> 1 and left set.
>
> Then why doesn't a CPU with EIBRS support actually *default* to '1',
> with an opt-out possibility for the OS?

(slightly longer answer) you can pretty much assume that on these CPUs, IBRS doesn't actually do anything (e.g. it's just a scratch bit). we could debate (and did :-)) for some time what the default value should be at boot, but it kind of is one of those minor issues that should not hold up getting things out. it could well be that the cpus that do this will ship with 1 as default, but it's hard to guarantee across many products and different CPU vendors when time was tight.
Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode
>> On processors with enhanced IBRS support, we recommend setting IBRS to
>> 1 and left set.
>
> Then why doesn't a CPU with EIBRS support actually *default* to '1',
> with an opt-out possibility for the OS?

the BIOSes could indeed get this set up this way. do you want to trust the bios to get it right?
Re: [patch 01/24] x86/speculation: Update the TIF_SSBD comment
On 11/21/2018 2:53 PM, Borislav Petkov wrote:
> On Wed, Nov 21, 2018 at 11:48:41PM +0100, Thomas Gleixner wrote:
>> Btw, I really do not like the app2app wording. I'd rather go for
>> usr2usr, but that's kinda horrible as well. But then, all of this is
>> horrible. Any better ideas?
>
> It needs to have "task isolation" in there somewhere as this is what it
> does, practically. But it needs to be more precise, as in "isolates the
> tasks from influence due to shared hardware." :)

part of the problem is that "sharing" has multiple dimensions: time and space (e.g. hyperthreading), which makes it hard to find a nice term for it other than describing who attacks whom
Re: STIBP by default.. Revert?
On 11/20/2018 11:27 PM, Jiri Kosina wrote:
> On Mon, 19 Nov 2018, Arjan van de Ven wrote:
>> In the documentation, AMD officially recommends against this by
>> default, and I can speak for Intel that our position is that as well:
>> this really must not be on by default.
>
> Thanks for pointing to the AMD doc, it's indeed clearly stated there.
> Is there any chance this could perhaps be added to Intel documentation
> as well, so that we avoid cases like this in the future?

absolutely, that's now already in progress; the doc publishing process is a bit on the long side unfortunately, so it won't be today ;)
Re: Re: STIBP by default.. Revert?
On 11/19/2018 6:00 AM, Linus Torvalds wrote:
> On Sun, Nov 18, 2018 at 1:49 PM Jiri Kosina wrote:
>> So why do that STIBP slow-down by default when the people who *really*
>> care already disabled SMT? BTW for them, there is no impact at all.
>
> Right. People who really care about security and are anal about it do
> not see *any* advantage of the patch.

In the documentation, AMD officially recommends against this by default, and I can speak for Intel that our position is that as well: this really must not be on by default.

STIBP and its friends are there as tools, and were created early on as big hammers because that is all one can add in a microcode update... expensive big hammers. In some ways it's analogous to the "disable caches" bit in CR0: sure, it's there as a big hammer, but you don't set that always just because caches could be used for a side channel.

Using these tools much more surgically is fine, if a paranoid task wants it for example, or when you know you are doing a hard core security transition. But always on? Yikes.
Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state
>> I'd prefer the kernel to do such clustering...
>
> I think that is a next step. Also, while the kernel can do this on a
> best effort basis, it cannot take into account things the kernel
> doesn't know about, like high priority job peak load etc., things a
> job scheduler would know. Then again, a job scheduler would likely
> already know about the AVX state anyway.

the job scheduler can guess. unless it can also *measure*, it won't know for sure... so even in that scenario having a decent way to report actuals is useful
Re: [RFC] x86, tsc: Add kcmdline args for skipping tsc calibration sequences
On 7/13/2018 12:19 PM, patrickg wrote:
> This RFC patch is intended to allow bypassing the CPUID, MSR and
> QuickPIT calibration methods should the user desire to.
>
> The current ordering in mainline x86 tsc is to calibrate in the order
> listed above, returning whenever there's a successful calibration.
> However there are certain BIOS/HW designs for overclocking that cause
> the TSC to change along with the max core clock; and simple 'trusting'
> calibration methodologies will lead to the TSC running 'faster' and,
> eventually, TSC instability.

that would be a real violation of the contract between cpu and OS: the tsc is not supposed to change for the duration of the boot

> I only know that there's a use-case for me to want to be able to skip
> CPUID calibration; however I included args for skipping all the rest
> just so that all functionality is covered in the long run instead of
> just one use-case.

wouldn't it be better to start the detailed calibration with the value from CPUID instead; that way we also properly calibrate spread spectrum etc... I thought we switched to that recently, to be honest...
Re: [RFC][PATCH] x86: proposed new ARCH_CAPABILITIES MSR bit for RSB-underflow
On 2/16/2018 11:43 AM, Linus Torvalds wrote:
> On Fri, Feb 16, 2018 at 11:38 AM, Linus Torvalds wrote:
>> Of course, your patch still doesn't allow for "we claim to be skylake
>> for various other independent reasons, but the RSB issue is fixed".
>
> .. maybe nobody ever has a reason to do that, though?

yeah, I would be extremely surprised

> Who knows, virtualization people may simply want the user to specify
> the model, but then make the Spectre decisions be based on actual
> hardware capabilities (whether those are "current" or "some minimum
> base").

once you fake being a skylake when you're not, you do that for a reason; normally that reason is that you COULD migrate to a skylake. (and migration is not supposed to be visible to the guest OS) and at that point you are a skylake for all intents and purposes. (and the virtualization people also really hate it when the hardware bursts the bubble of the faking and reveals itself to be something other than what it claims)
Re: [PATCH] platform/x86: intel_turbo_max_3: Remove restriction for HWP platforms
On 2/14/2018 11:29 AM, Andy Shevchenko wrote:
> On Mon, Feb 12, 2018 at 9:50 PM, Srinivas Pandruvada wrote:
>> On systems supporting HWP (Hardware P-States) mode, we expected to
>> enumerate core priority via ACPI-CPPC tables. Unfortunately deployment
>> of TURBO 3.0 didn't use this method to show core priority. So users
>> are not able to utilize this feature in HWP mode. So remove the
>> loading restriction of this driver for HWP-enabled systems. Even if
>> there are some systems which provide the core priority via ACPI CPPC,
>> this shouldn't cause any conflict, as the source of the priority
>> definition is the same.
>
> Pushed to my review and testing queue, thanks!
>
> P.S. Should it go to stable?

older stable at least did not have the problem
Re: [PATCH 4.9 43/92] x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
> So, any hints on what you think should be the correct fix here?

the patch sure looks correct to me; it now has a nice table of CPU IDs, including all of AMD (and soon, hopefully, the existing Intel ones that are not exposed to meltdown)
Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
On 1/31/2018 2:15 AM, Thomas Gleixner wrote:
> Good luck with making all that work.

on the Intel side we're checking what we can do that works and doesn't break things right now; hopefully we just end up with a bit in the arch capabilities MSR for "you should do RSB stuffing" and then the HVs can emulate that.

(people sometimes think that should be a 5 minute thing, but we need to check many cpu models/etc to make sure a bit we pick is really free etc, which makes it take longer than some folks have patience for)
Re: [PATCH] x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
On 1/30/2018 5:11 AM, Borislav Petkov wrote:
> On Tue, Jan 30, 2018 at 01:57:21PM +0100, Thomas Gleixner wrote:
>> So much for the theory. That's not going to work. If the boot cpu has
>> the feature then the alternatives will have been applied. So even if
>> the flag mismatch can be observed when a secondary CPU comes up, the
>> outcome will be access to a non-existing MSR and #GP.
>
> Yes, with mismatched microcode we're f*cked.

I think in the super early days of SMP there was the occasional broken BIOS (and when Linux then did the ucode update it was sane again). Not for a long time though (I think the various certification suites check for it now).

> So my question is: is there such microcode out there, or is this
> something theoretical which we want to address?

at this point it's insanely theoretical; no OS can actually cope with this, so if you're an OEM selling this, your customer can run zero OSes ;-)

(.. and addressing this will be ugly, no matter what.)

> And if I were able to wish, I'd like to blacklist that microcode in
> dracut so that it doesn't come anywhere near my system.

I'm not sure what you'd want dracut to do... panic() the system on such a bios?
Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
On 1/29/2018 7:32 PM, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 5:32 PM, Arjan van de Ven wrote:
>> the most simple solution is that we set the internal feature bit in
>> Linux to turn on the "stuff the RSB" workaround if we're on a SKL *or*
>> running as a guest in a VM.
>
> That sounds reasonable. However, wouldn't it be even better to extend
> on the current cpuid model, and actually have some real architectural
> bits in there. Maybe it could be a bit in that IA32_ARCH_CAPABILITIES
> MSR. Say, add a bit #2 that says "ret falls back on BTB". Then that bit
> basically becomes the "Skylake bit". Hmm?

we can try to do that, but existing systems don't have that, and then we get into another long thread here about weird lists of stuff ;-)
Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
On 1/29/2018 4:23 PM, Linus Torvalds wrote:
> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protecting guests from each other, no?

the most simple solution is that we set the internal feature bit in Linux to turn on the "stuff the RSB" workaround if we're on a SKL *or* running as a guest in a VM.

the stuffing is not free, but it's also not insane either... so if it's turned on in guests, the impact is still limited, while bare metal doesn't need it at all
Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
> The question is how the hypervisor could tell that to the guest. If
> Intel doesn't give us a CPUID bit that can be used to tell that
> retpolines are enough, maybe we should use a hypervisor CPUID bit for
> that?

the objective is to have retpoline be safe everywhere and never use IBRS (Linus was also pretty clear about that), so I'm confused by your question
Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
On 1/26/2018 10:11 AM, David Woodhouse wrote:
> I am *actively* ignoring Skylake right now. This is about pre-SKL
> userspace even with SMEP, because we think Intel's document lies to us.

if you think we lie to you, then I think we're done with the conversation? please tell us then what you deploy in AWS for your customers? or show us research that shows we lied to you?
Re: [PATCH v3 5/6] x86/pti: Do not enable PTI on processors which are not vulnerable to Meltdown
On 1/26/2018 7:27 AM, Dave Hansen wrote:
> On 01/26/2018 04:14 AM, Yves-Alexis Perez wrote:
>> I know we'll still be able to manually enable PTI with a command line
>> option, but it's also a hardening feature which has the nice side
>> effect of emulating SMEP on CPUs which don't support it (e.g. the Atom
>> boxes above).
>
> For Meltdown-vulnerable systems, it's a no-brainer: pti=on. The
> vulnerability there is just too much. But, if we are going to change
> the default, IMNHO, we need a clear list of what SMEP emulation
> mitigates and where. RSB-related Variant 2 stuff on Atom, where the
> kernel speculatively 'ret's back to userspace, is certainly a concern.
> But there's a lot of other RSB stuffing going on that will mitigate
> that too. Were you thinking of anything concrete?

not Atom though. Atom has had SMEP for a very long time; at least the ones that do speculation do, afaict. SMEP is for other bugs (a dud kernel function pointer), and for that, emulating SMEP is an interesting opt-in for sure.
Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
> This patch tries to address the case when we do switch to init_mm and
> back. Do you still have objections to the approach in this patch to
> save the last active mm before switching to init_mm?

how do you know the last active mm did not go away, with a new process starting with new content? (other than taking a reference, which has other side effects)
Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
> The idea is simple, do what we do for virt. Don't send IPIs to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark
> the CPU as having needed a TLB flush.

I am really uncomfortable with that idea. You really can't run code safely on a cpu where the TLBs in the CPU are invalid, or where a CPU that does (partial) page walks would install invalid PTEs, either through actual or through speculative execution.

(in the virt case there's a cheat: since the code is not actually running, there isn't a cpu with TLBs live. You can't do that same cheat for this case)
Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
> On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>>> This means that 'A -> idle -> A' should never pass through switch_mm
>>> to begin with. Please clarify how you think it does.
>>
>> the idle code does leave_mm() to avoid having to IPI CPUs in deep
>> sleep states for a tlb flush.
>
> The intel_idle code does, not the idle code. This is squirreled away in
> some driver :/

afaik (but haven't looked in a while) acpi drivers did too

>> (trust me, you really want that; sequentially IPI'ing a pile of cores
>> in a deep sleep state just to flush a tlb that's empty, the
>> performance of that is horrific)
>
> Hurmph. I'd rather fix that some other way than leave_mm(), this is
> piling special on special.

the problem was tricky. but of course, if something better is possible, let's figure this out.

problem is that an IPI to an idle cpu is both power inefficient and will take time; exit of a deep C state can be, say, in the 50 to 100 usec range (it varies by many things, but for abstractly thinking about the problem one should generally round up to nice round numbers).

if you have, say, 64 cores that had the mm at some point, but 63 are idle, the 64th really does not want to IPI each of those 63 serially (technically this does not need to be serial, but IPI code is tricky; some things end up serializing this a bit) and take the 100 usec hit 63 times. actually, even if it's not serialized, even ONE hit of 100 usec is unpleasant.

so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a general objective. but not getting flush IPIs is only safe if the TLBs in the CPU have nothing that such an IPI would want to flush, so the TLB needs to be empty of those things. the only way to do THAT is to switch to an mm that is safe; a leave_mm() does this, but I'm sure other options exist.
note: while a CPU that is in a deeper C state will itself flush the TLB, you don't know if you will actually enter that deep a state at the time of making OS decisions (if an interrupt comes in the cycle before mwait, mwait becomes a nop, for example). in addition, once you wake up, you don't want the CPU to go start filling the TLBs with invalid data, so you can't really just set a bit and flush after leaving idle.
Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process
>> This means that 'A -> idle -> A' should never pass through switch_mm
>> to begin with. Please clarify how you think it does.

the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states for a tlb flush.

(trust me, you really want that; sequentially IPI'ing a pile of cores in a deep sleep state just to flush a tlb that's empty, the performance of that is horrific)
Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
On 1/24/2018 1:10 AM, Greg Kroah-Hartman wrote:
>> That means the whitelist ends up basically empty right now. Should I
>> add a command line parameter to override it? Otherwise we end up
>> having to rebuild the kernel every time there's a microcode release
>> which covers a new CPU SKU (which is why I kind of hate the whitelist,
>> but Arjan is very insistent...)
>
> Ick, no, whitelists are a pain for everyone involved. Don't do that
> unless it is absolutely the only way it will ever work. Arjan, why do
> you think this can only be done as a whitelist?

I suggested a minimum version list for those cpus that need it. microcode versions are tricky (and we've released betas etc etc with their own numbers), and as a result there might be several numbers that have those issues with their IBRS for the same F/M/S.
Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
On 1/21/2018 8:21 AM, Ingo Molnar wrote:
> So if it's only about the scheduler barrier, what cycle cost are we
> talking about here?

in the order of 5000 to 10000 cycles. (depends a bit on the cpu generation, but this range is a reasonable approximation)

> Because putting something like this into an ELF flag raises the
> question of who is allowed to set the flag - does a user-compiled
> binary count? If yes then it would be a trivial thing for local
> exploits to set the flag and turn off the barrier.

the barrier is about who you go TO, e.g. the thing under attack. as you say, making it depend on the thing that would be the evil one does not work.
Re: kexec reboot fails with extra wbinvd introduced for AMD SME
> Does anybody have any other ideas?

the only other weird case that comes to mind: what happens if there's a line dirty in the caches, but the memory is now mapped uncached? (which could happen if kexec mucks with MTRRs, CR0 or other similar things in weird ways)... not sure what happens in the CPU; a machine check for cache inclusion violations is not beyond the imagination, and might be lethal. this would explain a kexec-specific angle versus general normal (but rare) use of wbinvd.

the other weird case could be cached mmio (not common, but some gpus and the like can do it) with the iommu/VT-d in the middle, and during kexec VT-d shutting down the iommu before the wbinvd. This would be... highly odd... but this report already is in highly odd space.
Re: kexec reboot fails with extra wbinvd introduced for AMD SME
> Does anybody have any other ideas?

wbinvd is thankfully not common, but also not rare (MTRR setup and a bunch of other cases), and in some other operating systems it happens even more than on Linux... it's generally not totally broken like this.

I can only imagine a machine check case where a write back to a bad cell causes some parity error or something... but it's odd that no other machine checks are reported? (can the user check for this please)
Re: [tip:x86/pti] x86/retpoline: Fill RSB on context switch for affected CPUs
> This would mean that userspace would see return predictions based on
> the values the kernel 'stuffed' into the RSB to fill it. Potentially
> this leaks a kernel address to userspace.

KASLR pretty much died in May this year, to be honest, with the KAISER paper (if not before then). also, with KPTI the address won't have a TLB mapping, so it wouldn't actually be speculated into.
Re: [PATCH 3/8] kvm: vmx: pass MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD down to the guest
On 1/10/2018 5:20 AM, Paolo Bonzini wrote:
> * a simple specification that does "IBRS=1 blocks indirect branch
> prediction altogether" would actually satisfy the specification just as
> well, and it would be nice to know if that's what the processor
> actually does.

it doesn't exactly, not for all. so you really do need to write ibrs again.
Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
On 1/9/2018 8:17 AM, Paolo Bonzini wrote:
> On 09/01/2018 16:19, Arjan van de Ven wrote:
>> IBRS will ensure that, when set after the ring transition, no earlier
>> branch prediction data is used for indirect branches while IBRS is set
>
> Let me ask you my questions, which are independent of L0/L1/L2
> terminology.
>
> 1) Is vmentry/vmexit considered a ring transition, even if the guest is
> running in ring 0? If IBRS=1 in the guest and the host is using IBRS,
> the host will not do a wrmsr on exit. Is this safe for the host kernel?

I think the CPU folks would want us to write the msr again.

> 2) How will the future processors work where IBRS should always be =1?

IBRS=1 should be "fire and forget this ever happened". this is the only time anyone should use IBRS in practice (and then the host turns it on and makes sure to not expose it to the guests, I hope)
Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
>> I'm sorry, I'm not familiar with your L0/L1/L2 terminology (maybe it's
>> before coffee has had time to permeate the brain)
>
> These are standard terminology for guest levels:
> L0 == hypervisor that runs on bare-metal
> L1 == hypervisor that runs as L0 guest.
> L2 == software that runs as L1 guest.
> (We are talking about nested virtualization here)

1. I really really hope that the guests don't use IBRS but use retpoline. at least for Linux, that is going to be the preferred approach.

2. for the CPU, there really is only "bare metal" vs "guest"; all guests are "guests" no matter how deeply nested. so for the language of privilege domains etc, nested guests equal their parent.
Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
On 1/9/2018 7:00 AM, Liran Alon wrote:
> ----- ar...@linux.intel.com wrote:
>> On 1/9/2018 3:41 AM, Paolo Bonzini wrote:
>>> The above ("IBRS simply disables the indirect branch predictor") was
>>> my take-away message from private discussion with Intel. My guess is
>>> that the vendors are just handwaving a spec that doesn't match what
>>> they have implemented, because honestly a microcode update is
>>> unlikely to do much more than an old-fashioned chicken bit. Maybe on
>>> Skylake it does though, since the performance characteristics of IBRS
>>> are so different from previous processors. Let's ask Arjan, who might
>>> have more information about it, and hope he actually can disclose
>>> it...
>>
>> IBRS will ensure that, when set after the ring transition, no earlier
>> branch prediction data is used for indirect branches while IBRS is set
>
> Consider the following scenario:
> 1. L1 runs with IBRS=1 in Ring0.
> 2. L1 restores L2 SPEC_CTRL and enters into L2.
> 3. L1 VMRUN exits into L0, which backs up L1 SPEC_CTRL and enters L2
>    (using same VMCB).
> 4. L2 populates BTB/BHB with values and causes a hypercall which
>    #VMExit into L0.
> 5. L0 backs up L2 SPEC_CTRL and writes IBRS=1.
> 6. L0 restores L1 SPEC_CTRL and enters L1.
> 7. L1 backs up L2 SPEC_CTRL and writes IBRS=1.

I'm sorry, I'm not familiar with your L0/L1/L2 terminology (maybe it's before coffee has had time to permeate the brain)
Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
On 1/9/2018 3:41 AM, Paolo Bonzini wrote:
> The above ("IBRS simply disables the indirect branch predictor") was my
> take-away message from private discussion with Intel. My guess is that
> the vendors are just handwaving a spec that doesn't match what they
> have implemented, because honestly a microcode update is unlikely to do
> much more than an old-fashioned chicken bit. Maybe on Skylake it does
> though, since the performance characteristics of IBRS are so different
> from previous processors. Let's ask Arjan, who might have more
> information about it, and hope he actually can disclose it...

IBRS will ensure that, when set after the ring transition, no earlier branch prediction data is used for indirect branches while IBRS is set. (this is an english summary of two pages of technical spec, so it lacks the language-lawyer precision)

because of this promise, the implementation tends to be impactful, and it is very strongly recommended that retpoline is used instead of IBRS. (with all the caveats already on lkml)

the IBPB is different; that is a convenient thing for switching between VM guests etc.
Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution
> It sounds like Coverity was used to produce these patches? If so, is
> there a plan to have smatch (hey Dan) or other open source static
> analysis tools be possibly enhanced to do a similar type of work?

I'd love for that to happen; the tricky part is being able to have even a sort of sensible concept of "trusted" vs "untrusted" values... if you look at a very small window of code, that does not work well; you likely need to (as a tool) even look across .c file boundaries
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/20/2017 1:11 AM, Thomas Gleixner wrote:
> On Thu, 20 Jul 2017, Li, Aubrey wrote:
>> Don't get me wrong, even if a fast path is acceptable, we still need
>> to figure out if the coming idle is short and when to switch. I'm just
>> worried that if irq timings is not an ideal statistic, we have to skip
>> it too.
>
> There is no ideal solution ever. Lets sit back and look at that from
> the big picture first before dismissing a particular item upfront.
>
> The current NOHZ implementation does:
>
>     predict = nohz_predict(timers, rcu, arch, irqwork);
>     if ((predict - now) > X)
>         stop_tick();
>
> The C-state machinery does something like:
>
>     predict = cstate_predict(next_timer, scheduler);
>     cstate = cstate_select(predict);
>
> That disconnect is part of the problem. What we really want is:
>
>     predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq_timings);

two separate predictors is clearly a recipe for badness. (likewise, C and P states try to estimate "performance sensitivity" and sometimes estimate in opposite directions)

to be honest, performance sensitivity estimation is probably 10x more critical for C state selection than idle duration; a lot of modern hardware will do the energy efficiency stuff in a microcontroller which coordinates C and P states between the multiple cores in the system. (both x86 and ARM have such microcontrollers nowadays, at least for the higher performance designs)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/20/2017 5:50 AM, Paul E. McKenney wrote: To make this work reasonably, you would also need some way to check for the case where the predicted idle time is short but the real idle time is very long. so the case where you predict very short but it is actually "indefinite", the real solution likely is that we set a timer some time in the future (say 100msec, or some other value that is long but not indefinite) where we wake up the system and make a new prediction, since clearly we were insanely wrong in the prediction and should try again. that or we turn the prediction from a single value into a range of (expected, upper bound) where upper bound is likely the next timer or other going-to-happen events.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/18/2017 9:36 AM, Peter Zijlstra wrote: On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote: the most obvious way to do this (for me, maybe I'm naive) is to add another C state, lets call it "C1-lite" with its own thresholds and power levels etc, and just let that be picked naturally based on the heuristics. (if we want to improve the heuristics, that's fine and always welcome but that is completely orthogonal in my mind) C1-lite would then have a threshold < C1, whereas I understood the desire to be for the fast-idle crud to have a larger threshold than C1 currently has. That is, from what I understood, they want C1 selected *longer*. that's just a matter of fixing the C1 and later thresholds to line up right. shrug that's the most trivial thing to do, it's a number in a table. some distros do those tunings anyway when they don't like the upstream tunings
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/18/2017 8:20 AM, Paul E. McKenney wrote: 3.2) how to determine if the idle is short or long. My current proposal is to use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I didn't get the details of an auto-adjust mechanism yet the most obvious way to do this (for me, maybe I'm naive) is to add another C state, lets call it "C1-lite" with its own thresholds and power levels etc, and just let that be picked naturally based on the heuristics. (if we want to improve the heuristics, that's fine and always welcome but that is completely orthogonal in my mind) this C1-lite would then skip some of the idle steps like the nohz logic. How we plumb that ... might end up being a flag or whatever, we'll figure that out easily. as long as "real C1" has a break even time that is appropriate compared to C1-lite, we'll only pick C1-lite for very very short idles like is desired... but we don't end up creating a parallel infra for picking states, that part just does not make sense to me tbh I have yet to see any reason why C1-lite couldn't be just another C-state for everything except the actual place where we do the "go idle" last bit of logic. (Also note that for extreme short idles, today we just spinloop (C0), so by this argument we should also do a C0-lite.. or make this C0 always the lite variant)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:53 PM, Thomas Gleixner wrote: On Mon, 17 Jul 2017, Arjan van de Ven wrote: On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Of course, this all assumes a Gaussian distribution to begin with, if we get bimodal (or worse) distributions we can still get it wrong. To fix that, we'd need to do something better than what we currently have. fwiw some time ago I made a chart for predicted vs actual so you can sort of judge the distribution of things visually Predicted by what? this chart was with the current linux predictor http://git.fenrus.org/tmp/timer.png is what you get if you JUST use the next timer ;-) (which way back linux was doing)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:46 PM, Thomas Gleixner wrote: On Mon, 17 Jul 2017, Arjan van de Ven wrote: On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Now I think the problem is that the current predictor goes for an average idle duration. This means that we, on average, get it wrong 50% of the time. For performance that's bad. that's not really what it does; it looks at next tick and then discounts that based on history; (with different discounts for different order of magnitude) next tick is the worst thing to look at for interrupt heavy workloads as well it was better than what was there before (without discount and without detecting repeated patterns) the next tick (as computed by the nohz code) can be far away, while the I/O interrupts come in at a high frequency. That's where Daniel Lezcano's work of predicting interrupts comes in and that's the right solution to the problem. The core infrastructure has been merged, just the idle/cpufreq users are not there yet. All you need to do is to select CONFIG_IRQ_TIMINGS and use the statistics generated there. yes ;-) also note that the predictor does not need to be perfect, on most systems C states are an order of magnitude apart in terms of power/performance/latency so if you get the general order of magnitude right the predictor is doing its job. (this is not universally true, but physics of power gating/etc tend to drive to this conclusion; the cost of implementing an extra state very close to another state means that the HW folks are unlikely to do the less power saving state of the two to save their cost and testing effort)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Of course, this all assumes a Gaussian distribution to begin with, if we get bimodal (or worse) distributions we can still get it wrong. To fix that, we'd need to do something better than what we currently have. fwiw some time ago I made a chart for predicted vs actual so you can sort of judge the distribution of things visually http://git.fenrus.org/tmp/linux2.png
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Now I think the problem is that the current predictor goes for an average idle duration. This means that we, on average, get it wrong 50% of the time. For performance that's bad. that's not really what it does; it looks at next tick and then discounts that based on history; (with different discounts for different order of magnitude)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/14/2017 8:38 AM, Peter Zijlstra wrote: No, that's wrong. We want to fix the normal C state selection process to pick the right C state. The fast-idle criteria could cut off a whole bunch of available C states. We need to understand why our current C state pick is wrong and amend the algorithm to do better. Not just bolt something on the side. I can see a fast path through selection if you know the upper bound of any selection is just 1 state. But also, how much of this is about "C1 be fast" versus "selecting C1 is slow" a lot of the patches in the thread seem to be about making a lighter/faster C1, which is reasonable (you can even argue we might end up with 2 C1s, one fast one full feature)
Re: [x86/mm] e2a7dcce31: kernel_BUG_at_arch/x86/mm/tlb.c
On 5/27/2017 9:56 AM, Andy Lutomirski wrote: On Sat, May 27, 2017 at 9:00 AM, Andy Lutomirski wrote: On Sat, May 27, 2017 at 6:31 AM, kernel test robot wrote: FYI, we noticed the following commit: commit: e2a7dcce31f10bd7471b4245a6d1f2de344e7adf ("x86/mm: Rework lazy TLB to track the actual loaded mm") https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git x86/tlbflush_cleanup Ugh, there's an unpleasant interaction between this patch and intel_idle. I suspect that the intel_idle code in question is either wrong or pointless, but I want to investigate further. Ingo, can you hold off on applying this patch? I think this is what's going on: intel_idle has an optimization and sometimes calls leave_mm(). This is a rather expensive way of working around x86 Linux's fairly weak lazy mm handling. It also abuses the whole switch_mm state machine. In particular, there's no guarantee that the mm is actually lazy at the time. The old code didn't care, but the new code can oops. The short-term fix is to just reorder the code in leave_mm() to avoid the OOPS. fwiw the reason the code is in intel_idle is to avoid tlb flush IPIs to idle cpus, once the cpu goes into a deep enough idle state. In the current linux code, that is done by no longer having the old TLB live on the CPU, by switching to the neutral kernel-only set of tlbs. If your proposed changes do that (avoid the IPI/wakeup), great! (if not, there should be a way to do that)
Re: [patch 12/18] async: Adjust system_state checks
On 5/14/2017 11:27 AM, Thomas Gleixner wrote: looks good .. ack
Re: [PATCH] use get_random_long for the per-task stack canary
On 5/4/2017 6:32 AM, Daniel Micay wrote: The stack canary is an unsigned long and should be fully initialized to random data rather than only 32 bits of random data. that makes sense to me... ack
Re: [PATCH 5/6] notifiers: Use CHECK_DATA_CORRUPTION() on checks
On 3/22/2017 12:29 PM, Kees Cook wrote: When performing notifier function pointer sanity checking, allow CONFIG_BUG_ON_DATA_CORRUPTION to upgrade from a WARN to a BUG. Additionally enables CONFIG_DEBUG_NOTIFIERS when selecting CONFIG_BUG_ON_DATA_CORRUPTION. Any feedback on this change? By default, this retains the existing WARN behavior... if you're upgrading, is the end point really a panic() ? e.g. do you assume people to also set panic-on-oops?
Re: [PATCH 1/5] x86: Implement __WARN using UD0
On 3/21/2017 8:14 AM, Peter Zijlstra wrote: For self-documentation purposes, maybe use a define for the length of the ud0 instruction? #define TWO 2 ;-) some things make sense as a define, others don't (adding a comment, maybe)
Re: [PATCH] x86/dmi: Switch dmi_remap to ioremap_cache
On 3/9/2017 9:48 AM, Julian Brost wrote: I'm not entirely sure whether it's actually the kernel or HP to blame, but for now, hp-health is completely broken on 4.9 (probably on everything starting from 4.6), so this patch should be reviewed again. it looks like another kernel driver is doing a conflicting mapping. do these HP tools come with their own kernel drivers or are those in the upstream kernel nowadays?
Re: [PATCH] x86: Implement __WARN using UD0
On 2/23/2017 5:28 AM, Peter Zijlstra wrote: By using "UD0" for WARNs we remove the function call and its possible __FILE__ and __LINE__ immediate arguments from the instruction stream. Total image size will not change much, what we win in the instruction stream we'll lose because of the __bug_table entries. Still, saves on I$ footprint and the total image size does go down a bit. well I am a little sceptical; WARNs are rare so the code (other than the test) should be waaay out of line already (unlikely() and co). And I assume you're not removing the __FILE__ and __LINE__ info, since that info is actually high value for us developers... so what are you actually saving? (icache saving is only real if the line that the cold code lives on would actually end up in icache for other reasons; I would hope the compiler puts the out of line code WAY out of line)
Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location
On 1/5/2017 9:54 AM, Thomas Garnier wrote: That's my goal too. I started by doing a RO remap and got couple problems with hibernation. I can try again for the next iteration or delay it for another patch. I also need to look at KVM GDT usage, I am not familiar with it yet. don't we write to the GDT as part of the TLS segment stuff for glibc ?
Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location
On 1/5/2017 8:40 AM, Thomas Garnier wrote: Well, it happens only when KASLR memory randomization is enabled. Do you think it should have a separate config option? no I would want it a runtime option "sgdt from ring 3" is going away with UMIP (and is already possibly gone in virtual machines, see https://lwn.net/Articles/694385/) and for those cases it would be a shame to lose the randomization
Re: [PATCH 1/3] cpuidle/menu: stop seeking deeper idle if current state is too deep
On 1/5/2017 7:43 AM, Rik van Riel wrote: On Thu, 2017-01-05 at 23:29 +0800, Alex Shi wrote: The obsolete commit 71abbbf85 want to introduce a dynamic cstates, but it was removed for long time. Just left the nonsense deeper cstate checking. Since all target_residency and exit_latency are going longer in deeper idle state, no needs to waste some cpu cycle on useless seeking. Makes me wonder if it would be worth documenting the requirement that c-states be listed in increasing order? or better, a boot time quick check...
Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location
On 1/5/2017 12:11 AM, Ingo Molnar wrote: * Thomas Garnier wrote: Each processor holds a GDT in its per-cpu structure. The sgdt instruction gives the base address of the current GDT. This address can be used to bypass KASLR memory randomization. With another bug, an attacker could target other per-cpu structures or deduce the base of the main memory section (PAGE_OFFSET). In this change, a space is reserved at the end of the memory range available for KASLR memory randomization. The space is big enough to hold the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is mapped at a specific offset based on the target CPU. Note that if there is not enough space available, the GDTs are not remapped. The document was changed to mention GDT remapping for KASLR. This patch also includes dump page tables support. This patch was tested on multiple hardware configurations and for hibernation support. void kernel_randomize_memory(void); +void kernel_randomize_smp(void); +void* kaslr_get_gdt_remap(int cpu); Yeah, no fundamental objections from me to the principle, but I get some bad vibes from the naming here: seeing that kernel_randomize_smp() actually makes things less random. kernel_unrandomize_smp() ... one request.. can we make sure this unrandomization is optional?
Re: [PATCH] proc: Fix timerslack_ns CAP_SYS_NICE check when adjusting self
On 8/10/2016 12:03 PM, John Stultz wrote: I wasn't entierly sure. I didn't think PR_SET_TIMERSLACK has a security hook, but looking again I now see the top-level security_task_prctl() check, so maybe not skipping it in this case would be good? the easy fix would be to add back the ptrace check.. just either ptrace-able OR CAP_SYS_NICE ;) then you can prove you only added new stuff as well, and have all the LSM from before
Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface
On 7/14/2016 10:45 AM, Kees Cook wrote: On Thu, Jul 14, 2016 at 9:09 AM, John Stultz wrote: On Thu, Jul 14, 2016 at 5:48 AM, Serge E. Hallyn wrote: Quoting Kees Cook (keesc...@chromium.org): I think the original CAP_SYS_NICE should be fine. A malicious CAP_SYS_NICE process can do plenty of insane things, I don't feel like the timer slack adds to any realistic risks. Can someone give a detailed explanation of what you could do with the new timerslack feature and compare it to what you can do with sys_nice? Looking at the man page for CAP_SYS_NICE, it looks like such a task can set a task as SCHED_FIFO, so they could fork some spinning processes and set them all SCHED_FIFO 99, in effect delaying all other tasks for an infinite amount of time. So one might argue setting large timerslack values isn't that different risk wise? Right -- you can hose a system with CAP_SYS_NICE already; I don't think timerslack realistically changes that. fair enough the worry of being able to time attack things is there already with the SCHED_FIFO so... purist objection withdrawn in favor of the pragmatic
Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface
On 7/14/2016 5:48 AM, Serge E. Hallyn wrote: Can someone give a detailed explanation of what you could do with the new timerslack feature and compare it to what you can do with sys_nice? what you can do with the timerslack feature is add up to 4 seconds of extra time/delay on top of each select()/poll()/nanosleep()/... (basically anything that uses hrtimers on behalf of the user), and then also control within that 4 second window exactly when that extra delay ends (which may help a timing attack kind of scenario)
Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface
On 7/13/2016 8:39 PM, Kees Cook wrote: So I worry I'm a bit stuck here. For general systems, CAP_SYS_NICE is too low a level of privilege to set a tasks timerslack, but apparently CAP_SYS_PTRACE is too high a privilege for Android's system_server to require just to set a tasks timerslack value. So I wanted to ask again if we might consider backing this down to CAP_SYS_NICE, or if we can instead introduce a new CAP_SYS_TIMERSLACK or something to provide the needed in-between capability level. Adding new capabilities appears to not really be viable (lots of threads about this...) I think the original CAP_SYS_NICE should be fine. A malicious CAP_SYS_NICE process can do plenty of insane things, I don't feel like the timer slack adds to any realistic risks. if the result is really as bad as you describe, then that is worse than the impact of this being CAP_SYS_NICE, and thus SYS_TRACE is maybe the purist answer, but not the pragmatic best answer; certainly I don't want to make the overall system security worse. I wonder how much you want to set the slack; one of the options (and I don't know how this will work in the code, if it's horrible don't do it) is to limit how much slack CAP_SYS_NICE can set (say, 50 or 100 msec, e.g. in the order of a "time slice" or two if Linux had time slices, similar to what nice would do) while CAP_SYS_TRACE can set the full 4 seconds. If it makes the code horrible, don't do it and just do SYS_NICE.
Re: [PATCH 1/8] x86: don't use module.h just for AUTHOR / LICENSE tags
On 7/13/2016 5:18 PM, Paul Gortmaker wrote: The Kconfig controlling compilation of these files are: arch/x86/Kconfig.debug:config DEBUG_RODATA_TEST arch/x86/Kconfig.debug: bool "Testcase for the marking rodata read-only" arch/x86/Kconfig.debug:config X86_PTDUMP_CORE arch/x86/Kconfig.debug: def_bool n ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modular infrastructure use, so that when reading the driver there is no doubt it is builtin-only. We delete the MODULE_LICENSE tag etc. since all that information is already contained at the top of the file in the comments. Cc: Arjan van de Ven Acked-by: Arjan van de Ven originally these were tested as modules, but they really shouldn't be modules in the normal kernel (and aren't per Kconfig)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek wrote:
> Umm. I'm not sure if you should be designing kernel...
>
> I have alarm clock application. It does sleep(60) many times till its
> time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> without some very, very good reason.

I'm fairly sure you shouldn't be designing alarm clock applications! Because on busy systems you get random (scheduler) delays added to your timer. Having said that, your example is completely crooked here, sleep() does not use these kernel timers, it uses hrtimers instead. (hrtimers also have slack, but an alarm clock application that is this broken would have the choice to set such slack to 0) What happened here is that these sigtimedwait timers were actually not great; they're just about the only application-visible interface that's still in jiffies/HZ, and in the follow-on patch set, Thomas converted them properly to hrtimers as well to make them both accurate and CONFIG_HZ independent.
Re: [patch V2 00/20] timer: Refactor the timer wheel
so is there really an issue? sounds like KISS principle can apply On Mon, Jun 20, 2016 at 7:46 AM, Thomas Gleixner wrote: > On Mon, 20 Jun 2016, Arjan van de Ven wrote: >> On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote: >> > >> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a >> > 1000HZ >> >option so datacenter folks can use this and people who don't care and >> > want >> >better batching for power can use the 4ms thingy. >> >> >> if there really is one user of such long timers... could we possibly >> make that one robust against early fire of the timer? >> >> eg rule is: if you set timers > 37 hours, you need to cope with early timer >> fire > > The only user I found is networking contrack (5 days). Eric thought its not a > big problem if it fires earlier. > > Thanks, > > tglx >
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote: > > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a 1000HZ >option so datacenter folks can use this and people who don't care and want >better batching for power can use the 4ms thingy. if there really is one user of such long timers... could we possibly make that one robust against early fire of the timer? eg rule is: if you set timers > 37 hours, you need to cope with early timer fire
Re: initialize a mutex into locked state?
On 6/17/2016 7:54 AM, Oleg Drokin wrote: Yes, we can add all sorts of checks that have various impacts on code readability, we can also move code around that also have code readability and CPU impact. But in my discussion with Arjan he said this is a new use case that was not met before and suggested to mail it to the list. I'm all in favor of having "end code" be as clear as possible wrt intent. (and I will admit this is a curious use case, but not an insane silly one) one other option is to make a wrapper: mutex_init_locked() { mutex_init(); mutex_trylock(); } that way the wrapper can be an inline in a header, but doesn't need to touch a wide berth of stuff... while keeping the end code clear wrt intent
Re: [patch V2 00/20] timer: Refactor the timer wheel
>To achieve this capacity with HZ=1000 without increasing the storage size >by another level, we reduced the granularity of the first wheel level from >1ms to 4ms. According to our data, there is no user which relies on that >1ms granularity and 99% of those timers are canceled before expiry. the only likely problem cases are msleep(1) uses... but we could just map those to usleep_range(1000, 2000) (imo we should anyway)
Re: [patch 13/20] timer: Switch to a non cascading wheel
I think there's 2 elements on the interface. 1) having a relative interface to the current time (avoid use of absolute jiffies in drivers) 2) having wallclock units. Making HZ always be 1000 is effectively doing that as well (1 msec after all) On Thu, Jun 16, 2016 at 8:43 AM, Thomas Gleixner wrote:
> On Wed, 15 Jun 2016, Thomas Gleixner wrote:
>> On Wed, 15 Jun 2016, Arjan van de Ven wrote:
>> > what would 1 more timer wheel do?
>>
>> Waste storage space and make the collection of expired timers more expensive.
>>
>> The selection of the timer wheel properties is a combination of:
>>
>> 1) Granularity
>>
>> 2) Storage space
>>
>> 3) Number of levels to collect
>
> So I came up with a slightly different solution for this. The problem case is
> HZ=1000 and again looking at the data, there is no reason why we need actual
> 1ms granularity for timer wheel timers. That's independent of the desired ms
> based interfaces.
>
> We can simply run the wheel internally with 4ms base level resolution and
> degrade from there. That gives us 6 days+ and a simple cutoff at the capacity
> of the 7th level wheel.
>
> Level Offset  Granularity             Range
>  0       0         4 ms              0 ms -        255 ms
>  1      64        32 ms            256 ms -       2047 ms (256ms - ~2s)
>  2     128       256 ms           2048 ms -      16383 ms (~2s - ~16s)
>  3     192      2048 ms (~2s)    16384 ms -     131071 ms (~16s - ~2m)
>  4     256     16384 ms (~16s)  131072 ms -    1048575 ms (~2m - ~17m)
>  5     320    131072 ms (~2m)  1048576 ms -    8388607 ms (~17m - ~2h)
>  6     384   1048576 ms (~17m) 8388608 ms -   67108863 ms (~2h - ~18h)
>  7     448   8388608 ms (~2h) 67108864 ms -  536870911 ms (~18h - ~6d)
>
> That works really nice and has the interesting side effect that we batch in
> the first level wheel which helps networking. I'll repost the series with the
> other review points addressed later tonight.
>
> Btw, I also thought a bit more about the milliseconds interfaces. I think we
> shouldn't invent new interfaces. The correct solution IMHO is to disentangle the
> scheduler tick frequency and jiffies.
> If we have that completely separated
> then we can do the following:
>
> 1) Force HZ=1000. That means jiffies and timer wheel units are 1ms. If the
>    tick frequency is != 1000 we simply increment jiffies in the tick by the
>    proper amount (4 @250 ticks/sec, 10 @100 ticks/sec).
>
>    So all msec_to_jiffies() invocations compile out into nothing magically and
>    we can remove them gradually over time.
>
> 2) When we do that right, we can make the tick frequency a command line option
>    and just have a compiled in default.
>
> Thoughts?
>
> Thanks,
>
> tglx
Re: [patch 13/20] timer: Switch to a non cascading wheel
what would 1 more timer wheel do? On Wed, Jun 15, 2016 at 7:53 AM, Thomas Gleixner wrote: > On Tue, 14 Jun 2016, Eric Dumazet wrote: >> Original TCP RFCs tell timeout is infinite ;) >> >> Practically, conntrack has a 5 days timeout, but I really doubt anyone >> expects an idle TCP flow to stay 'alive' when nothing is sent for 5 >> days. > > So would 37hrs ~= 1.5 days be a reasonable cutoff or will stuff fall apart and > people be surprised? > > Thanks, > > tglx >
Re: [patch 13/20] timer: Switch to a non cascading wheel
evaluating a 120 hours timer every 37 hours to see if it should fire... not too horrid. On Tue, Jun 14, 2016 at 9:28 AM, Thomas Gleixner wrote: > On Tue, 14 Jun 2016, Ingo Molnar wrote: >> * Thomas Gleixner wrote: >> > On Mon, 13 Jun 2016, Peter Zijlstra wrote: >> > > On Mon, Jun 13, 2016 at 08:41:00AM -, Thomas Gleixner wrote: >> > > > + >> > > > + /* Cascading, sigh... */ >> > > >> > > So given that userspace has no influence on timer period; can't we >> > > simply fail to support timers longer than 30 minutes? >> > > >> > > In anything really arming timers _that_ long? >> > >> > Unfortunately yes. Networking being one of those. Real cascading happens >> > once >> > in a blue moon, but it happens. >> >> So I'd really prefer it if we added a few more levels, a hard limit and got >> rid of >> the cascading once and for all! >> >> IMHO 'once in a blue moon' code is much worse than a bit more data overhead. > > I agree. If we add two wheel levels then we end up with: > > HZ 1000: 134217727 ms ~= 37 hours > HZ 250: 536870908 ms ~= 149 hours > HZ 100: 1342177270 ms ~= 372 hours > > Looking through all my data I found exactly one timeout which is insanely > large: 120 hours! > > That's net/netfilter/nf_conntrack_core.c: > setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct); > > Anything else is way below 37 hours. > > Thanks, > > tglx
Re: [patch 04/20] cpufreq/powernv: Initialize timer as pinned
On Mon, Jun 13, 2016 at 1:40 AM, Thomas Gleixner wrote: > mod_timer(&gpstates->timer, jiffies + msecs_to_jiffies(timer_interval)); are you sure this is right? the others did not get replaced by mod_timer().. (and this is more evidence that a relative API in msecs is what drivers really want)
Re: [patch 06/20] drivers/tty/metag_da: Initialize timer as pinned
I know it's not related to this patch, but it'd be nice to, as you're changing the api name anyway, make a mod_pinned_relative() so that more direct users of jiffies can go away... or even better, mod_pinned_relative_ms() so that these drivers also do not need to care about HZ. On Mon, Jun 13, 2016 at 1:40 AM, Thomas Gleixner wrote:
> Pinned timers must carry that attribute in the timer itself. No functional
> change.
>
> Signed-off-by: Thomas Gleixner
> ---
> drivers/tty/metag_da.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/drivers/tty/metag_da.c
> ===================================================================
> --- a/drivers/tty/metag_da.c
> +++ b/drivers/tty/metag_da.c
> @@ -323,12 +323,12 @@ static void dashtty_timer(unsigned long
>  	if (channel >= 0)
>  		fetch_data(channel);
>
> -	mod_timer_pinned(&poll_timer, jiffies + DA_TTY_POLL);
> +	mod_pinned(&poll_timer, jiffies + DA_TTY_POLL);
>  }
>
>  static void add_poll_timer(struct timer_list *poll_timer)
>  {
> -	setup_timer(poll_timer, dashtty_timer, 0);
> +	setup_pinned_timer(poll_timer, dashtty_timer, 0);
>  	poll_timer->expires = jiffies + DA_TTY_POLL;
Re: S3 resume regression [1cf4f629d9d2 ("cpu/hotplug: Move online calls to hotplugged cpu")]
Oh, and this was with acpi_idle. This machine already failed to resume from S3 with intel_idle since forever, as detailed in https://bugzilla.kernel.org/show_bug.cgi?id=107151 but acpi_idle worked fine until now. can you disable (in sysfs) all C states other than C0/C1 and see if that makes it go away? that would point at the problem pretty clearly...
Re: S3 resume regression [1cf4f629d9d2 ("cpu/hotplug: Move online calls to hotplugged cpu")]
On 5/11/2016 3:19 AM, Ville Syrjälä wrote: Oh, and this was with acpi_idle. This machine already failed to resume from S3 with intel_idle since forever, as detailed in https://bugzilla.kernel.org/show_bug.cgi?id=107151 but acpi_idle worked fine until now. this is the important clue part afaics. some of these very old Atoms had issues (bios?) with S3 if the cores were in a too-deep C state, and at some point there was a workaround (I forgot where in the code) to ban those deep C states around S3 on those cpus. I wonder if moving things around has made said workaround ineffective.
Re: [PATCH v5 3/9] x86/head: Move early exception panic code into early_fixup_exception
On 4/4/2016 8:32 AM, Andy Lutomirski wrote: Adding locking would be easy enough, wouldn't it? But do any platforms really boot a second CPU before switching to real printk? Given that I see all the smpboot stuff in dmesg, I guess real printk happens first. I admit I haven't actually checked. adding locking also makes things more fragile in terms of getting the last thing out before you go down in flaming death. until it's a proven problem, this early in boot, getting the message out at all is more important than getting it out perfectly, sometimes.
Re: [PATCH] x86: Enable full randomization on i386 and X86_32.
Arjan, or other folks, can you remember why x86_32 disabled mmap randomization here? There doesn't seem to be a good reason for it that I see. for unlimited stack it got really messy with threaded apps. anyway, I don't mind seeing if this will indeed work, with time running out where 32 bit is going extinct... in a few years we just won't have enough testing on this kind of change anymore.
Re: [PATCH v2] arm64: add alignment fault hanling
On 2/16/2016 10:50 AM, Linus Torvalds wrote: On Tue, Feb 16, 2016 at 9:04 AM, Will Deacon wrote: [replying to self and adding some x86 people] Background: Euntaik reports a problem where userspace has ended up with a memory page mapped adjacent to an MMIO page (e.g. from /dev/mem or a PCI memory bar from someplace in /sys). strncpy_from_user happens with the word-at-a-time implementation, and we end up reading into the MMIO page. how does this work if the adjacent page is not accessible? or has some other magic fault handler, or is on an NFS filesystem where the server is rebooting? isn't the general rule for such basic functions "don't touch memory unless you KNOW it is there" Of course, no actual real program will do that for mixing MMIO and non-MMIO, and so we might obviously add code to always add a guard page for the normal case when a specific address isn't asked for. So as a heuristic to make sure it doesn't happen by mistake it possibly makes sense. but what happens to the read if the page isn't present? or is execute-only or .. or ..
Re: [PATCH] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
and most of the RT guys would only tolerate a little bit of it is there any real/practical use of going longer than 4 seconds? if there is then yeah fixing it makes sense. if it's just theoretical... shrug... 32 bit systems have a bunch of other limits/differences as well. So I'd think it would be mostly theoretical, but in my testing on a VM, setting the timerslack for bash to 10 secs made time sleep 1 take ~10.5 seconds. So its apparently not too hard to coalesce fairly far out (I need to spend a bit more time to verify that events really weren't happening during that time and we're not just doing unnecessary delays with the extra slack). 99% sure you're hitting something else; we look pretty much only 1 ahead in the queue for timers to run to see if they can be run, once we hit a timer that's not ready yet we stop. your 10 second ahead is behind a whole bunch of other not-ready ones so won't even be looked at until its close But yea. My main concern is that if we do a consistent 64bit interface for all arches in the /proc//timerslack_ns interface, it will make PR_GET_TIMERSLACK return incorrect results on 32bit systems when the slack is >= 2^32. or we return UINT_MAX for that case. not too hard.
Re: [PATCH] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
On 2/5/2016 4:51 PM, John Stultz wrote: On Fri, Feb 5, 2016 at 2:35 PM, John Stultz wrote: On Fri, Feb 5, 2016 at 12:50 PM, Andrew Morton wrote: On Fri, 5 Feb 2016 12:44:04 -0800 Kees Cook wrote: Could this be exposed as a writable /proc entry instead? Like the oom_* stuff? /proc/<pid>/timer_slack_ns, guarded by ptrace_may_access(), documented under Documentation/? Yup, that would work. It's there for all architectures from day one and there is precedent. It's not as nice, but /proc nasties will always be with us. Ok. I'll start working on that. Arjan/Thomas: One curious thing I noticed here while writing some documentation. The timer_slack_ns value in the task struct is an unsigned long. So this means PR_SET_TIMERSLACK limits the maximum slack on 32-bit machines to ~4 seconds, where on 64-bit machines it can be quite a bit longer (unreasonably long, really :). originally when we created timerslack, 4 seconds was an eternity and good enough for everyone by a mile... (assumption was practical upper limit being in the 15 msec range) and most of the RT guys would only tolerate a little bit of it is there any real/practical use of going longer than 4 seconds? if there is then yeah fixing it makes sense. if it's just theoretical... shrug... 32 bit systems have a bunch of other limits/differences as well.
Re: [RFC][PATCH v2] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
On 1/25/2016 8:28 PM, John Stultz wrote: From: Ruchi Kandoi This allows power/performance management software to set timer slack for other threads according to its policy for the thread (such as when the thread is designated foreground vs. background activity) Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the slack is set to that value otherwise sets it to the default for the thread. Takes PID of the thread as the third argument. This interface checks that the calling task has permissions to use PTRACE_MODE_ATTACH on the target task, so that we can ensure arbitrary apps do not change the timer slack for other apps. Acked-by: Arjan van de Ven only slight concern is the locking around the value of the field in the task struct, but nobody does read-modify-write on it, so they'll get either the new or the old version, which should be ok. (until now only the local thread would touch the field, and if you're setting it, by definition you're not going to sleep yet, so you're not using the field)
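The PID-targeting variant discussed in this patch did not land in mainline in this form; as a hedged illustration of the underlying knob, the sketch below uses the long-existing self-targeting PR_SET_TIMERSLACK/PR_GET_TIMERSLACK pair (values in nanoseconds, helper name mine), which is what the PID variant extends to other threads:

```c
#include <sys/prctl.h>

/* Sets this thread's timer slack and reads it back via prctl().
 * PR_GET_TIMERSLACK returns the current slack as the return value.
 * Returns -1 if setting fails. */
static long set_and_get_slack(unsigned long slack_ns)
{
	if (prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0))
		return -1;
	return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```

Note that a slack value of 0 resets to the thread's default, mirroring the "if non-zero ... otherwise the default" semantics described above.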
Re: 4.4-rc5: ugly warn on: 5 W+X pages found
On 12/14/2015 11:56 PM, Pavel Machek wrote: On Mon 2015-12-14 13:24:08, Arjan van de Ven wrote: That's weird. The only API to do that seems to be manually setting kmap_prot to _PAGE_KERNEL_EXEC, and nothing does that. (Why is kmap_prot a variable on x86 at all? It has exactly one writer, and that's the code that initializes it in the first place. Shouldn't we #define kmap_prot _PAGE_KERNEL? iirc it changes based on runtime detection of NX capability Huh. Is it possible that core duo is so old that it has no NX? really stupid question I guess, but is PAE on ? (64 bit pagetables are required for NX) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
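Whether the pieces Arjan asks about are present is visible from userspace; a quick sketch (helper name mine) that scans /proc/cpuinfo tokens for flags such as "pae" and "nx" — on 32-bit x86, "nx" requires PAE-format pagetables, which is the point of the question:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if /proc/cpuinfo lists the given token (e.g. "pae", "nx"),
 * 0 if it does not, -1 if the file cannot be read. fscanf() splits the
 * flags line on whitespace, so each flag arrives as its own token. */
static int cpu_has_flag(const char *flag)
{
	FILE *f = fopen("/proc/cpuinfo", "r");
	char word[64];
	int found = 0;

	if (!f)
		return -1;
	while (fscanf(f, "%63s", word) == 1) {
		if (strcmp(word, flag) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```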
Re: 4.4-rc5: ugly warn on: 5 W+X pages found
That's weird. The only API to do that seems to be manually setting kmap_prot to _PAGE_KERNEL_EXEC, and nothing does that. (Why is kmap_prot a variable on x86 at all? It has exactly one writer, and that's the code that initializes it in the first place. Shouldn't we #define kmap_prot _PAGE_KERNEL? iirc it changes based on runtime detection of NX capability
Re: [PATCH 3/4] sched: introduce synchronized idle injection
On 11/18/2015 7:44 AM, Morten Rasmussen wrote: I would not necessarily want to punish all cpus system-wide if we have local overheating in one corner. I would rather have it apply to only the overheating socket in a multi-socket machine and only the big cores in a big.LITTLE system. most of the time thermal issues aren't inside the SOC, but on a system level due to cheap heat spreaders or outright lack of space due to thinness. But even if you have one part of the die too hot: For core level idle injection, no need to synchronize that; the reason to synchronize is generally that when ALL cores are idle, additional power savings kick in (like memory going to self refresh, fabrics power gating etc); those additional power savings are what makes this more efficient than just voltage/frequency scaling at the bottom of that range... not so much the fact that things are just idle.
Re: [PATCH 3/4] sched: introduce synchronized idle injection
On 11/18/2015 12:36 AM, Ingo Molnar wrote: What will such throttling do to latencies, as observed by user-space tasks? What's the typical expected frequency of the throttling that you are targeting? for this to meaningfully reduce power consumption, deep system power states need to be reached, and for those to pay off, several milliseconds of idle are required. for hard realtime stuff that is obviously insane, but I would assume that for those cases your system is thermally fine. This only kicks in at the end of a "we're in thermal problems" path, which can happen both on clients (thin devices) as well as servers (airconditioning issues). The objective for this is to kick in before the hardware built-in protections kick in (which are power off/reboot depending on a bunch of things). The frequency of how often these 5 msec get injected depends on how deep the system is in trouble, and is zero if the system is not in trouble. The idea is that for the user it is better to inject several 5 msec intervals than it is to inject one longer period. You can compare this method to other ways of reducing thermal issues (like lowering cpu frequency), and in a typical setup, this is done after the more benign of those methods are exhausted. Lowering the frequency even further is usually of low efficiency (you need to lower the frequency a LOT to gain a little bit of power in the bottom parts of the frequency range), while this idle will not only put the CPU in low power, but will also put the system memory in low power and usually a big chunk of the rest of the SOC. In many client systems, memory power consumption is higher than CPU power consumption (and in big servers, it's also quite sizable), so there is a pretty hard limit on how much you can do on thermals if you're not also kicking in some of the memory power savings.
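The proportional response described above ("depends on how deep the system is in trouble; zero if not in trouble") can be sketched as a toy model; all numbers here are illustrative, not from the patch:

```c
/* Toy model of proportional idle injection: the deeper the thermal
 * excursion past the limit, the more 5 msec idle periods get injected
 * per 100 msec window; zero when the system is not in trouble.
 * Temperatures in millidegrees C, clamp and scale are illustrative. */
static int idle_periods_per_window(int temp_mC, int limit_mC)
{
	int over = temp_mC - limit_mC;

	if (over <= 0)
		return 0;		/* not in trouble: no injection */
	if (over > 10000)
		over = 10000;		/* clamp at 10 C over the limit */
	return over * 20 / 10000;	/* up to 20 x 5 msec = fully idle */
}
```

The shape matters more than the numbers: many short, spread-out 5 msec intervals rather than one long blackout, ramping smoothly from zero.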
This means that to achieve a certain amount of reduction, the performance is impacted a lot less than with the more drastic methods you would need on the cpu side, if possible at all. (stepping your 2 GHz cpu down to 50 MHz may sound less evil than injecting 5 msec of idle time, but in reality that is impacting user tasks a heck of a lot more than 5 msec of not being scheduled)
Re: [PATCH 2/4] timer: relax tick stop in idle entry
On 11/16/2015 6:53 PM, Paul E. McKenney wrote: Fair point. When in the five-jiffy throttling state, what can wake up a CPU? In an earlier version of this proposal, the answer was "nothing", but maybe that has changed. device interrupts are likely to wake the cpus.
Re: [PATCH 2/4] timer: relax tick stop in idle entry
On 11/16/2015 3:28 PM, Paul E. McKenney wrote: Is this mostly a special-purpose embedded thing, or do you expect distros to be enabling this? If the former, I suggest CONFIG_RCU_NOCB_CPU_ALL, but if distros are doing this for general-purpose workloads, I instead suggest CONFIG_RCU_FAST_NO_HZ. thermal overload happens a lot on small devices, but sadly also in big datacenters where it is not uncommon to underprovision cooling capacity by a bit (it's one of those "99% of the time you only need THIS much, the other 1% you need 30% more" and that more is expensive or even impractical)
Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection
On 11/5/2015 7:32 AM, Jacob Pan wrote: On Thu, 5 Nov 2015 15:33:32 +0100 Peter Zijlstra wrote: On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote: On 11/5/2015 2:09 AM, Peter Zijlstra wrote: I can see such a scheme having a fairly big impact on latency, esp. with forced idleness such as this. That's not going to be popular for many workloads. idle injection is a last ditch effort in thermal management, before this gets used the hardware already has clamped you to a low frequency, reduced memory speeds, probably dimmed your screen etc etc. Just to clarify, the low frequency here is not necessarily the minimum frequency. It is usually the Pe (max efficiency). to translate that from Intelese to English: The system already is at the lowest frequency that's relatively efficient. To go even lower in instant power consumption (e.g. heat) by even a little bit, a LOT of frequency needs to be sacrificed. Idle injection sucks. But it's more efficient (at the point that it would get used) than any other methods, so it also sucks less than those other methods for the same amount of reduction in heat generation. It only gets used if the system HAS to reduce the heat generation, either because it's a mobile device with little cooling capacity, or because the airconditioning in your big datacenter is currently not able to keep up.
Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection
On 11/5/2015 6:33 AM, Peter Zijlstra wrote: On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote: On 11/5/2015 2:09 AM, Peter Zijlstra wrote: I can see such a scheme having a fairly big impact on latency, esp. with forced idleness such as this. That's not going to be popular for many workloads. idle injection is a last ditch effort in thermal management, before this gets used the hardware already has clamped you to a low frequency, reduced memory speeds, probably dimmed your screen etc etc. at this point there are 3 choices 1) Shut off the device 2) do uncoordinated idle injection for 40% of the time 3) do coordinated idle injection for 5% of the time as much as force injecting idle in a synchronized way sucks, the alternatives are worse. OK, it wasn't put that way. I figured it was a way to use less power on any workload with idle time on. so idle injection (as with pretty much every thermal management feature) is NOT a way to save on battery life. Every known method pretty much ends up sacrificing more in terms of performance than you gain in instant power that over time you end up using more (drain battery basically). idle injection, if synchronized, is one of the more effective ones, e.g. give up the least efficiency compared to, say, unsynchronized or even inserting idle cycles in the CPU (T-states)... not even speaking of just turning the system off. That said; what kind of devices are we talking about here; mobile with pitiful heat dissipation? Surely a well designed server or desktop class system should never get into this situation in the first place. a well designed server may not, but the datacenter it is in may. for example if the AC goes out, but also, sometimes the datacenter's peak heat dissipation can exceed the AC capacity (which is outside temperature dependent.. 
yay global warming), which may require an urgent reduction over a series of machines for the duration of the peak load/peak temperature (usually just inserting a little bit, say 1%, over all servers will do) It just grates at me a bit that we have to touch hot paths for such scenarios :/ well we have this as a driver right now that does not touch hot paths, but it seems you and tglx also hate that approach with a passion
Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection
On 11/5/2015 2:09 AM, Peter Zijlstra wrote: I can see such a scheme having a fairly big impact on latency, esp. with forced idleness such as this. That's not going to be popular for many workloads. idle injection is a last ditch effort in thermal management, before this gets used the hardware already has clamped you to a low frequency, reduced memory speeds, probably dimmed your screen etc etc. at this point there are 3 choices 1) Shut off the device 2) do uncoordinated idle injection for 40% of the time 3) do coordinated idle injection for 5% of the time as much as force injecting idle in a synchronized way sucks, the alternatives are worse.
Re: [PATCH 3/3] cpuidle,menu: smooth out measured_us calculation
On 11/3/2015 2:34 PM, r...@redhat.com wrote: Furthermore, for smaller sleep intervals, we know the chance that all the cores in the package went to the same idle state is fairly small. Dividing the measured_us by two, instead of subtracting the full exit latency when hitting a small measured_us, will reduce the error. there is no perfect answer for this issue; but at least this makes the situation a lot better, so Acked-by: Arjan van de Ven
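The correction the changelog describes can be sketched as follows (function and parameter names mine; the real code is in the menu governor):

```c
/* Sketch of the measured_us correction: for a sleep long enough that
 * the package plausibly reached the deep state, subtract the full exit
 * latency; for a short sleep, where that over-corrects, halve
 * measured_us instead. All values in microseconds. */
static unsigned int correct_measured_us(unsigned int measured_us,
					unsigned int exit_latency_us)
{
	if (measured_us > 2 * exit_latency_us)
		return measured_us - exit_latency_us;
	return measured_us / 2;
}
```

The crossover at 2x the exit latency keeps the function continuous-ish and avoids ever correcting a small measurement below zero.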
Re: [PATCH 1/3] cpuidle,x86: increase forced cut-off for polling to 20us
Acked-by: Arjan van de Ven
Re: [PATCH 2/3] cpuidle,menu: use interactivity_req to disable polling
On 11/3/2015 2:34 PM, r...@redhat.com wrote: From: Rik van Riel The menu governor carefully figures out how much time we typically sleep for an estimated sleep interval, or whether there is a repeating pattern going on, and corrects that estimate for the CPU load. Then it proceeds to ignore that information when determining whether or not to consider polling. This is not a big deal on most x86 CPUs, which have very low C1 latencies, and the patch should not have any effect on those CPUs. However, certain CPUs (e.g. Atom) have much higher C1 latencies, and it would be good to not waste performance and power on those CPUs if we are expecting a very low wakeup latency. Disable polling based on the estimated interactivity requirement, not on the time to the next timer interrupt. good catch! Acked-by: Arjan van de Ven
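The decision change described here can be sketched roughly as follows; names and thresholds are illustrative (the 20 us cutoff echoes patch 1/3 in this series), and the real logic lives in the menu governor, not in this form:

```c
/* Sketch: choose polling based on the governor's own load-corrected
 * estimate (interactivity_req) rather than on the raw time to the next
 * timer -- a far-away timer no longer forces C-state entry when the
 * estimate says we will wake up almost immediately. Microseconds. */
static int choose_polling(unsigned int interactivity_req_us,
			  unsigned int c1_target_residency_us)
{
	return interactivity_req_us < c1_target_residency_us ||
	       interactivity_req_us <= 20;
}
```

On a low-C1-latency part the first term rarely fires, matching the "no effect on most x86 CPUs" claim; on a high-C1-latency part it saves the costly C1 entry/exit for sleeps the estimate says will be very short.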
Re: [tip:x86/mm] x86/mm: Warn on W^X mappings
On 10/8/2015 7:57 AM, Borislav Petkov wrote: + pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n"); Do we really want to issue anything here in the success case? IMO, we should be quiet if the check passes and only scream when something's wrong... I would like the success message to be there. From an automated testing perspective (for the distro I work on for example), "the test runs and it fails", "the test runs and it passes" and "the test has not run (because of a bug in the code or config file)" are different outcomes, where the first and third are test failures, but without the pr_info at info level, the 2nd and 3rd are indistinguishable.
Re: [tip:x86/mm] x86/mm: Warn on W^X mappings
On 10/6/2015 2:54 AM, tip-bot for Stephen Smalley wrote: Commit-ID: e1a58320a38dfa72be48a0f1a3a92273663ba6db Gitweb: http://git.kernel.org/tip/e1a58320a38dfa72be48a0f1a3a92273663ba6db Author: Stephen Smalley AuthorDate: Mon, 5 Oct 2015 12:55:20 -0400 Committer: Ingo Molnar CommitDate: Tue, 6 Oct 2015 11:11:48 +0200 x86/mm: Warn on W^X mappings Warn on any residual W+X mappings after setting NX if DEBUG_WX is enabled. Introduce a separate X86_PTDUMP_CORE config that enables the code for dumping the page tables without enabling the debugfs interface, so that DEBUG_WX can be enabled without exposing the debugfs interface. Switch EFI_PGT_DUMP to using X86_PTDUMP_CORE so that it also does not require enabling the debugfs interface. I like it, so Acked-by: Arjan van de Ven I also have/had an old userland script to do similar checks but using the debugfs interface... ... would that be useful to have somewhere more central? http://git.fenrus.org/tmp/i386-check-pagetables.pl
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On 9/21/2015 9:36 AM, Linus Torvalds wrote: On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar wrote: Linus, what's your preference? So quite frankly, is there any reason we don't just implement native_read_msr() as just

unsigned long long native_read_msr(unsigned int msr)
{
	int err;
	unsigned long long val;

	val = native_read_msr_safe(msr, &err);
	WARN_ON_ONCE(err);
	return val;
}

Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be done with it. I don't see the downside. How many msr reads are so critical that the function call overhead would matter? if anything qualifies it'd be switch_to() and friends. note that I'm not entirely happy about the notion of "safe" MSRs. They're safe as in "won't fault". Reading random MSRs isn't a generic safe operation though, but the name sort of gives people the impression that it is. Even with _safe variants, you still need to KNOW the MSR exists (by means of CPUID or similar) unfortunately.
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
On 9/17/2015 8:29 AM, Paolo Bonzini wrote: On 17/09/2015 17:27, Arjan van de Ven wrote: ( We should double check that rdmsr()/wrmsr() results are never left uninitialized, but are set to zero or so, for cases where the return code is not checked. ) It sure looks like native_read_msr_safe doesn't clear the output if the rdmsr fails. I'd suggest to return some poison not just 0... What about 0 + WARN? why 0? 0xdeadbeef or any other pattern (even 0x3636363636) makes more sense (of course also WARN... but most folks don't read dmesg for WARNs) (it's the same thing we do for list or slab poison stuff)
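The poison-on-failure idea is the same pattern as list/slab poisoning; a minimal sketch (the pattern value and names are illustrative, not what the kernel ended up using):

```c
#include <stdint.h>

/* Illustrative poison for a failed MSR read: a recognizable pattern
 * instead of 0, so an unchecked error shows up loudly in debugging
 * rather than silently reading as a plausible zero value. */
#define MSR_FAIL_POISON 0xdeadbeefdeadbeefULL

static uint64_t msr_read_result(int err, uint64_t raw)
{
	return err ? MSR_FAIL_POISON : raw;
}
```

As with slab poison, the value is chosen to be improbable as real data and easy to spot in a register dump or WARN backtrace.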
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
( We should double check that rdmsr()/wrmsr() results are never left uninitialized, but are set to zero or so, for cases where the return code is not checked. ) It sure looks like native_read_msr_safe doesn't clear the output if the rdmsr fails. I'd suggest to return some poison not just 0... less likely to get interesting surprises that are insanely hard to debug/diagnose
Re: V4.0.x fails to create /dev/rtc0 on Winbook TW100 when CONFIG_PINCTRL_BAYTRAIL is set, bisected to commit 7486341
On 7/11/2015 11:26 AM, Porteus Kiosk wrote: Hello Arjan, We need it for setting up the time in the hardware clock through the 'hwclock' command. Thank you. hmm thinking about it after coffee... there is an RTC that can be exposed to userspace. hrmpf. Wonder why it's not there for you
Re: V4.0.x fails to create /dev/rtc0 on Winbook TW100 when CONFIG_PINCTRL_BAYTRAIL is set, bisected to commit 7486341
On 7/11/2015 11:21 AM, Arjan van de Ven wrote: On 7/11/2015 10:59 AM, Larry Finger wrote: On a Winbook TW100 BayTrail tablet, kernel 4.0 and later do not create /dev/rtc0 when CONFIG_PINCTRL_BAYTRAIL is set in the configuration. Removing this option from the config creates a real-time clock; however, it is no longer possible to get the tablet to sleep using the power button. Only complete shutdown works. This problem was bisected to the following commit: in "hardware reduced mode" (e.g. tablets) on Baytrail the RTC is not actually enabled/initialized by the firmware; talking to it may appear to work but it's really not a good idea (and breaks things like suspend/resume etc). (or in other words, many of the legacy PC things are not supposed to be there) what did you want to use rtc0 for?