Re: [PATCH] x86/pci: fix intel_mid_pci.c build error when ACPI is not enabled

2020-08-13 Thread Arjan van de Ven

On 8/13/2020 12:58 PM, Randy Dunlap wrote:

From: Randy Dunlap 

Fix build error when CONFIG_ACPI is not set/enabled by adding
the header file  which contains a stub for the function
in the build error.

../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’:
../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function 
‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? 
[-Werror=implicit-function-declaration]
   acpi_noirq_set();

Signed-off-by: Randy Dunlap 
Cc: Jacob Pan 
Cc: Len Brown 
Cc: Bjorn Helgaas 
Cc: Jesse Barnes 
Cc: Arjan van de Ven 
Cc: linux-...@vger.kernel.org
---
Found in linux-next, but applies to/exists in mainline also.

Alternative.1: X86_INTEL_MID depends on ACPI
Alternative.2: drop X86_INTEL_MID support


at this point I'd suggest Alternative 2; the products that needed that (past
tense, that technology is no longer needed for any newer products) never
shipped in any form where a 4.x or 5.x kernel could work, and they are also
all locked down...



Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time

2019-02-20 Thread Arjan van de Ven

On 2/20/2019 7:35 AM, David Laight wrote:

From: Sent: 16 February 2019 12:56

To: Li, Aubrey

...

The above experiment just confirms what I said: The numbers are inaccurate
and potentially misleading to a large extent when the AVX using task is not
scheduled out for a longer time.


Not only that, they won't detect programs that use AVX-512 but never
context switch with live AVX-512 registers.



you are completely correct in stating that this approach is basically sampling
at a relatively coarse level, and such sampling will give false negatives

the alternative is not sampling, and then not knowing anything at all,
unless you have a better suggestion on how to help find tasks that use AVX-512
in a low-overhead way

(the typical use case is trying to find workloads that use avx512 to help
scheduling those workloads in the future in the cloud orchestrator, for example
to help them favor machines that support avx512 over machines that don't)


Re: [PATCH] x86/speculation: Add document to describe Spectre and its mitigations

2019-01-14 Thread Arjan van de Ven

On 1/14/2019 5:06 AM, Jiri Kosina wrote:

On Mon, 14 Jan 2019, Pavel Machek wrote:


Frankly I'd not call it Meltdown, as it works only on data in the cache,
so the defense is completely different. Seems more like a l1tf
:-).


Meltdown on x86 also seems to work only for data in L1D, but the pipeline
could be constructed in a way that data are actually fetched into L1D
before speculation gives up, which is not the case on ppc (speculation
aborts on L2->L1 propagation IIRC). That's why flushing L1D on ppc is
sufficient, but on x86 it's not.


assuming L1D is not shared between SMT threads obviously :)







Re: [PATCH] x86/speculation: Add document to describe Spectre and its mitigations

2018-12-31 Thread Arjan van de Ven

On 12/31/2018 8:22 AM, Ben Greear wrote:



On 12/21/2018 05:17 PM, Tim Chen wrote:

On 12/21/18 1:59 PM, Ben Greear wrote:

On 12/21/18 9:44 AM, Tim Chen wrote:

Thomas,

Andi and I have made an update to our draft of the Spectre admin guide.
We may be out on Christmas vacation for a while.  But we want to
send it out for everyone to take a look.


Can you add a section on how to compile out all mitigations that have anything
beyond negligible performance impact for those running systems where performance
is more important than security?



If you don't worry about security and performance is paramount, then
boot with "nospectre_v2".  That's explained in the document.


There seem to be lots of different variants of this type of problem.  It was
not clear to me that just doing nospectre_v2 would be sufficient to get back
full performance.

And anyway, I would like to compile the kernel to not need that command-line
option, so I am still interested in what compile options need to be set to
what values...


the cloud people call this scenario "single tenant".. there might be different
"users" in the uid sense, but they're all owned by the same folks


it would not be insane to make a CONFIG_SINGLE_TENANT kind of option under
which we can group these kinds of things
(and likely others)


Re: WARNING in __rcu_read_unlock

2018-12-17 Thread Arjan van de Ven

On 12/17/2018 3:29 AM, Paul E. McKenney wrote:

As does this sort of report on a line that contains simple integer
arithmetic and boolean operations.;-)

Any chance of a bisection?


btw this looks like something caused a stack overflow and thus all the 
weirdness that then happens



Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Arjan van de Ven

On 12/11/2018 3:46 PM, Li, Aubrey wrote:

On 2018/12/12 1:18, Dave Hansen wrote:

On 12/10/18 4:24 PM, Aubrey Li wrote:

The tracking turns on the usage flag at the next context switch of
the task, but requires 3 consecutive context switches with no usage
to clear it. This decay is required because well-written AVX-512
applications are expected to clear this state when not actively using
AVX-512 registers.


One concern about this:  Given a HZ=1000 system, this means that the
flag needs to get scanned every ~3ms.  That's a pretty good amount of
scanning on a system with hundreds or thousands of tasks running around.

How many tasks does this scale to until you're eating up an entire CPU
or two just scanning /proc?



Do we have a real requirement to do this in a practical environment?
AFAIK, 1s or even 5s is good enough in some customers' environments.


maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
time we found the task to use avx? (need to find a good time unit)



Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode

2018-12-04 Thread Arjan van de Ven

On processors with enhanced IBRS support, we recommend setting IBRS to 1
and leaving it set.


Then why doesn't a CPU with EIBRS support actually *default* to '1', with
an opt-out possibility for the OS?


(slightly longer answer)

you can pretty much assume that on these CPUs, IBRS doesn't actually do anything
(e.g. just a scratch bit)

we could debate (and did :-)) for some time what the default value should be
at boot, but it kind of is one of those minor issues that should not hold up
getting things out.

it could well be that the cpus that do this will ship with 1 as default, but
it's hard to guarantee across many products and different CPU vendors when
time was tight.



Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode

2018-12-04 Thread Arjan van de Ven

On processors with enhanced IBRS support, we recommend setting IBRS to 1
and leaving it set.


Then why doesn't a CPU with EIBRS support actually *default* to '1', with
an opt-out possibility for the OS?


the BIOSes could indeed get this set up this way.

do you want to trust the bios to get it right?


Re: [patch 01/24] x86/speculation: Update the TIF_SSBD comment

2018-11-21 Thread Arjan van de Ven

On 11/21/2018 2:53 PM, Borislav Petkov wrote:

On Wed, Nov 21, 2018 at 11:48:41PM +0100, Thomas Gleixner wrote:

Btw, I really do not like the app2app wording. I'd rather go for usr2usr,
but that's kinda horrible as well. But then, all of this is horrible.

Any better ideas?


It needs to have "task isolation" in there somewhere as this is what it
does, practically. But it needs to be more precise as in "isolates the
tasks from influence due to shared hardware." :)



part of the problem is that "sharing" has multiple dimensions: time and space
(e.g. hyperthreading), which makes it hard to find a nice term for it other
than describing who attacks whom



Re: STIBP by default.. Revert?

2018-11-20 Thread Arjan van de Ven

On 11/20/2018 11:27 PM, Jiri Kosina wrote:

On Mon, 19 Nov 2018, Arjan van de Ven wrote:


In the documentation, AMD officially recommends against this by default,
and I can speak for Intel that our position is that as well: this really
must not be on by default.


Thanks for pointing to the AMD doc, it's indeed clearly stated there.

Is there any chance this could perhaps be added to Intel documentation as
well, so that we avoid cases like this in the future?


absolutely that's now already in progress;
the doc publishing process is a bit on the long side unfortunately so it won't
be today ;)


Re: Re: STIBP by default.. Revert?

2018-11-18 Thread Arjan van de Ven

On 11/19/2018 6:00 AM, Linus Torvalds wrote:

On Sun, Nov 18, 2018 at 1:49 PM Jiri Kosina  wrote:



So why do that STIBP slow-down by default when the people who *really*
care already disabled SMT?


BTW for them, there is no impact at all.


Right. People who really care about security and are anal about it do
not see *any* advantage of the patch.


In the documentation, AMD officially recommends against this by default, and I 
can
speak for Intel that our position is that as well: this really must not be on 
by default.

STIBP and its friends are there as tools, and were created early on as big 
hammers because
that is all that one can add in a microcode update.. expensive big hammers.

In some ways it's analogous to the "disable caches" bit in CR0. Sure, it's
there as a big hammer, but you don't always set it just because caches could
be used for a side channel.

Using these tools much more surgically is fine, if a paranoid task wants it for 
example,
or when you know you are doing a hard core security transition. But always on? 
Yikes.



Re: [RFC PATCH v1 2/2] proc: add /proc//thread_state

2018-11-12 Thread Arjan van de Ven




I'd prefer the kernel to do such clustering...


I think that is a next step.

Also, while the kernel can do this at a best effort basis, it cannot
take into account things the kernel doesn't know about, like high
priority job peak load etc.., things a job scheduler would know.

Then again, a job scheduler would likely already know about the AVX
state anyway.


the job scheduler can guess.
unless it can also *measure* it won't know for sure...

so even in that scenario having a decent way to report actuals is useful








Re: [RFC] x86, tsc: Add kcmdline args for skipping tsc calibration sequences

2018-07-13 Thread Arjan van de Ven

On 7/13/2018 12:19 PM, patrickg wrote:

This RFC patch is intended to allow bypassing the CPUID, MSR and QuickPIT
calibration methods should the user desire to.

The current ordering in ML x86 tsc is to calibrate in the order listed above,
returning whenever there's a successful calibration.  However there are certain
BIOS/HW designs for overclocking that cause the TSC to change along with the
max core clock; and simple 'trusting' calibration methodologies will lead to
the TSC running 'faster' and, eventually, TSC instability.




that would be a real violation of the contract between CPU and OS: the TSC is
not supposed to change for the duration of the boot


I only know that there's a use-case for me to want to be able to skip CPUID 
calibration, however I included args for skipping all the rest just so that all 
functionality is covered in the long run instead of just one use-case.


wouldn't it be better to start the detailed calibration with the value from 
CPUID instead; that way we also properly calibrate spread spectrum etc...

I thought we switched to that recently to be honest...


Re: [RFC][PATCH] x86: proposed new ARCH_CAPABILITIES MSR bit for RSB-underflow

2018-02-16 Thread Arjan van de Ven

On 2/16/2018 11:43 AM, Linus Torvalds wrote:

On Fri, Feb 16, 2018 at 11:38 AM, Linus Torvalds
 wrote:


Of course, your patch still doesn't allow for "we claim to be skylake
for various other independent reasons, but the RSB issue is fixed".


.. maybe nobody ever has a reason to do that, though?


yeah I would be extremely surprised


Who knows, virtualization people may simply want the user to specify
the model, but then make the Spectre decisions be based on actual
hardware capabilities (whether those are "current" or "some minimum
base").


once you fake being a skylake when you're not, you do that for a reason; normally
that reason is that you COULD migrate to a skylake.
(and migration is not supposed to be visible to the guest OS)

and at that point you are a skylake for all intents and purposes.

(and the virtualization people also really hate it when the hardware
bursts the bubble of this faking by being not what it pretends to be)


Re: [PATCH] platform/x86: intel_turbo_max_3: Remove restriction for HWP platforms

2018-02-14 Thread Arjan van de Ven

On 2/14/2018 11:29 AM, Andy Shevchenko wrote:

On Mon, Feb 12, 2018 at 9:50 PM, Srinivas Pandruvada
 wrote:

On systems supporting HWP (Hardware P-States) mode, we expected to
enumerate core priority via ACPI-CPPC tables. Unfortunately deployment of
TURBO 3.0 didn't use this method to show core priority. So users are not
able to utilize this feature in HWP mode.

So remove the loading restriction of this driver for HWP enabled systems.
Even if there are some systems, which are providing the core priority via
ACPI CPPC, this shouldn't cause any conflict as the source of priority
definition is same.



Pushed to my review and testing queue, thanks!

P.S. Should it go to stable?


older stable at least did not have the problem





Re: [PATCH 4.9 43/92] x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown

2018-02-13 Thread Arjan van de Ven


So, any hints on what you think should be the correct fix here?


the patch sure looks correct to me, it now has a nice table for CPU IDs
including all of AMD (and soon hopefully the existing Intel ones that are not 
exposed to meltdown)





Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-31 Thread Arjan van de Ven

On 1/31/2018 2:15 AM, Thomas Gleixner wrote:


Good luck with making all that work.


on the Intel side we're checking what we can do that works and doesn't break
things right now; hopefully we just end up with a bit in the arch capabilities
MSR for "you should do RSB stuffing" and then the HV's can emulate that.

(people sometimes think that should be a 5 minute thing, but we need to check
many cpu models/etc to make sure a bit we pick is really free etc which makes
it take longer than some folks have patience for)




Re: [PATCH] x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel

2018-01-30 Thread Arjan van de Ven

On 1/30/2018 5:11 AM, Borislav Petkov wrote:

On Tue, Jan 30, 2018 at 01:57:21PM +0100, Thomas Gleixner wrote:

So much for the theory. That's not going to work. If the boot cpu has the
feature then the alternatives will have been applied. So even if the flag
mismatch can be observed when a secondary CPU comes up the outcome will be
access to a non existing MSR and #GP.


Yes, with mismatched microcode we're f*cked.


I think in the super early days of SMP there was an occasional broken BIOS.
(and when Linux then did the ucode update it was sane again)

Not for a long time now though (I think the various certification suites
check for it these days)



So my question is: is there such microcode out there or is this
something theoretical which we want to address?


at this point it's insanely theoretical; no OS can actually cope with this, so
if you're an OEM selling this, your customer can run zero OSes ;-)




(.. and addressing this will be ugly, no matter what.)

And if I were able to wish, I'd like to blacklist that microcode in
dracut so that it doesn't come anywhere near my system.


I'm not sure what you'd want dracut to do... panic() the system
on such a bios?



Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-30 Thread Arjan van de Ven

On 1/29/2018 7:32 PM, Linus Torvalds wrote:

On Mon, Jan 29, 2018 at 5:32 PM, Arjan van de Ven  wrote:


the most simple solution is that we set the internal feature bit in Linux
to turn on the "stuff the RSB" workaround if we're on a SKL *or* as a guest
in a VM.


That sounds reasonable.

However, wouldn't it be even better to extend on the current cpuid
model, and actually have some real architectural bits in there.

Maybe it could be a bit in that IA32_ARCH_CAPABILITIES MSR. Say, add a
bit #2 that says "ret falls back on BTB".

Then that bit basically becomes the "Skylake bit". Hmm?


we can try to do that, but existing systems don't have that, and then we
get in another long thread here about weird lists of stuff ;-)



Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-29 Thread Arjan van de Ven

On 1/29/2018 4:23 PM, Linus Torvalds wrote:


Why do you even _care_ about the guest, and how it acts wrt Skylake?
What you should care about is not so much the guests (which do their
own thing) but protect guests from each other, no?


the most simple solution is that we set the internal feature bit in Linux
to turn on the "stuff the RSB" workaround if we're on a SKL *or* as a guest
in a VM.

The stuffing is not free, but it's also not insane either... so if it's turned 
on in guests,
the impact is still limited, while bare metal doesn't need it at all


Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-29 Thread Arjan van de Ven

On 1/29/2018 12:42 PM, Eduardo Habkost wrote:

The question is how the hypervisor could tell that to the guest.
If Intel doesn't give us a CPUID bit that can be used to tell
that retpolines are enough, maybe we should use a hypervisor
CPUID bit for that?


the objective is to have retpoline be safe everywhere and never use IBRS
(Linus was also pretty clear about that) so I'm confused by your question


Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation

2018-01-26 Thread Arjan van de Ven

On 1/26/2018 10:11 AM, David Woodhouse wrote:


I am *actively* ignoring Skylake right now. This is about per-SKL
userspace even with SMEP, because we think Intel's document lies to us.


if you think we lie to you then I think we're done with the conversation?

Please tell us then what you deploy in AWS for your customers ?

or show us research that shows we lied to you?


Re: [PATCH v3 5/6] x86/pti: Do not enable PTI on processors which are not vulnerable to Meltdown

2018-01-26 Thread Arjan van de Ven

On 1/26/2018 7:27 AM, Dave Hansen wrote:

On 01/26/2018 04:14 AM, Yves-Alexis Perez wrote:

I know we'll still be able to manually enable PTI with a command line option,
but it's also a hardening feature which has the nice side effect of emulating
SMEP on CPU which don't support it (e.g the Atom boxes above).


For Meltdown-vulnerable systems, it's a no brainer: pti=on.  The
vulnerability there is just too much.

But, if we are going to change the default, IMNHO, we need a clear list
of what SMEP emulation mitigates and where.  RSB-related Variant 2 stuff
on Atom where the kernel speculatively 'ret's back to userspace is
certainly a concern.  But, there's a lot of other RSB stuffing that's
going on that will mitigate that too.

Were you thinking of anything concrete?


not Atom though. Atom has had SMEP for a very long time; at least the ones
that do speculation do, afaict.

SMEP is for other bugs (dud kernel function pointer) and for that,
emulating SMEP is an interesting opt-in for sure.





Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

2018-01-25 Thread Arjan van de Ven


This patch tries to address the case when we do switch to init_mm and back.
Do you still have objections to the approach in this patch
to save the last active mm before switching to init_mm?


how do you know the last active mm did not go away and get reused by a new
process with new content?
(other than taking a reference, which has other side effects)
(other than taking a reference which has other side effects)


Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

2018-01-25 Thread Arjan van de Ven

The idea is simple, do what we do for virt. Don't send IPI's to CPUs
that don't need them (in virt's case because the vCPU isn't running, in
our case because we're not in fact running a user process), but mark the
CPU as having needed a TLB flush.


I am really uncomfortable with that idea.
You really can't run code safely on a cpu where the TLBs in the CPU are invalid
or where a CPU that does (partial) page walks would install invalid PTEs either
through actual or through speculative execution.

(in the virt case there's a cheat, since the code is not actually running
there isn't a cpu with TLBs live. You can't do that same cheat for this case)



Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

2018-01-25 Thread Arjan van de Ven

On 1/25/2018 5:50 AM, Peter Zijlstra wrote:

On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:


This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.

Please clarify how you think it does.



the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.


The intel_idle code does, not the idle code. This is squirreled away in
some driver :/


afaik (but haven't looked in a while) acpi drivers did too



(trust me, that you really want, sequentially IPI's a pile of cores in a deep 
sleep
state to just flush a tlb that's empty, the performance of that is horrific)


Hurmph. I'd rather fix that some other way than leave_mm(), this is
piling special on special.


the problem was tricky. but of course if something better is possible, let's
figure this out

problem is that an IPI to an idle cpu is both power-inefficient and will take
time; exit of a deep C state can be in the 50 to 100 usec range (it varies by
many things, but for abstractly thinking about the problem one should
generally round up to nice round numbers)

if you have say 64 cores that had the mm at some point, but 63 are idle, the
64th really does not want to IPI each of those 63 serially (technically this
does not need to be serial, but IPI code is tricky, and some things end up
serializing it a bit) to take the 100 usec hit 63 times. Actually, even if
it's not serialized, even ONE hit of 100 usec is unpleasant.

so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a
general objective.

but not getting flush IPIs is only safe if the TLBs in the CPU hold nothing
that such an IPI would want to flush, so the TLB needs to be empty of those
things.

the only way to do THAT is to switch to an mm that is safe; a leave_mm() does
this, but I'm sure other options exist.

note: While a CPU that is in a deeper C state will itself flush the TLB, you
don't know if you will actually enter that deep a state at the time of making
OS decisions (if an interrupt comes in the cycle before mwait, mwait becomes
a nop, for example). In addition, once you wake up, you don't want the CPU to
go start filling the TLBs with invalid data, so you can't really just set a
bit and flush after leaving idle


Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

2018-01-25 Thread Arjan van de Ven


This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.

Please clarify how you think it does.



the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.

(trust me, that you really want, sequentially IPI's a pile of cores in a deep 
sleep
state to just flush a tlb that's empty, the performance of that is horrific)


Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure

2018-01-24 Thread Arjan van de Ven

On 1/24/2018 1:10 AM, Greg Kroah-Hartman wrote:



That means the whitelist ends up basically empty right now. Should I
add a command line parameter to override it? Otherwise we end up having
to rebuild the kernel every time there's a microcode release which
covers a new CPU SKU (which is why I kind of hate the whitelist, but
Arjan is very insistent...)


Ick, no, whitelists are a pain for everyone involved.  Don't do that
unless it is absolutely the only way it will ever work.

Arjan, why do you think this can only be done as a whitelist?


I suggested a minimum version list for those cpus that need it.

microcode versions are tricky (and we've released betas etc. with their own
numbers), and as a result there might be several numbers that have those
issues with their IBRS for the same F/M/S





Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process

2018-01-21 Thread Arjan van de Ven

On 1/21/2018 8:21 AM, Ingo Molnar wrote:



So if it's only about the scheduler barrier, what cycle cost are we talking 
about
here?



in the order of 5000 to 10000 cycles.
(depends a bit on the cpu generation, but this range is a reasonable
approximation)




Because putting something like this into an ELF flag raises the question of who 
is
allowed to set the flag - does a user-compiled binary count? If yes then it 
would
be a trivial thing for local exploits to set the flag and turn off the barrier.


the barrier is about who you go TO, e.g. the thing under attack.
as you say, making it depend on the thing that would be the evil one does not work.



Re: kexec reboot fails with extra wbinvd introduced for AMD SME

2018-01-17 Thread Arjan van de Ven

Does anybody have any other ideas?


the only other weird case that comes to mind: what happens if there's a line
dirty in the caches, but the memory is now mapped uncached? (Which could
happen if kexec does muck with MTRRs, CR0 or other similar things in weird
ways)... not sure what happens in the CPU; a machine check for cache
inclusion violations is not beyond the imagination and might be lethal.

this would explain a kexec-specific angle versus general normal (but rare)
use of wbinvd.


the other weird case could be cached MMIO (not common, but some GPUs and the
like can do it) with the iommu/VT-d in the middle, and during kexec VT-d
shutting down the iommu before the wbinvd. This would be... highly odd... but
this report already is in highly odd space.


Re: kexec reboot fails with extra wbinvd introduced for AMD SME

2018-01-17 Thread Arjan van de Ven


Does anybody have any other ideas?


wbinvd is thankfully not common, but also not rare (MTRR setup and a bunch of 
other cases)
and in some other operating systems it happens even more than on Linux.. it's 
generally not totally broken like this.

I can only imagine a machine check case where a write back to a bad cell causes 
some parity error
or something...  but it's odd that no other machine checks are reported?
(can the user check for this please)


Re: [tip:x86/pti] x86/retpoline: Fill RSB on context switch for affected CPUs

2018-01-15 Thread Arjan van de Ven


This would means that userspace would see return predictions based
on the values the kernel 'stuffed' into the RSB to fill it.

Potentially this leaks a kernel address to userspace.


KASLR pretty much died in May this year to be honest with the KAISER paper (if 
not before then)

also with KPTI the address won't have a TLB mapping so it wouldn't
actually be speculated into.


Re: [PATCH 3/8] kvm: vmx: pass MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD down to the guest

2018-01-10 Thread Arjan van de Ven

On 1/10/2018 5:20 AM, Paolo Bonzini wrote:

* a simple specification that does "IBRS=1 blocks indirect branch
prediction altogether" would actually satisfy the specification just as
well, and it would be nice to know if that's what the processor actually
does.


it doesn't exactly, not for all.

so you really do need to write ibrs again.



Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU

2018-01-09 Thread Arjan van de Ven

On 1/9/2018 8:17 AM, Paolo Bonzini wrote:

On 09/01/2018 16:19, Arjan van de Ven wrote:

On 1/9/2018 7:00 AM, Liran Alon wrote:


- ar...@linux.intel.com wrote:


On 1/9/2018 3:41 AM, Paolo Bonzini wrote:

The above ("IBRS simply disables the indirect branch predictor") was my
take-away message from private discussion with Intel.  My guess is that
the vendors are just handwaving a spec that doesn't match what they have
implemented, because honestly a microcode update is unlikely to do much
more than an old-fashioned chicken bit.  Maybe on Skylake it does
though, since the performance characteristics of IBRS are so different
from previous processors.  Let's ask Arjan who might have more
information about it, and hope he actually can disclose it...


IBRS will ensure that, when set after the ring transition, no earlier
branch prediction data is used for indirect branches while IBRS is
set


Let me ask you my questions, which are independent of L0/L1/L2 terminology.

1) Is vmentry/vmexit considered a ring transition, even if the guest is
running in ring 0?  If IBRS=1 in the guest and the host is using IBRS,
the host will not do a wrmsr on exit.  Is this safe for the host kernel?


I think the CPU folks would want us to write the msr again.



2) How will the future processors work where IBRS should always be =1?


IBRS=1 should be "fire and forget this ever happened".
This is the only time anyone should use IBRS in practice
(and then the host turns it on and makes sure to not expose it to the guests I 
hope)


Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU

2018-01-09 Thread Arjan van de Ven



I'm sorry I'm not familiar with your L0/L1/L2 terminology
(maybe it's before coffee has had time to permeate the brain)


These are standard terminology for guest levels:
L0 == hypervisor that runs on bare-metal
L1 == hypervisor that runs as L0 guest.
L2 == software that runs as L1 guest.
(We are talking about nested virtualization here)


1. I really really hope that the guests don't use IBRS but use retpoline. At
least for Linux that is going to be the preferred approach.

2. For the CPU, there really is only "bare metal" vs "guest"; all guests are
"guests" no matter how deeply nested. So for the language of privilege
domains etc., nested guests equal their parent.


Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU

2018-01-09 Thread Arjan van de Ven

On 1/9/2018 7:00 AM, Liran Alon wrote:


- ar...@linux.intel.com wrote:


On 1/9/2018 3:41 AM, Paolo Bonzini wrote:

The above ("IBRS simply disables the indirect branch predictor") was my
take-away message from private discussion with Intel.  My guess is that
the vendors are just handwaving a spec that doesn't match what they have
implemented, because honestly a microcode update is unlikely to do much
more than an old-fashioned chicken bit.  Maybe on Skylake it does
though, since the performance characteristics of IBRS are so different
from previous processors.  Let's ask Arjan who might have more
information about it, and hope he actually can disclose it...


IBRS will ensure that, when set after the ring transition, no earlier
branch prediction data is used for indirect branches while IBRS is
set


Consider the following scenario:
1. L1 runs with IBRS=1 in Ring0.
2. L1 restores L2 SPEC_CTRL and enters into L2.
3. L1 VMRUN exits into L0 which backups L1 SPEC_CTRL and enters L2 (using same 
VMCB).
4. L2 populates BTB/BHB with values and cause a hypercall which #VMExit into L0.
5. L0 backups L2 SPEC_CTRL and writes IBRS=1.
6. L0 restores L1 SPEC_CTRL and enters L1.
7. L1 backups L2 SPEC_CTRL and writes IBRS=1.



I'm sorry I'm not familiar with your L0/L1/L2 terminology
(maybe it's before coffee has had time to permeate the brain)




Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU

2018-01-09 Thread Arjan van de Ven

On 1/9/2018 3:41 AM, Paolo Bonzini wrote:

The above ("IBRS simply disables the indirect branch predictor") was my
take-away message from private discussion with Intel.  My guess is that
the vendors are just handwaving a spec that doesn't match what they have
implemented, because honestly a microcode update is unlikely to do much
more than an old-fashioned chicken bit.  Maybe on Skylake it does
though, since the performance characteristics of IBRS are so different
from previous processors.  Let's ask Arjan who might have more
information about it, and hope he actually can disclose it...


IBRS will ensure that, when set after the ring transition, no earlier
branch prediction data is used for indirect branches while IBRS is set

(this is a english summary of two pages of technical spec so it lacks
the language lawyer precision)

because of this promise, the implementation tends to be impactful
and it is very strongly recommended that retpoline is used instead of IBRS.
(with all the caveats already on lkml)

the IBPB is different, this is a convenient thing for switching between VM
guests etc





Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-06 Thread Arjan van de Ven

It sounds like Coverity was used to produce these patches? If so, is
there a plan to have smatch (hey Dan) or other open source static
analysis tool be possibly enhanced to do a similar type of work?


I'd love for that to happen; the tricky part is being able to have even a
sort of sensible concept of "trusted" vs "untrusted" value...

if you look at a very small window of code, that does not work well;
you likely need to even look (as tool) across .c file boundaries




Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Arjan van de Ven

On 7/20/2017 1:11 AM, Thomas Gleixner wrote:

On Thu, 20 Jul 2017, Li, Aubrey wrote:

Don't get me wrong, even if a fast path is acceptable, we still need to
figure out if the coming idle is short and when to switch. I'm just worried
about if irq timings is not an ideal statistics, we have to skip it too.


There is no ideal solution ever.

Lets sit back and look at that from the big picture first before dismissing
a particular item upfront.

The current NOHZ implementation does:

predict = nohz_predict(timers, rcu, arch, irqwork);

if ((predict - now) > X)
stop_tick()

The C-State machinery does something like:

predict = cstate_predict(next_timer, scheduler);

cstate = cstate_select(predict);

That disconnect is part of the problem. What we really want is:

predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq timings);


two separate predictors is clearly a recipe for badness.

(likewise, C and P states try to estimate "performance sensitivity" and 
sometimes estimate in opposite directions)



to be honest, performance sensitivity estimation is probably 10x more critical
for C state selection than idle duration; a lot of modern hardware will do the
energy efficiency stuff in a microcontroller when it coordinates between
multiple cores in the system on C and P states.

(both x86 and ARM have such a microcontroller nowadays, at least for the
higher performance designs)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Arjan van de Ven

On 7/20/2017 5:50 AM, Paul E. McKenney wrote:

To make this work reasonably, you would also need some way to check for
the case where the prediction idle time is short but the real idle time
is very long.


so for the case where you predict very short but the idle is actually
"indefinite", the real solution likely is that we set a timer some time in
the future (say 100msec, or some other value that is long but not indefinite)
where we wake up the system and make a new prediction, since clearly we were
insanely wrong in the prediction and should try again.

that or we turn the prediction from a single value into a range of
(expected, upper bound)

where upper bound is likely the next timer or other going-to-happen events.
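
The (expected, upper bound) idea can be sketched as a tiny helper; this is
purely illustrative (the names and the backstop policy are my invention,
not anything from the thread):

```c
/* Toy sketch of a range-based idle prediction: alongside the expected
 * idle length we carry an upper bound (e.g. the next hard timer), and
 * when the two disagree wildly we arm a backstop wakeup so the system
 * can re-predict instead of staying wrong indefinitely. */
struct idle_prediction {
    long long expected_us;
    long long upper_bound_us;
};

/* Return the time (in us from now) at which to arm a backstop wakeup
 * timer, or 0 if the prediction range is tight enough to trust. */
long long backstop_us(struct idle_prediction p, long long max_slack_us)
{
    if (p.upper_bound_us - p.expected_us > max_slack_us)
        return p.expected_us + max_slack_us;   /* wake up and re-predict */
    return 0;                                  /* no backstop needed */
}
```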






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Arjan van de Ven

On 7/18/2017 9:36 AM, Peter Zijlstra wrote:

On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote:


the most obvious way to do this (for me, maybe I'm naive) is to add another
C state, lets call it "C1-lite" with its own thresholds and power levels etc,
and just let that be picked naturally based on the heuristics.
(if we want to improve the heuristics, that's fine and always welcome but that
is completely orthogonal in my mind)


C1-lite would then have a threshold < C1, whereas I understood the
desire to be for the fast-idle crud to have a larger threshold than C1
currently has.

That is, from what I understood, they want C1 selected *longer*.


that's just a matter of fixing the C1 and later thresholds to line up right.

shrug that's the most trivial thing to do, it's a number in a table.

some distros do those tunings anyway when they don't like the upstream tunings



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Arjan van de Ven

On 7/18/2017 8:20 AM, Paul E. McKenney wrote:

3.2) how to determine if the idle is short or long. My current proposal is to
use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I
didn't get the details of an auto-adjust mechanism yet



the most obvious way to do this (for me, maybe I'm naive) is to add another
C state, lets call it "C1-lite" with its own thresholds and power levels etc,
and just let that be picked naturally based on the heuristics.
(if we want to improve the heuristics, that's fine and always welcome but that
is completely orthogonal in my mind)

this C1-lite would then skip some of the idle steps like the nohz logic. How we
plumb that ... might end up being a flag or whatever, we'll figure that out
easily.

as long as "real C1" has a break even time that is appropriate compared to
C1-lite, we'll only pick C1-lite for very very short idles like is desired...
but we don't end up creating a parallel infra for picking states; that part
just does not make sense to me tbh. I have yet to see any reason why C1-lite
couldn't be just another C-state for everything except the actual place where
we do the "go idle" last bit of logic.

(Also note that for extreme short idles, today we just spinloop (C0), so by
this argument we should also do a C0-lite.. or make this C0 always the lite
variant)






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:53 PM, Thomas Gleixner wrote:

On Mon, 17 Jul 2017, Arjan van de Ven wrote:

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Of course, this all assumes a Gaussian distribution to begin with, if we
get bimodal (or worse) distributions we can still get it wrong. To fix
that, we'd need to do something better than what we currently have.



fwiw some time ago I made a chart for predicted vs actual so you can sort
of judge the distribution of things visually


Predicted by what?


this chart was with the current linux predictor

http://git.fenrus.org/tmp/timer.png is what you get if you JUST use the next 
timer ;-)
(which way back linux was doing)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:46 PM, Thomas Gleixner wrote:

On Mon, 17 Jul 2017, Arjan van de Ven wrote:

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Now I think the problem is that the current predictor goes for an
average idle duration. This means that we, on average, get it wrong 50%
of the time. For performance that's bad.


that's not really what it does; it looks at next tick
and then discounts that based on history;
(with different discounts for different order of magnitude)


next tick is the worst thing to look at for interrupt heavy workloads as


well it was better than what was there before (without discount and without
detecting repeated patterns)


the next tick (as computed by the nohz code) can be far away, while the I/O
interrupts come in at a high frequency.

That's where Daniel Lezcanos work of predicting interrupts comes in and
that's the right solution to the problem. The core infrastructure has been
merged, just the idle/cpufreq users are not there yet. All you need to do
is to select CONFIG_IRQ_TIMINGS and use the statistics generated there.



yes ;-)

also note that the predictor does not need to be perfect; on most systems
C states are an order of magnitude apart in terms of power/performance/latency,
so if you get the general order of magnitude right the predictor is doing
its job.

(this is not universally true, but the physics of power gating etc tend to
drive to this conclusion; the cost of implementing an extra state very close
to another state means that the HW folks are unlikely to do the less power
saving state of the two, to save their cost and testing effort)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Of course, this all assumes a Gaussian distribution to begin with, if we
get bimodal (or worse) distributions we can still get it wrong. To fix
that, we'd need to do something better than what we currently have.



fwiw some time ago I made a chart for predicted vs actual so you can sort
of judge the distribution of things visually

http://git.fenrus.org/tmp/linux2.png



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Now I think the problem is that the current predictor goes for an
average idle duration. This means that we, on average, get it wrong 50%
of the time. For performance that's bad.


that's not really what it does; it looks at next tick
and then discounts that based on history;
(with different discounts for different order of magnitude)




Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-14 Thread Arjan van de Ven

On 7/14/2017 8:38 AM, Peter Zijlstra wrote:

No, that's wrong. We want to fix the normal C state selection process to
pick the right C state.

The fast-idle criteria could cut off a whole bunch of available C
states. We need to understand why our current C state pick is wrong and
amend the algorithm to do better. Not just bolt something on the side.


I can see a fast path through selection if you know the upper bound of any
selection is just 1 state.

But also, how much of this is about "C1 be fast" versus "selecting C1 is slow"?

a lot of the patches in the thread seem to be about making a lighter/faster C1,
which is reasonable (you can even argue we might end up with 2 C1s, one fast
and one full featured)




Re: [x86/mm] e2a7dcce31: kernel_BUG_at_arch/x86/mm/tlb.c

2017-05-30 Thread Arjan van de Ven

On 5/27/2017 9:56 AM, Andy Lutomirski wrote:

On Sat, May 27, 2017 at 9:00 AM, Andy Lutomirski  wrote:

On Sat, May 27, 2017 at 6:31 AM, kernel test robot
 wrote:


FYI, we noticed the following commit:

commit: e2a7dcce31f10bd7471b4245a6d1f2de344e7adf ("x86/mm: Rework lazy TLB to track 
the actual loaded mm")
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git x86/tlbflush_cleanup


Ugh, there's an unpleasant interaction between this patch and
intel_idle.  I suspect that the intel_idle code in question is either
wrong or pointless, but I want to investigate further.  Ingo, can you
hold off on applying this patch?


I think this is what's going on: intel_idle has an optimization and
sometimes calls leave_mm().  This is a rather expensive way of working
around x86 Linux's fairly weak lazy mm handling.  It also abuses the
whole switch_mm state machine.  In particular, there's no guarantee
that the mm is actually lazy at the time.  The old code didn't care,
but the new code can oops.

The short-term fix is to just reorder the code in leave_mm() to avoid the OOPS.


fwiw the reason the code is in intel_idle is to avoid tlb flush IPIs to idle
cpus, once the cpu goes into a deep enough idle state.  In the current linux
code, that is done by no longer having the old TLB live on the CPU, by
switching to the neutral kernel-only set of tlbs.

If your proposed changes do that (avoid the IPI/wakeup), great!
(if not, there should be a way to do that)



Re: [patch 12/18] async: Adjust system_state checks

2017-05-14 Thread Arjan van de Ven

On 5/14/2017 11:27 AM, Thomas Gleixner wrote:

looks good .. ack



Re: [PATCH] use get_random_long for the per-task stack canary

2017-05-04 Thread Arjan van de Ven

On 5/4/2017 6:32 AM, Daniel Micay wrote:

The stack canary is an unsigned long and should be fully initialized to
random data rather than only 32 bits of random data.


that makes sense to me... ack



Re: [PATCH 5/6] notifiers: Use CHECK_DATA_CORRUPTION() on checks

2017-03-22 Thread Arjan van de Ven

On 3/22/2017 12:29 PM, Kees Cook wrote:

When performing notifier function pointer sanity checking, allow
CONFIG_BUG_ON_DATA_CORRUPTION to upgrade from a WARN to a BUG.
Additionally enables CONFIG_DEBUG_NOTIFIERS when selecting
CONFIG_BUG_ON_DATA_CORRUPTION.



Any feedback on this change? By default, this retains the existing
WARN behavior...


if you're upgrading, is the end point really a panic() ?
e.g. do you assume people to also set panic-on-oops?




Re: [PATCH 1/5] x86: Implement __WARN using UD0

2017-03-21 Thread Arjan van de Ven

On 3/21/2017 8:14 AM, Peter Zijlstra wrote:

For self-documentation purposes, maybe use a define for the length of
the ud0 instruction?


#define TWO 2

;-)


some things make sense as a define, others don't
(adding a comment, maybe)



Re: [PATCH] x86/dmi: Switch dmi_remap to ioremap_cache

2017-03-09 Thread Arjan van de Ven

On 3/9/2017 9:48 AM, Julian Brost wrote:


I'm not entirely sure whether it's actually the kernel or HP to blame,
but for now, hp-health is completely broken on 4.9 (probably on
everything starting from 4.6), so this patch should be reviewed again.


it looks like another kernel driver is doing a conflicting mapping.
do these HP tools come with their own kernel drivers or are those in the
upstream kernel nowadays?



Re: [PATCH] x86: Implement __WARN using UD0

2017-02-23 Thread Arjan van de Ven

On 2/23/2017 5:28 AM, Peter Zijlstra wrote:


By using "UD0" for WARNs we remove the function call and its possible
__FILE__ and __LINE__ immediate arguments from the instruction stream.

Total image size will not change much, what we win in the instruction
stream we'll loose because of the __bug_table entries. Still, saves on
I$ footprint and the total image size does go down a bit.


well I am a little sceptical; WARNs are rare so the code (other than the test)
should be waaay out of line already (unlikely() and co).
And I assume you're not removing the __FILE__ and __LINE__ info, since that info
is actually high value for us developers... so what are you actually saving?

(icache saving is only real if the line that the cold code lives on would
actually end up in icache for other reasons; I would hope the compiler puts
the out of line code WAY out of line)




Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location

2017-01-05 Thread Arjan van de Ven

On 1/5/2017 9:54 AM, Thomas Garnier wrote:



That's my goal too. I started by doing a RO remap and got couple
problems with hibernation. I can try again for the next iteration or
delay it for another patch. I also need to look at KVM GDT usage, I am
not familiar with it yet.


don't we write to the GDT as part of the TLS segment stuff for glibc ?



Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location

2017-01-05 Thread Arjan van de Ven

On 1/5/2017 8:40 AM, Thomas Garnier wrote:

Well, it happens only when KASLR memory randomization is enabled. Do
you think it should have a separate config option?


no I would want it a runtime option "sgdt from ring 3" is going away
with UMIP (and is already possibly gone in virtual machines, see
https://lwn.net/Articles/694385/) and for those cases it would be a shame
to lose the randomization



Re: [PATCH 1/3] cpuidle/menu: stop seeking deeper idle if current state is too deep

2017-01-05 Thread Arjan van de Ven

On 1/5/2017 7:43 AM, Rik van Riel wrote:

On Thu, 2017-01-05 at 23:29 +0800, Alex Shi wrote:

The obsolete commit 71abbbf85 want to introduce a dynamic cstates,
but it was removed for long time. Just left the nonsense deeper
cstate
checking.

Since all target_residency and exit_latency are going longer in
deeper
idle state, no needs to waste some cpu cycle on useless seeking.


Makes me wonder if it would be worth documenting the
requirement that c-states be listed in increasing
order?


or better, a boot time quick check...



Re: [RFC] x86/mm/KASLR: Remap GDTs at fixed location

2017-01-05 Thread Arjan van de Ven

On 1/5/2017 12:11 AM, Ingo Molnar wrote:


* Thomas Garnier  wrote:


Each processor holds a GDT in its per-cpu structure. The sgdt
instruction gives the base address of the current GDT. This address can
be used to bypass KASLR memory randomization. With another bug, an
attacker could target other per-cpu structures or deduce the base of the
main memory section (PAGE_OFFSET).

In this change, a space is reserved at the end of the memory range
available for KASLR memory randomization. The space is big enough to hold
the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
mapped at specific offset based on the target CPU. Note that if there is
not enough space available, the GDTs are not remapped.

The document was changed to mention GDT remapping for KASLR. This patch
also include dump page tables support.

This patch was tested on multiple hardware configurations and for
hibernation support.



 void kernel_randomize_memory(void);
+void kernel_randomize_smp(void);
+void* kaslr_get_gdt_remap(int cpu);


Yeah, no fundamental objections from me to the principle, but I get some bad
vibes from the naming here: seeing that kernel_randomize_smp() actually makes
things less random.



kernel_unrandomize_smp() ...

one request.. can we make sure this unrandomization is optional?



Re: [PATCH] proc: Fix timerslack_ns CAP_SYS_NICE check when adjusting self

2016-08-10 Thread Arjan van de Ven

On 8/10/2016 12:03 PM, John Stultz wrote:


I wasn't entierly sure. I didn't think PR_SET_TIMERSLACK has a
security hook, but looking again I now see the top-level
security_task_prctl() check, so maybe not skipping it in this case
would be good?


the easy fix would be to add back the ptrace check.. just either ptrace-able
OR CAP_SYS_NICE ;) then you can prove you only added new stuff as well, and
have all the LSM checks from before



Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface

2016-07-14 Thread Arjan van de Ven

On 7/14/2016 10:45 AM, Kees Cook wrote:

On Thu, Jul 14, 2016 at 9:09 AM, John Stultz  wrote:

On Thu, Jul 14, 2016 at 5:48 AM, Serge E. Hallyn  wrote:

Quoting Kees Cook (keesc...@chromium.org):

I think the original CAP_SYS_NICE should be fine. A malicious
CAP_SYS_NICE process can do plenty of insane things, I don't feel like
the timer slack adds to any realistic risks.


Can someone give a detailed explanation of what you could do with
the new timerslack feature and compare it to what you can do with
sys_nice?


Looking at the man page for CAP_SYS_NICE, it looks like such a task
can set a task as SCHED_FIFO, so they could fork some spinning
processes and set them all SCHED_FIFO 99, in effect delaying all other
tasks for an infinite amount of time.

So one might argue setting large timerslack vlaues isn't that
different risk wise?


Right -- you can hose a system with CAP_SYS_NICE already; I don't
think timerslack realistically changes that.


fair enough

the worry of being able to time-attack things is there already with
SCHED_FIFO, so... purist objection withdrawn in favor of the pragmatic


Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface

2016-07-14 Thread Arjan van de Ven

On 7/14/2016 5:48 AM, Serge E. Hallyn wrote:


Can someone give a detailed explanation of what you could do with
the new timerslack feature and compare it to what you can do with
sys_nice?



what you can do with the timerslack feature is add up to 4 seconds of extra
time/delay on top of each select()/poll()/nanosleep()/... (basically anything
that uses hrtimers on behalf of the user), and then also control within that
4 second window exactly when that extra delay ends (which may help a timing
attack kind of scenario)
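
For the per-task case, the existing prctl() interface shows the mechanics;
here is an illustrative userspace sketch (note the thread is about the
separate /proc/<pid>/timerslack_ns interface for *other* tasks, which is
where the capability question arises):

```c
/* Sketch: a task adjusting its own hrtimer slack via prctl().
 * Values are in nanoseconds; unlike the /proc interface under
 * discussion, a task needs no special capability to set its own. */
#include <sys/prctl.h>

int set_own_slack_ns(unsigned long ns)
{
    return prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0);
}

long get_own_slack_ns(void)
{
    return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```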




Re: [PATCH 2/2] proc: Add /proc//timerslack_ns interface

2016-07-13 Thread Arjan van de Ven

On 7/13/2016 8:39 PM, Kees Cook wrote:


So I worry I'm a bit stuck here. For general systems, CAP_SYS_NICE is
too low a level of privilege  to set a tasks timerslack, but
apparently CAP_SYS_PTRACE is too high a privilege for Android's
system_server to require just to set a tasks timerslack value.

So I wanted to ask again if we might consider backing this down to
CAP_SYS_NICE, or if we can instead introduce a new CAP_SYS_TIMERSLACK
or something to provide the needed in-between capability level.


Adding new capabilities appears to not really be viable (lots of
threads about this...)

I think the original CAP_SYS_NICE should be fine. A malicious
CAP_SYS_NICE process can do plenty of insane things, I don't feel like
the timer slack adds to any realistic risks.


if the result is really as bad as you describe, then that is worse than
the impact of this being CAP_SYS_NICE, and thus SYS_TRACE is maybe the
purist answer, but not the pragmatic best answer; certainly I don't want
to make the overall system security worse.

I wonder how much you want to set the slack; one of the options (and I don't
know how this will work in the code, if it's horrible don't do it) is to
limit how much slack CAP_SYS_NICE can set (say, 50 or 100 msec, e.g. in the
order of a "time slice" or two if Linux had time slices, similar to what
nice would do) while CAP_SYS_TRACE can set the full 4 seconds.
If it makes the code horrible, don't do it and just do SYS_NICE.






Re: [PATCH 1/8] x86: don't use module.h just for AUTHOR / LICENSE tags

2016-07-13 Thread Arjan van de Ven

On 7/13/2016 5:18 PM, Paul Gortmaker wrote:

The Kconfig controlling compilation of these files are:

arch/x86/Kconfig.debug:config DEBUG_RODATA_TEST
arch/x86/Kconfig.debug: bool "Testcase for the marking rodata read-only"

arch/x86/Kconfig.debug:config X86_PTDUMP_CORE
arch/x86/Kconfig.debug: def_bool n

...meaning that it currently is not being built as a module by anyone.

Lets remove the couple traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.

We delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.

Cc: Arjan van de Ven 


Acked-by: Arjan van de Ven 

originally these were tested as modules, but they really shouldn't be modules
in the normal kernel (and aren't per Kconfig)





Re: [patch V2 00/20] timer: Refactor the timer wheel

2016-06-26 Thread Arjan van de Ven
On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek  wrote:
>
> Umm. I'm not sure if you should be designing kernel...
>
> I have alarm clock application. It does sleep(60) many times till its
> time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> without some very, very good reason.

I'm fairly sure you shouldn't be designing alarm clock applications!
Because on busy systems you get random (scheduler) delays added to your timer.

Having said that, your example is completely crooked here, sleep()
does not use these kernel timers, it uses hrtimers instead.
(hrtimers also have slack, but an alarm clock application that is this
broken would have the choice to set such slack to 0)

What happened here is that the sigtimedwait timers were actually not great:
they are just about the only application visible interface that's still in
jiffies/HZ, and in the follow-on patch set Thomas converted them properly to
hrtimers as well, to make them both accurate and CONFIG_HZ independent.


Re: [patch V2 00/20] timer: Refactor the timer wheel

2016-06-20 Thread Arjan van de Ven
so is there really an issue? sounds like KISS principle can apply

On Mon, Jun 20, 2016 at 7:46 AM, Thomas Gleixner  wrote:
> On Mon, 20 Jun 2016, Arjan van de Ven wrote:
>> On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner  wrote:
>> >
>> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a 
>> > 1000HZ
>> >option so datacenter folks can use this and people who don't care and 
>> > want
>> >better batching for power can use the 4ms thingy.
>>
>>
>> if there really is one user of such long timers... could we possibly
>> make that one robust against early fire of the timer?
>>
>> eg rule is: if you set timers > 37 hours, you need to cope with early timer 
>> fire
>
> The only user I found is networking contrack (5 days). Eric thought its not a
> big problem if it fires earlier.
>
> Thanks,
>
> tglx
>


Re: [patch V2 00/20] timer: Refactor the timer wheel

2016-06-20 Thread Arjan van de Ven
On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner  wrote:
>
> 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a 1000HZ
>option so datacenter folks can use this and people who don't care and want
>better batching for power can use the 4ms thingy.


if there really is one user of such long timers... could we possibly
make that one robust against early fire of the timer?

eg rule is: if you set timers > 37 hours, you need to cope with early timer fire


Re: initialize a mutex into locked state?

2016-06-17 Thread Arjan van de Ven

On 6/17/2016 7:54 AM, Oleg Drokin wrote:


Yes, we can add all sorts of checks that have various impacts on code 
readability,
we can also move code around that also have code readability and CPU impact.

But in my discussion with Arjan he said this is a new use case that was not met 
before
and suggested to mail it to the list.


I'm all in favor of having "end code" be as clear as possible wrt intent.
(and I will admit this is an curious use case, but not an insane silly one)

one other option is to make a wrapper:

static inline void mutex_init_locked(struct mutex *m)
{
        mutex_init(m);
        mutex_trylock(m);
}

that way the wrapper can be an inline in a header, but doesn't need to touch
a wide berth of stuff... while keeping the end code clear wrt intent



Re: [patch V2 00/20] timer: Refactor the timer wheel

2016-06-17 Thread Arjan van de Ven
>To achieve this capacity with HZ=1000 without increasing the storage size
>by another level, we reduced the granularity of the first wheel level from
>1ms to 4ms. According to our data, there is no user which relies on that
>1ms granularity and 99% of those timers are canceled before expiry.


the only likely problem cases are msleep(1) uses... but we could just
map those to usleep_range(1000, 2000)

(imo we should anyway)


Re: [patch 13/20] timer: Switch to a non cascading wheel

2016-06-16 Thread Arjan van de Ven
I think there's 2 elements on the interface.

1) having a relative interface to the current time (avoid use of
absolute jiffies in drivers)

2) having wallclock units. Making HZ always be 1000 is effectively
doing that as well (1 msec after all)



On Thu, Jun 16, 2016 at 8:43 AM, Thomas Gleixner  wrote:
> On Wed, 15 Jun 2016, Thomas Gleixner wrote:
>> On Wed, 15 Jun 2016, Arjan van de Ven wrote:
>> > what would 1 more timer wheel do?
>>
>> Waste storage space and make the collection of expired timers more expensive.
>>
>> The selection of the timer wheel properties is combination of:
>>
>> 1) Granularity
>>
>> 2) Storage space
>>
>> 3) Number of levels to collect
>
> So I came up with a slightly different solution for this. The problem case is
> HZ=1000 and again looking at the data, there is no reason why we need actual
> 1ms granularity for timer wheel timers. That's independent of the desired ms
> based interfaces.
>
> We can simply run the wheel internaly with 4ms base level resolution and
> degrade from there. That gives us 6 days+ and a simple cutoff at the capacity
> of the 7th level wheel.
>
>  lvl  offset  granularity              range
>   0      0        4 ms                 0 ms -       255 ms
>   1     64       32 ms               256 ms -      2047 ms (256ms - ~2s)
>   2    128      256 ms              2048 ms -     16383 ms (~2s - ~16s)
>   3    192     2048 ms (~2s)       16384 ms -    131071 ms (~16s - ~2m)
>   4    256    16384 ms (~16s)     131072 ms -   1048575 ms (~2m - ~17m)
>   5    320   131072 ms (~2m)     1048576 ms -   8388607 ms (~17m - ~2h)
>   6    384  1048576 ms (~17m)    8388608 ms -  67108863 ms (~2h - ~18h)
>   7    448  8388608 ms (~2h)    67108864 ms - 536870911 ms (~18h - ~6d)
>
> That works really nice and has the interesting side effect that we batch in
> the first level wheel which helps networking. I'll repost the series with the
> other review points addressed later tonight.
>
> Btw, I also thought a bit more about the milliseconds interfaces. I think we
> shouldn't invent new interfaces. The correct solution IMHO is to distangle the
> scheduler tick frequency and jiffies. If we have that completely seperated
> then we can do the following:
>
> 1) Force HZ=1000. That means jiffies and timer wheel units are 1ms. If the
>tick frequency is != 1000 we simply increment jiffies in the tick by the
>proper amount (4 @250 ticks/sec, 10 @100 ticks/sec).
>
>So all msec_to_jiffies() invocations compile out into nothing magically and
>we can remove them gradually over time.
>
> 2) When we do that right, we can make the tick frequency a command line option
>and just have a compiled in default.
>
> Thoughts?
>
> Thanks,
>
> tglx
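
The level/granularity arithmetic behind the table above can be sketched as a
small userspace model (4 ms base, 64 buckets per level, each level 8x coarser;
illustrative only, not the kernel's actual code):

```c
/* Model of the degrading-resolution wheel described above:
 * level 0 has 64 buckets of 4 ms (256 ms total), and each further
 * level is 8x coarser, reaching ~6 days of capacity at level 7. */
long long wheel_granularity_ms(int level)
{
    long long g = 4;            /* 4 ms base resolution */
    while (level-- > 0)
        g *= 8;                 /* 8x coarser per level */
    return g;
}

int wheel_level(long long delta_ms)
{
    long long span = 4LL * 64;  /* level 0 covers 0..255 ms */
    int level;
    for (level = 0; level < 7; level++) {
        if (delta_ms < span)
            break;
        span *= 8;
    }
    return level;               /* level 7 ends at the ~6 day cutoff */
}
```

This is also where the batching side effect comes from: everything within a
level shares that level's bucket granularity.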


Re: [patch 13/20] timer: Switch to a non cascading wheel

2016-06-15 Thread Arjan van de Ven
what would 1 more timer wheel do?

On Wed, Jun 15, 2016 at 7:53 AM, Thomas Gleixner  wrote:
> On Tue, 14 Jun 2016, Eric Dumazet wrote:
>> Original TCP RFCs tell timeout is infinite ;)
>>
>> Practically, conntrack has a 5 days timeout, but I really doubt anyone
>> expects an idle TCP flow to stay 'alive' when nothing is sent for 5
>> days.
>
> So would 37hrs ~= 1.5 days be a reasonable cutoff or will stuff fall apart and
> people be surprised?
>
> Thanks,
>
> tglx
>


Re: [patch 13/20] timer: Switch to a non cascading wheel

2016-06-14 Thread Arjan van de Ven
evaluating a 120 hour timer every 37 hours to see if it should fire...
not too horrid.

On Tue, Jun 14, 2016 at 9:28 AM, Thomas Gleixner  wrote:
> On Tue, 14 Jun 2016, Ingo Molnar wrote:
>> * Thomas Gleixner  wrote:
>> > On Mon, 13 Jun 2016, Peter Zijlstra wrote:
>> > > On Mon, Jun 13, 2016 at 08:41:00AM -, Thomas Gleixner wrote:
>> > > > +
>> > > > +   /* Cascading, sigh... */
>> > >
>> > > So given that userspace has no influence on timer period; can't we
>> > > simply fail to support timers longer than 30 minutes?
>> > >
>> > > In anything really arming timers _that_ long?
>> >
>> > Unfortunately yes. Networking being one of those. Real cascading happens 
>> > once
>> > in a blue moon, but it happens.
>>
>> So I'd really prefer it if we added a few more levels, a hard limit and got 
>> rid of
>> the cascading once and for all!
>>
>> IMHO 'once in a blue moon' code is much worse than a bit more data overhead.
>
> I agree. If we add two wheel levels then we end up with:
>
>   HZ 1000:  134217727 ms ~=  37 hours
>   HZ  250:  536870908 ms ~= 149 hours
>   HZ  100: 1342177270 ms ~= 372 hours
>
> Looking through all my data I found exactly one timeout which is insanely
> large: 120 hours!
>
> That's net/netfilter/nf_conntrack_core.c:
>   setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
>
> Anything else is way below 37 hours.
>
> Thanks,
>
> tglx


Re: [patch 04/20] cpufreq/powernv: Initialize timer as pinned

2016-06-13 Thread Arjan van de Ven
On Mon, Jun 13, 2016 at 1:40 AM, Thomas Gleixner  wrote:
> mod_timer(&gpstates->timer, jiffies + msecs_to_jiffies(timer_interval));

are you sure this is right? the others did not get replaced by mod_timer()..
(and this is more evidence that a relative API in msecs is what
drivers really want)


Re: [patch 06/20] drivers/tty/metag_da: Initialize timer as pinned

2016-06-13 Thread Arjan van de Ven
I know it's not related to this patch, but it'd be nice to, as you're
changing the api name anyway, make a mod_pinned_relative() so that
more direct users of jiffies can go away...
or even better, mod_pinned_relative_ms() so that these drivers also do
not need to care about HZ.

On Mon, Jun 13, 2016 at 1:40 AM, Thomas Gleixner  wrote:
> Pinned timers must carry that attribute in the timer itself. No functional
> change.
>
> Signed-off-by: Thomas Gleixner 
> ---
>  drivers/tty/metag_da.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: b/drivers/tty/metag_da.c
> ===
> --- a/drivers/tty/metag_da.c
> +++ b/drivers/tty/metag_da.c
> @@ -323,12 +323,12 @@ static void dashtty_timer(unsigned long
> if (channel >= 0)
> fetch_data(channel);
>
> -   mod_timer_pinned(&poll_timer, jiffies + DA_TTY_POLL);
> +   mod_pinned(&poll_timer, jiffies + DA_TTY_POLL);
>  }
>
>  static void add_poll_timer(struct timer_list *poll_timer)
>  {
> -   setup_timer(poll_timer, dashtty_timer, 0);
> +   setup_pinned_timer(poll_timer, dashtty_timer, 0);
> poll_timer->expires = jiffies + DA_TTY_POLL;
>
> /*
>
>


Re: S3 resume regression [1cf4f629d9d2 ("cpu/hotplug: Move online calls to hotplugged cpu")]

2016-05-11 Thread Arjan van de Ven



Oh, and this was with acpi_idle. This machine already failed to
resume from S3 with intel_idle since forever, as detailed in
https://bugzilla.kernel.org/show_bug.cgi?id=107151
but acpi_idle worked fine until now.


can you disable (in sysfs) all C states other than C0/C1 and see if that makes 
it go away?
that would point at the problem pretty clearly...




Re: S3 resume regression [1cf4f629d9d2 ("cpu/hotplug: Move online calls to hotplugged cpu")]

2016-05-11 Thread Arjan van de Ven

On 5/11/2016 3:19 AM, Ville Syrjälä wrote:


Oh, and this was with acpi_idle. This machine already failed to
resume from S3 with intel_idle since forever, as detailed in
https://bugzilla.kernel.org/show_bug.cgi?id=107151
but acpi_idle worked fine until now.


this is the important clue part afaics.

some of these very old Atom's had issues (bios?) with S3 if the cores were in a 
too-deep C state,
and at some point there was a workaround (I forgot where in the code) to ban 
those deep
C states around S3 on those cpus. I wonder if moving things around has made 
said workaround
ineffective.



Re: [PATCH v5 3/9] x86/head: Move early exception panic code into early_fixup_exception

2016-04-04 Thread Arjan van de Ven

On 4/4/2016 8:32 AM, Andy Lutomirski wrote:


Adding locking would be easy enough, wouldn't it?

But do any platforms really boot a second CPU before switching to real
printk?  Given that I see all the smpboot stuff in dmesg, I guess real
printk happens first.  I admit I haven't actually checked.


adding locking also makes things more fragile in terms of getting the last 
thing out
before you go down in flaming death

until it's a proven problem, this early, get the message out at all is more 
important
than getting it out perfectly, sometimes.




Re: [PATCH] x86: Enable full randomization on i386 and X86_32.

2016-03-10 Thread Arjan van de Ven

Arjan, or other folks, can you remember why x86_32 disabled mmap
randomization here? There doesn't seem to be a good reason for it that
I see.


for unlimited stack it got really messy with threaded apps.

anyway, I don't mind seeing if this will indeed work, with time running out
where 32 bit is going extinct... in a few years we just won't have enough
testing on this kind of change anymore.




Re: [PATCH v2] arm64: add alignment fault hanling

2016-02-16 Thread Arjan van de Ven

On 2/16/2016 10:50 AM, Linus Torvalds wrote:

On Tue, Feb 16, 2016 at 9:04 AM, Will Deacon  wrote:

[replying to self and adding some x86 people]

Background: Euntaik reports a problem where userspace has ended up with
a memory page mapped adjacent to an MMIO page (e.g. from /dev/mem or a
PCI memory bar from someplace in /sys). strncpy_from_user happens with
the word-at-a-time implementation, and we end up reading into the MMIO
page.


how does this work if the adjacent page is not accessible?
or has some other magic fault handler, or is on an NFS filesystem where
the server is rebooting?

isn't the general rule for such basic functions "don't touch memory unless you KNOW 
it is there"




Of course, no actual real program will do that for mixing MMIO and
non-MMIO, and so we might obviously add code to always add a guard
page for the normal case when a specific address isn't asked for. So
as a heuristic to make sure it doesn't happen by mistake it possibly
makes sense.


but what happens to the read if the page isn't present?
or is execute-only or .. or ..




Re: [PATCH] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.

2016-02-05 Thread Arjan van de Ven

and most of the RT guys would only tolerate a little bit of it

is there any real/practical use of going longer than 4 seconds? if there
is then yeah fixing it makes sense.
if it's just theoretical... shrug... 32 bit systems have a bunch of
other limits/differences as well.


So I'd think it would be mostly theoretical, but in my testing on a
VM, setting the timerslack for bash to 10 secs made time sleep 1 take
~10.5 seconds. So its apparently not too hard to coalesce fairly far
out (I need to spend a bit more time to verify that events really
weren't happening during that time and we're not just doing
unnecessary delays with the extra slack).


99% sure you're hitting something else;
we look pretty much only 1 ahead in the queue for timers to run to see if
they can be run, once we hit a timer that's not ready yet we stop.
your 10 second ahead is behind a whole bunch of other not-ready ones
so won't even be looked at until it's close



But yea. My main concern is that if we do a consistent 64bit interface
for all arches in the /proc//timerslack_ns interface, it will
make PR_GET_TIMERSLACK return incorrect results on 32bit systems when
the slack is >= 2^32.


or we return UINT_MAX for that case. not too hard.



Re: [PATCH] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.

2016-02-05 Thread Arjan van de Ven

On 2/5/2016 4:51 PM, John Stultz wrote:

On Fri, Feb 5, 2016 at 2:35 PM, John Stultz  wrote:

On Fri, Feb 5, 2016 at 12:50 PM, Andrew Morton
 wrote:

On Fri, 5 Feb 2016 12:44:04 -0800 Kees Cook  wrote:

Could this be exposed as a writable /proc entry instead? Like the oom_* stuff?


/proc//timer_slack_ns, guarded by ptrace_may_access(), documented
under Documentation/?  Yup, that would work.  It's there for all
architectures from day one and there is precedent.  It's not as nice,
but /proc nasties will always be with us.


Ok. I'll start working on that.


Arjan/Thomas:  One curious thing I noticed here while writing some
documentation. The timer_slack_ns value in the task struct is a
unsigned long.

So this means PR_SET_TIMERSLACK limits the maximum slack on 32 bit
machines to ~4 seconds. Where on 64bit machines it can be quite a bit
longer (unreasonably long, really :).


originally when we created timerslack, 4 seconds was an eternity and good 
enough for everyone
by a mile... (assumption was practical upper limit being in the 15 msec range)
and most of the RT guys would only tolerate a little bit of it

is there any real/practical use of going longer than 4 seconds? if there
is then yeah fixing it makes sense.
if it's just theoretical... shrug... 32 bit systems have a bunch of
other limits/differences as well.



Re: [RFC][PATCH v2] prctl: Add PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.

2016-01-26 Thread Arjan van de Ven

On 1/25/2016 8:28 PM, John Stultz wrote:

From: Ruchi Kandoi 

This allows power/performance management software to set timer
slack for other threads according to its policy for the thread
(such as when the thread is designated foreground vs. background
activity)

Second argument is similar to PR_SET_TIMERSLACK, if non-zero
then the slack is set to that value otherwise sets it to the
default for the thread.

Takes PID of the thread as the third argument.

This interface checks that the calling task has permissions to
to use PTRACE_MODE_ATTACH on the target task, so that we can
ensure arbitrary apps do not change the timer slack for other
apps.


Acked-by: Arjan van de Ven 

only slight concern is the locking around the value of the field in the task 
struct,
but nobody does read-modify-write on it, so they'll get either the new or the 
old version,
which should be ok.

(until now only the local thread would touch the field, and if you're setting 
it, by definition
you're not going to sleep yet, so you're not using the field)




Re: 4.4-rc5: ugly warn on: 5 W+X pages found

2015-12-15 Thread Arjan van de Ven

On 12/14/2015 11:56 PM, Pavel Machek wrote:

On Mon 2015-12-14 13:24:08, Arjan van de Ven wrote:



That's weird.  The only API to do that seems to be manually setting
kmap_prot to _PAGE_KERNEL_EXEC, and nothing does that.  (Why is
kmap_prot a variable on x86 at all?  It has exactly one writer, and
that's the code that initializes it in the first place.  Shouldn't we
#define kmap_prot _PAGE_KERNEL?


iirc it changes based on runtime detection of NX capability


Huh. Is it possible that core duo is so old that it has no NX?


really stupid question I guess, but is PAE on ?
(64 bit pagetables are required for NX)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.4-rc5: ugly warn on: 5 W+X pages found

2015-12-14 Thread Arjan van de Ven



That's weird.  The only API to do that seems to be manually setting
kmap_prot to _PAGE_KERNEL_EXEC, and nothing does that.  (Why is
kmap_prot a variable on x86 at all?  It has exactly one writer, and
that's the code that initializes it in the first place.  Shouldn't we
#define kmap_prot _PAGE_KERNEL?


iirc it changes based on runtime detection of NX capability



Re: [PATCH 3/4] sched: introduce synchronized idle injection

2015-11-18 Thread Arjan van de Ven

On 11/18/2015 7:44 AM, Morten Rasmussen wrote:

I would not necessarily want to punish all cpus
system-wide if we have local overheating in one corner. If would rather
have it apply to only the overheating socket in a multi-socket machine
and only the big cores in a big.LITTLE system.


most of the time thermal issues aren't inside the SOC, but on a system level
due to cheap heat spreaders or outright lack of space due to thinness. But
even if you have one part of the die too hot:

For core level idle injection, no need to synchronize that; the reason to 
synchronize
is generally that when ALL cores are idle, additional power savings kick in
(like memory going to self refresh, fabrics power gating etc); those additional
power savings are what makes this more efficient than just voltage/frequency
scaling at the bottom of that range...   not so much the fact that things are 
just idle.





Re: [PATCH 3/4] sched: introduce synchronized idle injection

2015-11-18 Thread Arjan van de Ven

On 11/18/2015 12:36 AM, Ingo Molnar wrote:


What will such throttling do to latencies, as observed by user-space tasks? 
What's
the typical expected frequency of the throttling frequency that you are 
targeting?


for this to meaningfully reduce power consumption, deep system power states 
need to be reached,
and for those to pay off, several milliseconds of idle are required.

for hard realtime stuff that is obviously insane, but I would assume that for 
those cases
your system is thermally fine. This only kicks in at the end of a "we're in 
thermal
problems" path, which can happen both on clients (thin devices) as well as 
servers
(airconditioning issues). The objective for this is to kick in before the
hardware built-in protections kick in (which are power off/reboot depending on 
a bunch of things).
The frequency of how often these 5 msec get injected depends on how deep the 
system
is in trouble; and is zero if the system is not in trouble.

The idea is that for the user it is better to inject several 5 msec intervals 
than it is
to inject one longer period.


You can compare this method to other ways of reducing thermal issues (like 
lowering cpu frequency),
and in a typical setup, this is done after the more benign of those methods are 
exhausted.
Lowering the frequency even further is usually of low efficiency (you need to lower 
the frequency a LOT
to gain a little bit of power in the bottom parts of the frequency ranges), 
while this idle will
not only put the CPU in low power, but will also put the system memory in low 
power and usually
a big chunk of the rest of the SOC. In many client systems, memory power 
consumption is higher than CPU
power consumption (and in big servers, it's also quite sizable), so there is a 
pretty hard
limit of how much you can do on thermals if you're not also kicking some of the 
memory power savings.
This means that to achieve a certain amount of reduction, the performance is 
impacted a lot less than
the more drastic methods you would need on the cpu side, if possible at all.
(stepping your 2 GHz cpu down to 50 MHz may sound less evil than injecting 5 msec 
of idle time, but in reality
that is impacting user tasks a heck of a lot more than 5 msec of not being 
scheduled)




Re: [PATCH 2/4] timer: relax tick stop in idle entry

2015-11-16 Thread Arjan van de Ven

On 11/16/2015 6:53 PM, Paul E. McKenney wrote:

Fair point.  When in the five-jiffy throttling state, what can wake up
a CPU?  In an earlier version of this proposal, the answer was "nothing",
but maybe that has changed.


device interrupts are likely to wake the cpus.



Re: [PATCH 2/4] timer: relax tick stop in idle entry

2015-11-16 Thread Arjan van de Ven

On 11/16/2015 3:28 PM, Paul E. McKenney wrote:


Is this mostly an special-purpose embedded thing, or do you expect distros
to be enabling this?  If the former, I suggest CONFIG_RCU_NOCB_CPU_ALL,
but if distros are doing this for general-purpose workloads, I instead
suggest CONFIG_RCU_FAST_NO_HZ.


thermal overload happens a lot on small devices, but sadly also in big 
datacenters
where it is not uncommon to underprovision cooling capacity by a bit
(it's one of those "99% of the time you only need THIS much, the 1% you need 30% 
more"
and that more is expensive or even impractical)


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 7:32 AM, Jacob Pan wrote:

On Thu, 5 Nov 2015 15:33:32 +0100
Peter Zijlstra  wrote:


On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency,
esp. with forced idleness such as this. That's not going to be
popular for many workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low
frequency, reduced memory speeds, probably dimmed your screen etc
etc.


Just to clarify, the low frequency here is not necessarily the minimum
frequency. It is usually the Pe (max efficiency).


to translate that from Intelese to English:
The system already is at the lowest frequency that's relatively efficient. To 
go even lower in instant power
consumption (e.g. heat) by even a little bit, a LOT of frequency needs to be 
sacrificed.

Idle injection sucks. But it's more efficient (at the point that it would get 
used) than any other methods,
so it also sucks less than those other methods for the same amount of reduction 
in heat generation.
It only gets used if the system HAS to reduce the heat generation, either 
because it's a mobile device with
little cooling capacity, or because the airconditioning in your big datacenter 
is currently
not able to keep up.






Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 6:33 AM, Peter Zijlstra wrote:

On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low frequency,
reduced memory speeds, probably dimmed your screen etc etc.

at this point there are 3 choices
1) Shut off the device
2) do uncoordinated idle injection for 40% of the time
3) do coordinated idle injection for 5% of the time

as much as force injecting idle in a synchronized way sucks, the alternatives 
are worse.


OK, it wasn't put that way. I figured it was a way to use less power on
any workload with idle time on.


so idle injection (as with pretty much every thermal management feature) is NOT 
a way to save
on battery life. Every known method pretty much ends up sacrificing more in 
terms of performance
than you gain in instant power, so that over time you end up using more (drain 
battery basically).

idle injection, if synchronized, is one of the more effective ones, e.g. give 
up the least efficiency
compared to, say, unsynchronized or even inserting idle cycles in the CPU 
(T-states)...
not even speaking of just turning the system off.




That said; what kind of devices are we talking about here; mobile with
pittyful heat dissipation? Surely a well designed server or desktop
class system should never get into this situation in the first place.


a well designed server may not, but the datacenter it is in may.
for example if the AC goes out, but also, sometimes the datacenter's peak heat 
dissipation
can exceed the AC capacity (which is outside temperature dependent.. yay global 
warming),
which may require an urgent reduction over a series of machines for the 
duration of the peak load/peak temperature
(usually just inserting a little bit, say 1%, over all servers will do)



It just grates at me a bit that we have to touch hot paths for such
scenarios :/

well we have this as a driver right now that does not touch hot paths,
but it seems you and tglx also hate that approach with a passion






Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low frequency,
reduced memory speeds, probably dimmed your screen etc etc.

at this point there are 3 choices
1) Shut off the device
2) do uncoordinated idle injection for 40% of the time
3) do coordinated idle injection for 5% of the time

as much as force injecting idle in a synchronized way sucks, the alternatives 
are worse.








Re: [PATCH 3/3] cpuidle,menu: smooth out measured_us calculation

2015-11-04 Thread Arjan van de Ven

On 11/3/2015 2:34 PM, r...@redhat.com wrote:


Furthermore, for smaller sleep intervals, we know the chance that
all the cores in the package went to the same idle state is fairly
small. Dividing the measured_us by two, instead of subtracting the
full exit latency when hitting a small measured_us, will reduce the
error.


there is no perfect answer for this issue; but at least this makes the situation
a lot better, so

Acked-by: Arjan van de Ven 


Re: [PATCH 1/3] cpuidle,x86: increase forced cut-off for polling to 20us

2015-11-04 Thread Arjan van de Ven

Acked-by: Arjan van de Ven 



Re: [PATCH 2/3] cpuidle,menu: use interactivity_req to disable polling

2015-11-04 Thread Arjan van de Ven

On 11/3/2015 2:34 PM, r...@redhat.com wrote:

From: Rik van Riel 

The menu governor carefully figures out how much time we typically
sleep for an estimated sleep interval, or whether there is a repeating
pattern going on, and corrects that estimate for the CPU load.

Then it proceeds to ignore that information when determining whether
or not to consider polling. This is not a big deal on most x86 CPUs,
which have very low C1 latencies, and the patch should not have any
effect on those CPUs.

However, certain CPUs (eg. Atom) have much higher C1 latencies, and
it would be good to not waste performance and power on those CPUs if
we are expecting a very low wakeup latency.

Disable polling based on the estimated interactivity requirement, not
on the time to the next timer interrupt.


good catch!

Acked-by: Arjan van de Ven 



Re: [tip:x86/mm] x86/mm: Warn on W^X mappings

2015-10-08 Thread Arjan van de Ven

On 10/8/2015 7:57 AM, Borislav Petkov wrote:

+   pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n");

Do we really want to issue anything here in the success case? IMO, we
should be quiet if the check passes and only scream when something's
wrong...


I would like the success message to be there.
From an automated testing perspective (for the distro I work on for example),

"the test runs and it fails",
"the test runs and it passes" and
"the test has not run (because of a bug in the code or config file)"

are different outcomes, where the first and third are test failures,
but without the pr_info at info level, the 2nd and 3rd are indistinguishable.



Re: [tip:x86/mm] x86/mm: Warn on W^X mappings

2015-10-06 Thread Arjan van de Ven

On 10/6/2015 2:54 AM, tip-bot for Stephen Smalley wrote:

Commit-ID:  e1a58320a38dfa72be48a0f1a3a92273663ba6db
Gitweb: http://git.kernel.org/tip/e1a58320a38dfa72be48a0f1a3a92273663ba6db
Author: Stephen Smalley 
AuthorDate: Mon, 5 Oct 2015 12:55:20 -0400
Committer:  Ingo Molnar 
CommitDate: Tue, 6 Oct 2015 11:11:48 +0200

x86/mm: Warn on W^X mappings

Warn on any residual W+X mappings after setting NX
if DEBUG_WX is enabled.  Introduce a separate
X86_PTDUMP_CORE config that enables the code for
dumping the page tables without enabling the debugfs
interface, so that DEBUG_WX can be enabled without
exposing the debugfs interface.  Switch EFI_PGT_DUMP
to using X86_PTDUMP_CORE so that it also does not require
enabling the debugfs interface.


I like it, so Acked-by: Arjan van de Ven 

I also have/had an old userland script to do similar checks but using the 
debugfs interface...
... would that be useful to have somewhere more central?

http://git.fenrus.org/tmp/i386-check-pagetables.pl




Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Arjan van de Ven

On 9/21/2015 9:36 AM, Linus Torvalds wrote:

On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar  wrote:


Linus, what's your preference?


So quite frankly, is there any reason we don't just implement
native_read_msr() as just

unsigned long long native_read_msr(unsigned int msr)
{
   int err;
   unsigned long long val;

   val = native_read_msr_safe(msr, &err);
   WARN_ON_ONCE(err);
   return val;
}

Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
done with it. I don't see the downside.

How many msr reads are so critical that the function call
overhead would matter?


if anything qualifies it'd be switch_to() and friends.

note that I'm not entirely happy about the notion of "safe" MSRs.
They're safe as in "won't fault".
Reading random MSRs isn't a generic safe operation though, but the name sort of 
gives people
the impression that it is. Even with _safe variants, you still need to KNOW the 
MSR exists (by means
of CPUID or similar) unfortunately.




Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-17 Thread Arjan van de Ven

On 9/17/2015 8:29 AM, Paolo Bonzini wrote:



On 17/09/2015 17:27, Arjan van de Ven wrote:



( We should double check that rdmsr()/wrmsr() results are never left
uninitialized, but are set to zero or so, for cases where the
return code is not
checked. )


It sure looks like native_read_msr_safe doesn't clear the output if
the rdmsr fails.


I'd suggest to return some poison not just 0...


What about 0 + WARN?


why 0?

0xdeadbeef or any other pattern (even 0x3636363636) makes more sense (of course 
also WARN... but most folks don't read dmesg for WARNs)

(it's the same thing we do for list or slab poison stuff)



Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-17 Thread Arjan van de Ven



( We should double check that rdmsr()/wrmsr() results are never left
   uninitialized, but are set to zero or so, for cases where the return code is not
   checked. )


It sure looks like native_read_msr_safe doesn't clear the output if
the rdmsr fails.


I'd suggest to return some poison not just 0...
less likely to get interesting surprises that are insane hard to debug/diagnose





Re: V4.0.x fails to create /dev/rtc0 on Winbook TW100 when CONFIG_PINCTRL_BAYTRAIL is set, bisected to commit 7486341

2015-07-11 Thread Arjan van de Ven

On 7/11/2015 11:26 AM, Porteus Kiosk wrote:

Hello Arjan,

We need it for setting up the time in the hardware clock through the 'hwclock' 
command.

Thank you.



hmm thinking about it after coffee... there is an RTC that can be exposed to 
userspace.
hrmpf. Wonder why it's not there for you






Re: V4.0.x fails to create /dev/rtc0 on Winbook TW100 when CONFIG_PINCTRL_BAYTRAIL is set, bisected to commit 7486341

2015-07-11 Thread Arjan van de Ven

On 7/11/2015 11:21 AM, Arjan van de Ven wrote:

On 7/11/2015 10:59 AM, Larry Finger wrote:

On a Winbook TW100 BayTrail tablet, kernel 4.0 and later do not create 
/dev/rtc0 when CONFIG_PINCTRL_BAYTRAIL is set in the configuration. Removing 
this option from the
config creates a real-time clock; however, it is no longer possible to get the 
tablet to sleep using the power button. Only complete shutdown works.

This problem was bisected to the following commit:


in "hardware reduced mode" (e.g. tablets) on Baytrail the RTC is not actually 
enabled/initialized by the firmware; talking to it may appear to work but it's really not
a good idea (and breaks things likes suspend/resume etc).


(or in other words, many of the legacy PC things are not supposed to be there)

what did you want to use rtc0 for?


